Scraping Forms with DOM and PHP for Posting with CURL

September 1st, 2009 by admin

I posted in the past about scraping forms so you can make a POST with CURL, along with a little bit of code. I’ve since moved on to using DOM and the little function I wrote below. I finally got fed up with sites adding new hidden fields, and with having to write a preg_match for every unique variable that a lot of sites stuff into their forms these days to stop spammers and/or track sessions.

If you don’t already have it installed, you need the php-xml package, which provides the DOM extension. Depending on your setup, ‘yum install php-xml’ should do the trick.

$html is the HTML source of the page, fetched with file_get_contents, CURL or whatever you use.

$form_number is the zero-based index of the form you want, i.e. the form’s position on the page minus 1. It’s usually OK to leave this at 0, but some pages have more than one form, so you have to specify which.
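If you are not sure which index to pass, a quick sketch like the one below lists every form on a page with its index and action. The markup is made up for illustration, and it uses DOMDocument directly rather than anything from inputs.php.

```php
<?php
// Made-up page with two forms, purely for illustration.
$html = '<html><body>'
      . '<form action="/search"><input name="q" value=""></form>'
      . '<form action="/login"><input name="user" value="">'
      . '<input type="hidden" name="token" value="abc123"></form>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html); // @ hides the warnings libxml raises on sloppy HTML

$forms = $dom->getElementsByTagName('form');

// Collect and print index => action so you can pick the right $form_number.
$actions = array();
for ($i = 0; $i < $forms->length; $i++) {
    $actions[$i] = $forms->item($i)->getAttribute('action');
    echo $i . ' => ' . $actions[$i] . "\n";
}
```

Here the login form is the second form on the page, so you would pass 1 as $form_number.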

The postData array is returned ready for posting with CURL. All you need to do is find the few fields you actually need to fill in and update them in the postData array. It’s usually as easy as $postData['email'] = 'myemail@emailme.com';
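As a rough sketch of that last step, the snippet below overrides one field and URL-encodes the array for posting. The array stands in for what getInputs would return, and the field names, the postForm helper and the example.com URL are all assumptions for illustration, not part of inputs.php.

```php
<?php
// Stand-in for what getInputs() would return; the field names and values are made up.
$postData = array(
    'email'  => '',
    'token'  => 'abc123',  // e.g. a hidden field scraped from the form
    'submit' => 'Sign up',
);

// Overwrite only the fields you need to fill in yourself.
$postData['email'] = 'myemail@emailme.com';

// URL-encode the array into a POST body.
$body = http_build_query($postData);

// Hypothetical helper: POSTs $body to $url with the curl extension and returns the response.
function postForm($url, $body) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

// $response = postForm('http://example.com/signup', $body); // placeholder URL, not run here
```

Passing an already-encoded string to CURLOPT_POSTFIELDS sends a normal urlencoded form post; passing the raw array instead makes curl send multipart/form-data, which some sites reject.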

Updated Sept. 1st, 2009. This is my inputs.php. Some input fields I’ve found have unique text names to make posting data more annoying. To get around this quickly I built a few functions that find those names from known attributes of the field, such as the input’s width, text size, etc. It’s similar to what iMacros does.

<?php
function getInputs($html, $form_number = 0) {
 
	$postData = array (); // initialise so a form with no named fields still returns an array
	$dom = new DOMDocument ( );
	@$dom->loadHTML ( $html ); //@ is there cuz this will throw up a bunch of errors if the html code isn't perfect
 
 
	$forms = $dom->getElementsByTagName ( "form" );
 
	$form = $forms->item ( $form_number );
 
	// Gets input areas and also checks to make sure the form exists
 
 
	if ($form) {
		$inputs = $form->getElementsByTagName ( "input" );
	} else {
		echo "Form does not exist! Line: " . __LINE__ . "\n";
		return '';
	}
 
	foreach ( $inputs as $input ) {
 
		$attval = $input->getAttribute ( 'name' );
 
		if (! empty ( $attval ))
			$postData [$attval] = $input->getAttribute ( 'value' );
 
	}
 
	// Gets textareas
 
 
	$inputs = $form->getElementsByTagName ( "textarea" );
 
	foreach ( $inputs as $input ) {
 
		$attval = $input->getAttribute ( 'name' );
 
		if (! empty ( $attval ))
			$postData [$attval] = $input->nodeValue;
 
	}
 
	// Gets buttons
 
 
	$inputs = $form->getElementsByTagName ( "button" );
 
	foreach ( $inputs as $input ) {
 
		$attval = $input->getAttribute ( 'name' ); // key on the button's name, not its value
 
		if (! empty ( $attval ))
			$postData [$attval] = $input->getAttribute ( 'value' );
 
	}
 
	// Gets selects
 
 
	$inputs = $form->getElementsByTagName ( "select" );
 
	foreach ( $inputs as $input ) {
		$attval = $input->getAttribute ( 'name' );
 
		if (! empty ( $attval ))
			$postData [$attval] = '';
 
	}
 
	return $postData;
 
}
 
function getForms($html) {
 
	$dom = new DOMDocument ( );
	@$dom->loadHTML ( $html );
	$xpath = new DOMXPath ( $dom );
 
	$forms = $xpath->evaluate ( "/html/body//form" );
 
	$returnform = array ();
 
	for($i = 0; $i < $forms->length; $i ++) {
		$form = $forms->item ( $i );
		$returnform [] = $form->getAttribute ( 'action' );
	}
 
	return $returnform;
 
}
 
// Finds unique variables by finding other variables within the form
 
 
function findUniqueInput($html, $variables, $form_number = 0) {
 
	$dom = new DOMDocument ( );
	@$dom->loadHTML ( $html ); //@ is there cuz this will throw up a bunch of errors if the html code isn't perfect
 
 
	$forms = $dom->getElementsByTagName ( "form" );
 
	$form = $forms->item ( $form_number );
 
	// Gets input areas and also checks to make sure the form exists
 
 
	if ($form) {
		$inputs = $form->getElementsByTagName ( "input" );
	} else {
		echo "Form does not exist! Line: " . __LINE__ . "\n";
		return '';
	}
 
	$keys = array_keys ( $variables );
 
	foreach ( $inputs as $input ) {
 
		$good = true;

		// every attribute listed in $variables must match this input
		foreach ( $keys as $key ) {
			if ($input->getAttribute ( $key ) != $variables [$key]) {
				$good = false;
				break;
			}
		}

		if ($good === true) {
			echo $input->getAttribute ( 'name' ) . "\n";
			echo "Found Input!" . "\n";
			return $input->getAttribute ( 'name' );
		}
 
	}
 
}
 
function findUniqueTextArea($html, $variables, $form_number = 0) {
 
	$dom = new DOMDocument ( );
	@$dom->loadHTML ( $html ); //@ is there cuz this will throw up a bunch of errors if the html code isn't perfect
 
 
	$forms = $dom->getElementsByTagName ( "form" );
 
	$form = $forms->item ( $form_number );
 
	// Gets textareas and also checks to make sure the form exists
 
 
	if ($form) {
		$inputs = $form->getElementsByTagName ( "textarea" );
	} else {
		echo "Form does not exist! Line: " . __LINE__ . "\n";
		return '';
	}
 
	$keys = array_keys ( $variables );
 
	foreach ( $inputs as $input ) {
 
		$good = true;

		// every attribute listed in $variables must match this textarea
		foreach ( $keys as $key ) {
			if ($input->getAttribute ( $key ) != $variables [$key]) {
				$good = false;
				break;
			}
		}

		if ($good === true) {
			echo $input->getAttribute ( 'name' ) . "\n";
			echo "Found TextArea!" . "\n";
			return $input->getAttribute ( 'name' );
		}
 
	}
 
}
 
?>
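
getInputs above leaves every select empty and picks up every radio and checkbox whether or not it is checked. One way you might collect the default choices instead is sketched below. collectChoiceFields and optionValue are made-up names, not part of inputs.php, and the rules only approximate what a browser submits.

```php
<?php
// Hypothetical helper: an option's value attribute, or its text when it has none.
function optionValue($option) {
    return $option->hasAttribute('value') ? $option->getAttribute('value') : trim($option->textContent);
}

// Hypothetical helper: default values for radios, checkboxes and selects in one form.
function collectChoiceFields($html, $form_number = 0) {
    $data = array();
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ hides libxml warnings on sloppy HTML
    $form = $dom->getElementsByTagName('form')->item($form_number);
    if (!$form) {
        return $data;
    }

    // Radios and checkboxes: only the ones marked "checked" are posted.
    foreach ($form->getElementsByTagName('input') as $input) {
        $type = strtolower($input->getAttribute('type'));
        $name = $input->getAttribute('name');
        if ($name === '' || ($type !== 'radio' && $type !== 'checkbox')) {
            continue;
        }
        if ($input->hasAttribute('checked')) {
            // a checked box with no value attribute is posted as "on"
            $data[$name] = $input->hasAttribute('value') ? $input->getAttribute('value') : 'on';
        }
    }

    // Selects: use the option marked "selected", else fall back to the first option.
    foreach ($form->getElementsByTagName('select') as $select) {
        $name = $select->getAttribute('name');
        if ($name === '') {
            continue;
        }
        $options = $select->getElementsByTagName('option');
        $chosen = array();
        foreach ($options as $option) {
            if ($option->hasAttribute('selected')) {
                $chosen[] = optionValue($option);
            }
        }
        if ($select->hasAttribute('multiple')) {
            $data[$name] = $chosen; // array of every selected value, possibly empty
        } elseif (count($chosen) > 0) {
            $data[$name] = end($chosen); // last selected option
        } elseif ($options->length > 0) {
            $data[$name] = optionValue($options->item(0));
        }
    }

    return $data;
}
```

You could merge the result over the array from getInputs, for example with array_merge, after dropping the unchecked radios and checkboxes it picked up. Multiple selects come back as arrays, so they need a little care when you build the post string.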

Example getting a unique input name.

$variables = array ("maxlength" => '10', "size" => '10', "tabindex" => '1' ); // the known attributes of the text input whose name you want to retrieve
$uniqueName = findUniqueInput ( $text, $variables ); // returns that input's name attribute

Example using getInputs.

$text = $c->getFile ( "http://www.wordpress.com" ); // integrates with my curl class (you can replace this with file_get_contents or something similar depending on what you use)
$postData = getInputs ( $text ); // gets the inputs from the html returned
$postData ['login'] = $login; // sets the unique variables needed for the post
$postData ['password'] = $password;

$c->getFile ( "http://www.wordpress.com", $postData ); // posts our data to wordpress. P.S. this is just an example and won't actually work with wordpress.

3 comments

  1. hammad says:

    That’s great. How do I get other types: radios, checkboxes, selects and multiple selects?

  2. Shaun says:

    “I haven’t RUN across needing it for anything else yet…”
    Seriously, don’t be a moron.

  3. admin says:

    My bad, “Ran” corrected :)
