I've posted before about scraping forms, along with a little bit of code, for when you need to make a POST using cURL. I've recently moved on to using the DOM and the little function I wrote below. I finally got fed up with sites adding new hidden fields, and with having to write preg_match calls for the unique variables that a lot of sites stuff into their forms these days to stop spammers and/or track sessions.
If you don't already have it installed, you need the php-xml extension. Depending on your setup, 'yum install php-xml' should do the trick.
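If you're not sure whether the extension is there, here's a quick check you could run first. This is just a sketch: the function name domExtensionAvailable is made up for this example, while extension_loaded and class_exists are standard PHP.

```php
<?php
// Hypothetical helper: returns true when PHP's DOM extension is loaded
// and the DOMDocument class can be used.
function domExtensionAvailable() {
    return extension_loaded('dom') && class_exists('DOMDocument');
}

if (!domExtensionAvailable()) {
    echo "DOM extension is missing, try installing php-xml\n";
}
```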
$html is the HTML source of the page, fetched with file_get_contents, cURL or whatever.
$form_number is the zero-based index of the form you want (its position on the page minus 1). It's usually OK to leave this at 0, but sometimes a site has more than one form on the page, so you have to specify.
The postData array is returned ready for posting with cURL. All you need to do is find the few fields you actually need to set and update them in the postData array. It's usually as easy as $postData['email'] = 'myemail@emailme.com';
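To make that last step concrete, here's a rough sketch of overriding a couple of fields and posting the result with PHP's cURL functions. The field names, the token value and the postForm helper are all made up for illustration; only http_build_query and the curl_* calls are standard PHP.

```php
<?php
// Hypothetical array shaped like what getInputs() would return for a login form.
$postData = array(
    'csrf_token' => 'a1b2c3', // hidden field copied straight from the form
    'email'      => '',
    'password'   => '',
);

// Overwrite only the fields you actually need to control.
$postData['email']    = 'myemail@emailme.com';
$postData['password'] = 'secret';

// Encode the array as an application/x-www-form-urlencoded body.
$body = http_build_query($postData);
// $body is now "csrf_token=a1b2c3&email=myemail%40emailme.com&password=secret"

// Hypothetical helper: POSTs an already encoded body and returns the response.
function postForm($url, $body) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the page instead of printing it
    return curl_exec($ch);
}
```

Calling postForm with the form's action URL and $body sends the request. Passing the array itself to CURLOPT_POSTFIELDS also works, but cURL then sends it as multipart/form-data, which some sites don't expect.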
Updated Sept. 1st, 2009. This is my inputs.php. Some of the input fields I've come across have unique names, which makes posting data more annoying. To get around this quickly I built a few functions that pull those names as well, given known attributes such as the input's width, text size, etc. It's similar to what iMacros does.
```php
<?php

// Collects the fields of one form into an array that is ready to post with cURL.
function getInputs($html, $form_number = 0) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ is there because loadHTML throws a bunch of warnings if the HTML isn't perfect

    $forms = $dom->getElementsByTagName("form");
    $form = $forms->item($form_number);

    // Gets input areas and also checks to make sure the form exists
    if ($form) {
        $inputs = $form->getElementsByTagName("input");
    } else {
        echo "Form does not exist! Line: " . __LINE__ . "\n";
        return '';
    }

    // Initialised up front so a form without named fields still returns an array
    $postData = array();

    foreach ($inputs as $input) {
        $attval = $input->getAttribute('name');
        if (!empty($attval)) {
            $postData[$attval] = $input->getAttribute('value');
        }
    }

    // Gets textareas
    $inputs = $form->getElementsByTagName("textarea");
    foreach ($inputs as $input) {
        $attval = $input->getAttribute('name');
        if (!empty($attval)) {
            $postData[$attval] = $input->nodeValue;
        }
    }

    // Gets buttons (keyed by name; the original keyed them by value, which was a bug)
    $inputs = $form->getElementsByTagName("button");
    foreach ($inputs as $input) {
        $attval = $input->getAttribute('name');
        if (!empty($attval)) {
            $postData[$attval] = $input->getAttribute('value');
        }
    }

    // Gets selects (only the name is recorded; the value is left empty for you to fill in)
    $inputs = $form->getElementsByTagName("select");
    foreach ($inputs as $input) {
        $attval = $input->getAttribute('name');
        if (!empty($attval)) {
            $postData[$attval] = '';
        }
    }

    return $postData;
}

// Returns the action attribute of every form on the page
function getForms($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html);

    $xpath = new DOMXPath($dom);
    $forms = $xpath->evaluate("/html/body//form");

    $returnform = array();
    for ($i = 0; $i < $forms->length; $i++) {
        $form = $forms->item($i);
        $returnform[] = $form->getAttribute('action');
    }

    return $returnform;
}

// Finds unique variables by finding other variables within the form
function findUniqueInput($html, $variables, $form_number = 0) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ is there because loadHTML throws a bunch of warnings if the HTML isn't perfect

    $forms = $dom->getElementsByTagName("form");
    $form = $forms->item($form_number);

    // Gets input areas and also checks to make sure the form exists
    if ($form) {
        $inputs = $form->getElementsByTagName("input");
    } else {
        echo "Form does not exist! Line: " . __LINE__ . "\n";
        return '';
    }

    $keys = array_keys($variables);

    foreach ($inputs as $input) {
        // An input matches only if every known attribute has the expected value
        $good = true;
        foreach ($keys as $key) {
            if ($input->getAttribute($key) != $variables[$key]) {
                $good = false;
            }
        }

        if ($good === true) {
            echo $input->getAttribute('name') . "\n";
            echo "Found Input!" . "\n";
            return $input->getAttribute('name');
        }
    }

    return null; // nothing matched
}

// Same as findUniqueInput, but searches the form's textareas
function findUniqueTextArea($html, $variables, $form_number = 0) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ is there because loadHTML throws a bunch of warnings if the HTML isn't perfect

    $forms = $dom->getElementsByTagName("form");
    $form = $forms->item($form_number);

    // Gets textareas and also checks to make sure the form exists
    if ($form) {
        $inputs = $form->getElementsByTagName("textarea");
    } else {
        echo "Form does not exist! Line: " . __LINE__ . "\n";
        return '';
    }

    $keys = array_keys($variables);

    foreach ($inputs as $input) {
        // A textarea matches only if every known attribute has the expected value
        $good = true;
        foreach ($keys as $key) {
            if ($input->getAttribute($key) != $variables[$key]) {
                $good = false;
            }
        }

        if ($good === true) {
            echo $input->getAttribute('name') . "\n";
            echo "Found TextArea!" . "\n";
            return $input->getAttribute('name');
        }
    }

    return null; // nothing matched
}

?>
```
Example getting a unique input name.
```php
// Known attributes of the text input whose name you want to retrieve
$variables = array("maxlength" => '10', "size" => '10', "tabindex" => '1');
$uniqueName = findUniqueInput($text, $variables);
```
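If you'd rather not loop over the attributes yourself, the same lookup can be written as a single XPath query. This is only a sketch of an alternative, not what findUniqueInput does internally: the function name findInputNameByAttributes and the sample HTML are made up, it searches every form on the page rather than one form number, and it assumes the attribute values contain no single quotes.

```php
<?php
// Hypothetical XPath-based variant: returns the name of the first input whose
// attributes all match, or null when nothing matches.
function findInputNameByAttributes($html, array $attributes) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from imperfect HTML

    // Build a query such as //form//input[@maxlength='10'][@size='10']
    $query = '//form//input';
    foreach ($attributes as $attribute => $value) {
        // Assumption: $value has no single quotes, so no escaping is done here
        $query .= "[@" . $attribute . "='" . $value . "']";
    }

    $xpath = new DOMXPath($dom);
    $matches = $xpath->query($query);

    return $matches->length > 0 ? $matches->item(0)->getAttribute('name') : null;
}

// Made-up sample form with one randomly named input
$sampleHtml = '<form>'
    . '<input type="text" name="q_8f3a1c" maxlength="10" size="10" tabindex="1">'
    . '<input type="text" name="other" maxlength="50" size="10" tabindex="2">'
    . '</form>';

$uniqueName = findInputNameByAttributes(
    $sampleHtml,
    array('maxlength' => '10', 'size' => '10', 'tabindex' => '1')
);
// $uniqueName is "q_8f3a1c"
```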
function getInputs example
```php
// Integrates with my curl class (you can replace this with file_get_contents
// or something similar depending on what you use)
$text = $c->getFile("http://www.wordpress.com");

// Gets the inputs from the HTML returned
$postData = getInputs($text);

// Sets the unique variables needed for the post
$postData['login'] = $login;
$postData['password'] = $password;

// Posts our data to WordPress. P.S. this is just an example and won't
// actually work with WordPress.
$c->getFile("http://www.wordpress.com", $postData);
```
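getInputs leaves every select as an empty string and copies radio buttons and checkboxes regardless of whether they're checked. Here's a rough sketch of how those could be handled more like a browser would. The function getChoiceFields, the helper optionValue and the sample form are all made up for this example, and the behaviour is my approximation, not a full implementation of the HTML form submission rules.

```php
<?php
// Hypothetical helper: an option's value attribute, or its text when it has none.
function optionValue($option) {
    return $option->hasAttribute('value')
        ? $option->getAttribute('value')
        : trim($option->textContent);
}

// Hypothetical companion to getInputs(): collects checked radios and
// checkboxes plus the selected options of each select.
function getChoiceFields($html, $form_number = 0) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings from imperfect HTML
    $form = $dom->getElementsByTagName('form')->item($form_number);

    $postData = array();
    if (!$form) {
        return $postData;
    }

    foreach ($form->getElementsByTagName('input') as $input) {
        $type = strtolower($input->getAttribute('type'));
        $name = $input->getAttribute('name');
        if ($name === '' || ($type !== 'radio' && $type !== 'checkbox')) {
            continue;
        }
        // Unchecked radios and checkboxes are not submitted at all
        if ($input->hasAttribute('checked')) {
            // Browsers send "on" when a checked box has no value attribute
            $postData[$name] = $input->hasAttribute('value') ? $input->getAttribute('value') : 'on';
        }
    }

    foreach ($form->getElementsByTagName('select') as $select) {
        $name = $select->getAttribute('name');
        if ($name === '') {
            continue;
        }
        $options = $select->getElementsByTagName('option');
        $chosen = array();
        foreach ($options as $option) {
            if ($option->hasAttribute('selected')) {
                $chosen[] = optionValue($option);
            }
        }
        if ($select->hasAttribute('multiple')) {
            // Drop a trailing [] so http_build_query can add the brackets itself
            $postData[preg_replace('/\[\]$/', '', $name)] = $chosen;
        } else {
            // Simplification: first selected option, else the first option, else ''
            if (count($chosen) === 0 && $options->length > 0) {
                $chosen[] = optionValue($options->item(0));
            }
            $postData[$name] = count($chosen) > 0 ? $chosen[0] : '';
        }
    }

    return $postData;
}

// Made-up sample form
$sampleForm = '<form>'
    . '<input type="radio" name="plan" value="free">'
    . '<input type="radio" name="plan" value="pro" checked>'
    . '<input type="checkbox" name="news" checked>'
    . '<input type="checkbox" name="spam" value="1">'
    . '<select name="country"><option value="us">US</option><option value="ca" selected>Canada</option></select>'
    . '<select name="tags[]" multiple><option selected>php</option><option>dom</option><option selected>curl</option></select>'
    . '</form>';

$choices = getChoiceFields($sampleForm);
// $choices holds plan => "pro", news => "on", country => "ca", tags => array("php", "curl")
```

You could merge this into the array from getInputs with array_merge($postData, $choices), so the checked and selected values overwrite the raw ones.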
That's great. How do I get other types: radios, checkboxes, selects and multiple selects?
“I haven’t RUN across needing it for anything else yet…”
Seriously, don’t be a moron.
My bad, “Ran” corrected