Managing My Impression

Dr. Timothy D. Bowman, Associate Professor, Dominican University

Web Scraping Using PHP and jQuery

I was asked by a friend to write code that would scrape the letters from a DLP website for use in an academic study (the website’s copyright allows for the non-commercial use of the data). I’d not tried this before and was excited by the challenge, especially considering I’m becoming more involved in “big data” studies and need to understand how one might go about developing web scraping programs. I started with the programming languages I know best: PHP & jQuery. And yes, I know that there are better programming languages available for web scraping. I’ve used Perl, Python, Java, and other languages in the past, but I’m currently much more versed in PHP than anything else! If I had been unable to quickly build this in PHP, then of course I’d have turned to Python or Perl; but in the end I was able to write some code, and it worked. I’m happy with the results, and so was my friend.

First, I had to figure out what PHP had under the hood that would allow me to load URLs and retrieve information. After some searching via Google, I decided the best option was the cURL library (http://php.net/manual/en/book.curl.php). The cURL library lets you connect to a variety of servers and protocols and was perfect for my needs; just don’t forget to check your PHP install to confirm the cURL extension is installed and activated (a quick check appears after the function below). A quick search on cURL and PHP led me to http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/, where I found a custom function that I thought I could edit to suit my needs:

// FUNCTION TO GET DATA USING cURL //
// Based on an example from digimantra.com (http://www.digimantra.com/technology/php/get-data-from-a-url-using-curl-php/)
function get_data($url) {
	$ch = curl_init();
	$timeout = 5;
	$userAgent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)'; // tell them we're Mozilla
	curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
	curl_setopt($ch, CURLOPT_FAILONERROR, true);
	curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
	curl_setopt($ch, CURLOPT_AUTOREFERER, true);
	$data = curl_exec($ch);
	curl_close($ch);
	return $data; // return the page contents as a string (or FALSE on failure)
}
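
Before relying on get_data(), it’s worth confirming that the cURL extension is actually loaded, and remembering that with CURLOPT_FAILONERROR set the function returns FALSE when a request fails. Here’s a minimal sketch of that check and a call against a made-up URL (the real DLP search URL isn’t shown here):

// quick sanity check: is the cURL extension available in this PHP install?
if (!extension_loaded('curl')) {
	die('The cURL extension is not installed or enabled.');
}

// hypothetical results-page URL -- substitute the real DLP search URL
$str = get_data('http://example.org/letters/search?start=0&rows=50');
if ($str === false) {
	die('cURL request failed.');
}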

Next I needed a way to grab specific DOM elements from the pages being scraped; I needed to find a <span> tag with a specific attribute whose value contained both a function name and a URL. I’m very familiar with the jQuery and CSS3 selector syntax that lets you find specific DOM elements using patterns, and lo and behold, I discovered that someone had developed a PHP class to do similar things, named “simplehtmldom” (http://sourceforge.net/projects/simplehtmldom/). I downloaded simplehtmldom from SourceForge, read the documentation, and wrote code that would find my elements and return the URLs I needed:

// simplehtmldom commands
// $page = str_get_html($str);
// $page->find();
// $element->onclick;
include_once('simple_html_dom.php'); // load the simplehtmldom library downloaded from SourceForge

$links = array(); // will hold the URLs pulled out of the onclick attributes
$page = str_get_html($str); // get HTML from string returned from cURL

foreach($page->find('span[onclick^=displayPopup]') as $element) { // find DOM elements with an "onclick" attribute whose value starts with "displayPopup"
	$value = $element->onclick; // get the value of the attribute
	$string = str_replace("'", "|", $value); // for easier matching, substitute single quotes with pipes
	if(preg_match('/\|(.*)\|/', $string, $matches) === 1) { // pipes must be escaped, since | is a special character in regex
		$links[] = $matches[1]; // add the URL to the array
	}
}
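
For reference, the markup being targeted looked roughly like the made-up example below; the function name matches the selector above, but the URL and link text are invented to illustrate what the selector and regex pull out:

// hypothetical example of the kind of <span> the selector above matches
$str = '<span onclick="displayPopup(\'http://example.org/letters/00123.html\')">View letter</span>';
$page = str_get_html($str);
$value = $page->find('span[onclick^=displayPopup]', 0)->onclick; // "displayPopup('http://example.org/letters/00123.html')"
// after the str_replace() and preg_match() above, $links[] would contain
// "http://example.org/letters/00123.html"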

Now I have an array containing the actual URLs whose data I want to copy. I need to loop through the $links array and use cURL once again to get the data. While looping through the array, I need to check whether each URL points to an HTML file or a PDF file (my only two options in this case). If it’s an HTML file, I use the get_data() function to grab the content and PHP’s file commands to write/create a file in a local directory to store the data. If it’s a PDF, I need different cURL commands to grab the data and create a PDF file locally.

foreach($links as $k=>$v) {
	$k = $start + $k;		
	if(substr($v, -3) != "pdf") { // if not PDF then create an HTML file in our web directory
		$str = get_data($v);
		$page= str_get_html($str); // we can use ->plaintext; if we don't want HTML tags
		$fh = fopen('./data/'.$k.'_record.html', 'w') or die("Can't create file!"); // create or overwrite
		fwrite($fh, $page);
		fclose($fh);
	} else {  // otherwise get the PDF data and write it to a local PDF file in our web directory
		$filename = substr(strrchr($v, "/"), 1);
		$pdf = './data/'.$k.'_'.$filename;
		$ch = curl_init($v);
		curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
		$data = curl_exec($ch);
		curl_close($ch);
		file_put_contents($pdf, $data); // create or overwrite
	}
}
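
One small thing the loop above assumes is that the local ./data directory already exists; if it doesn’t, every fopen() call will fail. A quick guard placed before the loop takes care of that; something like:

// make sure the local ./data directory exists before writing files into it
if (!is_dir('./data')) {
	mkdir('./data', 0755, true) or die("Can't create data directory!");
}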

That’s it for the scraping engine!

Now we need a way to pass start and end values (in increments of 50, maxing out at 4,000) to the PHP scraping engine. I know there are many ways to tackle this, and I specifically considered executing the code from a terminal, in a cron job, or from a browser. I again went with my strengths and chose to use an AJAX call via jQuery. I created another file and included the most recent jQuery library. I then created a recursive jQuery function that makes an AJAX POST call to the PHP engine, pauses for 5 seconds, and then does it again. The function accepts four parameters: url, start, increment, and end (the PHP side that receives these values is sketched after the code).

	
// jQuery function to make a recurring AJAX call to the PHP script
// it waits until the call is complete,
// then waits another 5 seconds and calls itself
function make_call($url, $start, $increment, $theEnd){
	var $newStart = $start + $increment; // increment the starting record number

	// display where we are at in the browser
	$('#log_complete').append('<li>Scraping '+$start+' through '+$newStart+' records.</li>');

	var feedback = $.ajax({
		type: "POST",
		url: "webScraperEngine.php", // PHP page
		data: { url: $url, start: $start, increment: $increment, theEnd: $theEnd }, // send the current values
		async: false
	}).complete(function(){
		if($newStart > $theEnd) {
			$('div.feedback-box-complete').html('No longer running AJAX Calls'); // for display purposes
			return false; // exit after we've reached the end (total number from the original search)
		}
		setTimeout(function(){ make_call($url, $newStart, $increment, $theEnd); }, 5000); // recursive call after waiting 5 seconds
	}).responseText;

	$('div.feedback-box-complete').html('Running AJAX Calls'); // for display purposes
}
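
On the PHP side, webScraperEngine.php just needs to read those POSTed values and run the scraping code shown earlier against the right slice of results. A rough sketch, assuming the search-results page takes the start value as a query-string parameter (the parameter name here is invented):

// webScraperEngine.php -- receive the values POSTed by the jQuery driver
$url       = $_POST['url'];
$start     = (int) $_POST['start'];
$increment = (int) $_POST['increment'];
$theEnd    = (int) $_POST['theEnd'];

// fetch this batch of search results; "&start=" is a hypothetical query-string format
$str = get_data($url . '&start=' . $start);

// ...then run the simplehtmldom loop and the HTML/PDF file-writing loop shown above...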

Putting this all together, we have a basic web scraper that does a satisfactory job of iterating through search results, grabbing copies of HTML and PDF files, and storing them locally. I was excited to get it finished using my familiar PHP and jQuery, and it was a nice exercise to think the problem through logically. Again, I’m SURE there are better, more efficient ways of doing this… but I’m happy and my friend is happy.

Fun times.