Web Scraping with PHP

Last updated 4 years, 4 months ago

Tags: PHP


Web scraping is used to get specific data from another website when there is no API available for it.

There are many techniques for web scraping. You can read more about them on Wikipedia.

In this post, we use document parsing: the HTML is converted into a DOM (Document Object Model) tree that we can then traverse.
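To make the idea of traversing a DOM concrete, here is a minimal sketch that parses a small HTML string and lists the child elements of its body. The sample markup and the helper name child_tag_names are made up for illustration and are not part of the original example.

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><h1>Title</h1><p>First</p><p>Second</p></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

/**
 * Hypothetical helper: return the tag names of a node's direct child elements.
 */
function child_tag_names(DOMNode $node): array
{
    $names = array();
    foreach ($node->childNodes as $child) {
        // Skip text nodes and comments, keep only elements
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $names[] = $child->nodeName;
        }
    }
    return $names;
}

$body = $dom->getElementsByTagName('body')->item(0);
$tags = child_tag_names($body);

echo implode(', ', $tags) . "\n"; // h1, p, p
```

Every node exposes childNodes, parentNode, and similar properties, so the same loop can be applied at any depth of the tree.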

The code below gets all the h1 tags from the URL "https://www.desipearl.com/".

<?php

//Get the html from the url
$html = file_get_contents('https://www.desipearl.com/');

$dom_object = new DOMDocument();

// Collect libxml parse errors internally instead of displaying them as warnings
libxml_use_internal_errors(TRUE); 

if(!empty($html)){

	$dom_object->loadHTML($html);
	
	// Clear libxml error buffer
	libxml_clear_errors(); 
	
	$xpath = new DOMXPath($dom_object);

	// Get all the h1's text
	$result = $xpath->query('//h1');

	if($result->length > 0){
		foreach($result as $row){
			echo $row->nodeValue . "<br/>";
		}
	}
}

?>

 

The output of the above code is:

Earn money from your website and social media pages easily with us!
What do we offer
Ad Formats
Reviews

In the above code, file_get_contents returns the page's HTML as a string (or false if the request fails). Then new DOMDocument() creates an empty document object, which holds the Document Object Model once the HTML has been loaded into it.
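Because file_get_contents returns false on failure, it can help to wrap the fetch in a small function that reports failure explicitly. The sketch below is one possible way to do that. The function name fetch_html, the timeout, and the User-Agent string are assumptions chosen for illustration, and the demo uses a data: URI so that it runs without network access.

```php
<?php
/**
 * Hypothetical helper: fetch a page and return its HTML, or null on failure.
 * The timeout and User-Agent values are illustrative assumptions.
 */
function fetch_html(string $url): ?string
{
    // These options only apply to http(s) URLs
    $context = stream_context_create(array(
        'http' => array(
            'timeout'    => 10,
            'user_agent' => 'ExampleScraper/1.0',
        ),
    ));

    // @ suppresses the PHP warning; failure is signalled by the false return value
    $html = @file_get_contents($url, false, $context);

    if ($html === false || trim($html) === '') {
        return null;
    }
    return $html;
}

// Demo with a data: URI instead of a live website
$demo_url = 'data://text/plain;base64,' . base64_encode('<h1>Hello</h1>');
$html = fetch_html($demo_url);

echo ($html === null) ? "Could not fetch the page.\n" : $html . "\n";
```

Passing 'https://www.desipearl.com/' instead of the demo URI would fetch the page used in this post.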

After that, libxml_use_internal_errors(TRUE) tells libxml to collect parse errors internally instead of displaying them as PHP warnings. The if(!empty($html)) check makes sure some HTML was actually returned, and loadHTML($html) then parses that HTML into the document.

The libxml_clear_errors() function clears the buffered errors that may be produced by style attributes embedded in elements, invalid attributes, and invalid elements that are not part of the HTML specification for the doctype.
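If you want to see what was collected before clearing it, libxml_get_errors() returns the buffered errors as an array of LibXMLError objects. The sketch below uses made-up markup containing a non-standard foo element. Which messages appear, if any, depends on the libxml version installed.

```php
<?php
// Hypothetical markup: <foo> is not a standard HTML element.
$html = '<html><body><foo>bar</foo><p>text</p></body></html>';

libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Inspect the buffered errors (the array may be empty on some libxml versions)
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'Line ' . $error->line . ': ' . trim($error->message) . "\n";
}

// Empty the buffer so old errors do not pile up across documents
libxml_clear_errors();
$remaining = count(libxml_get_errors());

echo "Errors left in buffer: " . $remaining . "\n";
```

Clearing the buffer matters most in long-running scripts that parse many pages, since the errors otherwise accumulate in memory.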

Next, $xpath is a new instance of DOMXPath, created from $dom_object, which allows us to run queries against the DOM document. The query $result = $xpath->query('//h1') selects all the h1 tags. The "//" before h1 makes the location unspecific: the element is matched wherever it appears in the document. The PHP manual covers query and DOMXPath in more detail.
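The effect of "//" is easiest to see next to a relative query. DOMXPath::query accepts a context node as its second argument. A path without a leading slash is then evaluated relative to that node, while a path starting with "//" still searches the whole document. The markup below is made up for illustration.

```php
<?php
// Hypothetical markup with one h1 in each of two divs
$html = '<html><body>'
      . '<div class="a"><h1>Inside A</h1></div>'
      . '<div class="b"><h1>Inside B</h1></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// "//h1" matches every h1 in the document
$all = $xpath->query('//h1');

// Pick the second div to use as a context node
$div_b = $xpath->query('//div[@class = "b"]')->item(0);

// "h1" with a context node matches only h1 children of that node
$relative = $xpath->query('h1', $div_b);

// "//h1" ignores the context node and searches from the document root again
$still_all = $xpath->query('//h1', $div_b);

echo $all->length . "\n";                 // 2
echo $relative->length . "\n";            // 1
echo $relative->item(0)->nodeValue . "\n"; // Inside B
echo $still_all->length . "\n";           // 2
```

To search all descendants of a context node rather than only its direct children, start the path with ".//", as in './/h1'.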

Finally, we check the length of the result and loop through the rows to echo each h1's text. The nodeValue property contains the text inside the selected h1.
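Two details about nodeValue are worth knowing. For an element, it is the combined text of all nested elements, and HTML entities in it are already decoded. A small sketch with made-up markup shows both, and also escapes the text again before echoing it into an HTML page, which is a precaution added here rather than something the code above does.

```php
<?php
// Hypothetical markup: an entity and a nested element inside the h1
$html = '<html><body><h1>Tom &amp; <em>Jerry</em></h1></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$h1 = $dom->getElementsByTagName('h1')->item(0);

// Text of the h1 and of the nested <em>, with &amp; decoded to &
$text = $h1->nodeValue;

// Escape the scraped text again before printing it as part of an HTML page
$safe = htmlspecialchars($text);

echo $safe . "<br/>"; // Tom &amp; Jerry<br/>
```

Because nested text is concatenated without separators, a number held in its own child element ends up fused with the text that follows it.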

 


 

Another Example

The code below gets the heading and the list contents from the "section-block-home" div of the same URL.

<?php
$content = array();
$data = array(); // Defined up front so print_r() below works even if nothing is found

//Get the html from the url
$html = file_get_contents('https://www.desipearl.com/');

$dom_object = new DOMDocument();

// Collect libxml parse errors internally instead of displaying them as warnings
libxml_use_internal_errors(TRUE); 

if(!empty($html)){

	$dom_object->loadHTML($html);
	
	// Clear libxml error buffer
	libxml_clear_errors(); 
	
	$xpath = new DOMXPath($dom_object);

	// Get all contents from section block home
	$result = $xpath->query('//div[@class = "section-block-home"]');

	if($result->length > 0){
		
		foreach($result as $row){
			
			// Get the heading of the block (empty string if it has no h1)
			$heading_node = $xpath->query('h1', $row)->item(0);
			$heading = $heading_node ? $heading_node->nodeValue : '';
			
			// Get block contents
			$lists = $xpath->query('div/ul/li[@class = "reveal-animate"]', $row);
			
			// Loop through the contents
			foreach($lists as $list)
			{
				// Store in an array
				$content[] = $list->nodeValue;
			}
			
			$data = array(
				'heading' => $heading,
				'desc' => $content
			);
			
		}
	}
	
}

echo "<pre>";
print_r($data);
echo "</pre>";

?>

 

The output of the above code is:

Array
(
    [heading] => Earn money from your website and social media pages easily with us!
										
    [desc] => Array
        (
            [0] => 01Desipearl is an online media content platform that is renowned for its native ads,
                   specifically designed and developed for Indian websites.
												
            [1] => 02We deliver new content that is refreshed every couple of minutes in 8 different 
                   languages based on the search history and online behaviour of the readers.
												
            [2] => 03Our customized widgets give you higher CTRs and monetary returns.
												
        )

)
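The example above keeps a single $data array, which is enough when the page has one matching block. As a rough sketch of how the same queries could be applied when several blocks match, the hypothetical function extract_blocks below returns one entry per block and resets the list for each of them. The function name and the sample markup are assumptions made for illustration. Only the two XPath expressions come from the code above.

```php
<?php
/**
 * Hypothetical helper (not part of the original example): return one
 * array('heading' => ..., 'desc' => array(...)) entry per matching block.
 */
function extract_blocks(string $html): array
{
    $blocks = array();
    if (empty($html)) {
        return $blocks;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    foreach ($xpath->query('//div[@class = "section-block-home"]') as $row) {
        $heading_node = $xpath->query('h1', $row)->item(0);

        // Start a fresh list for every block
        $desc = array();
        foreach ($xpath->query('div/ul/li[@class = "reveal-animate"]', $row) as $item) {
            $desc[] = trim($item->nodeValue);
        }

        $blocks[] = array(
            'heading' => $heading_node ? trim($heading_node->nodeValue) : '',
            'desc'    => $desc,
        );
    }
    return $blocks;
}

// Made-up markup that mimics the structure the queries expect
$sample = '<html><body>'
    . '<div class="section-block-home"><h1>First</h1>'
    . '<div><ul><li class="reveal-animate">A</li><li class="reveal-animate">B</li></ul></div></div>'
    . '<div class="section-block-home"><h1>Second</h1>'
    . '<div><ul><li class="reveal-animate">C</li></ul></div></div>'
    . '</body></html>';

$blocks = extract_blocks($sample);
print_r($blocks);
```

Note that @class = "section-block-home" matches only when the class attribute is exactly that string, so an element carrying additional classes would not be selected by this query.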