December 29, 2008 9:40 am GMT
How to write a content scraper or feed aggregator for WordPress in 10 minutes with PHP and cURL
by Gary Illyes
A few days ago we published an article which sheds some light on the ethics of content scraping.
Content scraping, in short, means that a webmaster copies a third party’s content in an automated way. If you copy the whole article, that is content theft; if you republish only an excerpt of the article and link back to your source, you have created a service similar to Technorati.
So, what will you need for this script? Obviously, you will need PHP installed. cURL is also needed, as it’s generally faster than, for example, fopen(). The script parses the feeds with PHP’s SimpleXML extension, which is normally enabled by default. If you want to automate the publishing of the scraped content on your blog, then you need WordPress installed and configured to receive posts via e-mail. And that’s all.
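Before going further, it can help to confirm that the server has what the script relies on. The check below is only a sketch and not part of the original script: the helper name missingRequirements() is made up for illustration, and it assumes the script needs the cURL and SimpleXML extensions plus the mail() function.

```php
<?php
/* Hypothetical helper (not part of the original script): returns the names
   of the required pieces that are missing on this PHP installation. */
function missingRequirements(array $extensions, array $functions){
    $missing = array();
    foreach ($extensions as $ext){
        if (!extension_loaded($ext)){
            $missing[] = "extension: $ext";
        }
    }
    foreach ($functions as $fn){
        if (!function_exists($fn)){
            $missing[] = "function: $fn";
        }
    }
    return $missing;
}

/* Assumption: cURL fetches the feeds, SimpleXML parses them, mail() posts them. */
$missing = missingRequirements(array('curl', 'SimpleXML'), array('mail'));
if ($missing){
    echo "Missing: " . implode(', ', $missing) . "\n";
} else {
    echo "All requirements are available.\n";
}
```

If anything is reported as missing, it has to be installed or enabled before the scraper can run.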
In this post we will scrape the content using the third-party blogs’ RSS feeds. So, think about some feed URLs for your own script; we will use Technorati’s developer API because it’s easier for us.
Let’s see the script, step by step. First we create two arrays: one is empty by default and will hold the fetched RSS items, and the other holds the RSS URLs. The second array could also be loaded from a database; we hardcoded it in the source because it’s not likely we’ll ever modify the script again. Then we write a function that parses the RSS feed, as in its original format it’s pretty useless for us. So, here’s the script; if something is unclear, ask in the comments.
<?php
/* construct our arrays; the first is empty and will be filled by parseRSS() */
$rss_items = array();
/* the array below contains the URLs we'll grab the RSS from */
$RSS_URI = array(
    "http://api.technorati.com/search?key=YOUR-API-KEY&query=health&format=rss&language=en",
    "http://api.technorati.com/search?key=YOUR-API-KEY&query=medical&format=rss&language=en"
);
/* simple function to parse the XML (the RSS feed) and push its items onto the $rss_items array */
function parseRSS($xml){
    global $rss_items;
    $cnt = count($xml->channel->item);
    /* take at most six items per feed, and never more than the feed contains */
    $max = min($cnt, 6);
    for($i = 0; $i < $max; $i++){
        /* SimpleXML returns objects, so cast each field to a plain string */
        $url = (string) $xml->channel->item[$i]->link;
        $title = (string) $xml->channel->item[$i]->title;
        $desc = (string) $xml->channel->item[$i]->description;
        $poster = parse_url($url, PHP_URL_HOST);
        /* the excerpt, followed by a credit line naming the source host */
        $cont = $desc."[...] \r\n";
        $cont .= "More on $poster\r\n";
        $titem = array(
            "url" => $url,
            "title" => $title,
            "desc" => $cont,
            "auth" => $poster
        );
        $rss_items[] = $titem;
    }
}
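/* grab the content of every URL we specified in the $RSS_URI array */
foreach ($RSS_URI as $rss){
    $ch = curl_init($rss);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    $data = curl_exec($ch);
    curl_close($ch);
    /* skip this feed if the request failed or returned nothing */
    if($data === false || trim($data) === ''){
        continue;
    }
    /* SimpleXMLElement throws an Exception on malformed XML */
    try {
        $doc = new SimpleXMLElement($data, LIBXML_NOCDATA);
    } catch (Exception $e) {
        continue;
    }
    if(isset($doc->channel)){
        parseRSS($doc);
    }
}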
/* construct and send the mails to WordPress's post-by-email address */
$from = "YOUR-ADMIN-MAIL-ADDRESS";
$headers = "From: $from";
foreach($rss_items as $rss_item){
    /* a mail subject must be a single line, so collapse any whitespace runs */
    $subject = trim(preg_replace('/\s+/', ' ', (string) $rss_item['title']));
    $content = $rss_item['desc'];
    mail('WORDPRESS-MAIL-BY-POST-ADDRESS', $subject, $content, $headers);
}
/*contact our Wordpress engine to post the grabbed items*/
$ch = curl_init('http://YOUR-BLOG-URL/wp-mail.php');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
$posted = curl_exec($ch);
curl_close($ch);
Final step: save this file somewhere on your web server. Be sure to hide it well, because if you put it in a publicly accessible place and someone who wants to play around finds it, every request to the script triggers another run and your automated blog is filled with posts in no time.
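One way to keep strangers from triggering the script is to allow only command-line runs, or web requests that carry a secret. The following is a rough sketch under assumptions: the function name isAllowedRun(), the key query parameter and the CHANGE-ME secret are all made up for illustration, and none of them come from the original script.

```php
<?php
/* Hypothetical guard: allow command-line runs (for example from cron) and
   web requests that supply the right secret in a "key" query parameter. */
function isAllowedRun($sapi, array $query, $secret){
    if ($sapi === 'cli'){
        return true;
    }
    return isset($query['key'])
        && is_string($query['key'])
        && hash_equals($secret, $query['key']);
}

$secret = 'CHANGE-ME'; /* assumption: replace with a long random string */

if (!isAllowedRun(php_sapi_name(), $_GET, $secret)){
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
/* ... the scraper code would follow here ... */
```

With a guard like this at the top of the file, a browser would have to request the script with ?key= followed by the secret; any other web request is answered with a 403.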
Now let’s decode the script into a human-readable format. The script:
- Takes every URL from the $RSS_URI array, grabs each URL’s content and transforms it into one big multidimensional array
- Builds and sends an e-mail, using PHP’s native mail() function, for every item it grabbed and pushed into the $rss_items array; the e-mail goes to the address you specified as the post-by-email address
- Opens the http://YOUR-BLOG-URL/wp-mail.php address to make WordPress publish the e-mails
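To make the first step concrete, here is a small sketch of how a feed string turns into an array of items. It is not the script itself: the function name feedToItems(), the $limit parameter and the sample feed with its example.com URLs are invented for illustration, and the function returns the array instead of filling a global.

```php
<?php
/* Hypothetical helper: turn an RSS 2.0 string into an array of items,
   keeping at most $limit entries. Returns an empty array on bad XML. */
function feedToItems($rssString, $limit = 6){
    $items = array();
    /* keep libxml parse warnings out of the output */
    libxml_use_internal_errors(true);
    try {
        $xml = new SimpleXMLElement($rssString, LIBXML_NOCDATA);
    } catch (Exception $e) {
        return $items;
    }
    if (!isset($xml->channel)){
        return $items;
    }
    foreach ($xml->channel->item as $item){
        if (count($items) >= $limit){
            break;
        }
        $url = (string) $item->link;
        $host = parse_url($url, PHP_URL_HOST);
        $items[] = array(
            "url"   => $url,
            "title" => (string) $item->title,
            "desc"  => (string) $item->description . "[...] \r\nMore on $host\r\n",
            "auth"  => $host,
        );
    }
    return $items;
}

/* Made-up sample feed, only for demonstration. */
$sample = '<rss version="2.0"><channel><title>Sample</title>
<item><title>First post</title><link>http://blog.example.com/first</link>
<description><![CDATA[Hello <b>world</b>]]></description></item>
<item><title>Second post</title><link>http://news.example.com/second</link>
<description>Plain text</description></item>
</channel></rss>';

print_r(feedToItems($sample));
```

Each entry in the returned array has the same four keys the script uses (url, title, desc, auth), so it could be mailed in the same way.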
A nice addition to this content scraper on Unix-based systems is a cron job, which will automate the whole process, so that you don’t have to touch your WordPress for a very, very long time.
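As an illustration, a crontab entry along these lines would run the script once an hour. The schedule, the location of the php binary and the /path/to/scraper.php file name are placeholders assumed for this example; adjust them to your own server.

```
# minute hour day-of-month month day-of-week command
0 * * * * /usr/bin/php /path/to/scraper.php >/dev/null 2>&1
```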
If you don’t like the cron idea, you will have to access the script using your browser, pointing it to the address where you’ve saved the file.
Have questions? Ask below.

tony on Sat, 28th Mar 2009 9:34 am
You’re Guru man! is it possible to make same sample here of getting data from for an example some wordpress blogs, i would like to see what part of code you will change and how it works. Thanks super article and code!
Frank on Wed, 19th Aug 2009 9:31 am
Hi. This is just what I am after and I’ve tried many different things, but I’m still unable to get it working. Any help, or as Tony says, maybe a zipped example, would be fantastic!
The feed I am trying to make use of is here – http://bit.ly/3ko5JN
I presume the file should be saved as PHP too?
Cameron Kuc on Sun, 15th Nov 2009 5:10 pm
Lot of work for something Yahoo Pipes has been doing for years….