Posted by brant : 2008-07-23 at 9:26 pm

Most web developers have to parse HTML in some form or another. I happen to be a PHP developer, and I often find myself scraping content off a website for various reasons. I have come across my fair share of situations that cost me hours. So here is my list of tips and bits of code that can help:

1) Look for an RSS feed or API first. Many people just jump right into scraping content only to realize that the website offers an RSS feed or some type of API to achieve the same thing. Pay attention and do research before wasting hours of time. In most cases there is more than just one website that offers similar information, so check the competition too.
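If a feed does exist, reading it is usually far less work than scraping. Here is a minimal sketch, assuming a standard RSS 2.0 layout (rss > channel > item) and PHP's SimpleXML extension. The function name rss_titles is made up for this example:

```php
<?php
// Hypothetical helper: return the title of every <item> in an RSS 2.0 document.
function rss_titles(string $xml): array
{
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        return []; // not well-formed XML
    }

    $titles = [];
    foreach ($feed->channel->item as $item) {
        $titles[] = (string) $item->title;
    }
    return $titles;
}
```

Pass it the body of the feed once you have downloaded it. Other fields such as link or description can be read the same way.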

2) There might be an easy way out. Some websites will create RSS feeds for sites that don’t have one. Keep in mind they have refresh limits, even if you pay for their premium services. However, if parsing isn’t your thing, then try out feedyes.com and ponyfish.com. I am not saying those are the best, but they are a nice place to start.

3) Use cURL. cURL is a library that makes fetching pages much easier when you need HTML to parse. It can modify the headers to spoof Googlebot, fake the referrer, report how long a request took, return errors easily, and much more. Take a minute and install it, because down the road you’ll wish you had.
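To give a rough idea of what that looks like, here is a sketch of a small fetch wrapper. The function name fetch_page, the default user agent string, and the array keys are my own choices for this example (the 'FILE' key just matches the $web_page['FILE'] variable used in tip 5). The curl_* calls and constants are the standard ones from PHP's cURL extension:

```php
<?php
// Hypothetical wrapper around PHP's cURL extension.
function fetch_page(
    string $url,
    string $userAgent = 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // made-up default
    string $referer = ''
): array {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // give up after 30 seconds
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    if ($referer !== '') {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }

    $body = curl_exec($ch); // false on failure

    return [
        'FILE'   => $body === false ? '' : $body,
        'ERROR'  => curl_error($ch),                        // empty string when nothing went wrong
        'TIME'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME), // seconds for the whole transfer
        'STATUS' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
    ];
}

// Usage (needs network access):
// $web_page = fetch_page('http://www.example.com/');
```

Check 'ERROR' and 'STATUS' before trusting 'FILE'. A failed request otherwise looks just like an empty page.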

4) Strip unnecessary whitespace before scraping. This may not make sense at first, but it should eventually. HTML works the same whether it’s nice and neat or all on one big line. Let’s say I am trying to parse out all the data from a particular div or table. If there are carriage returns in there, they can break simpler regular expressions, because by default the dot in a pattern does not match a newline. Not to mention that if I plan on entering this into a database, I’d be stripping out the new lines, carriage returns, and tabs anyway. Here is a very basic, yet useful function to strip all unnecessary whitespace:

// Remove every character listed in $charlist (tabs, newlines, and carriage returns by default).
function trimall($str, $charlist = "\t\n\r")
{
  return str_replace(str_split($charlist), '', $str);
}
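One caveat: removing a newline outright can fuse two words together, so trimall("foo\nbar") gives "foobar". If that matters for your data, a possible alternative is to collapse each run of whitespace into a single space instead. This is only a sketch, and the name collapse_whitespace is made up for this example:

```php
<?php
// Hypothetical alternative to trimall(): collapse each run of whitespace
// (spaces, tabs, newlines, carriage returns) into one space.
function collapse_whitespace(string $str): string
{
    return trim(preg_replace('/\s+/', ' ', $str));
}
```

The markup still ends up on one line, so simple patterns keep working, but the words stay separated.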

5) Pattern matching made easy. Anyone planning to scrape HTML is most likely going to be parsing tables or divs. Once all the unnecessary whitespace is removed, this call will cover most basic needs. The one issue with this regular expression is that it cannot handle nested tags properly (i.e., a div inside another div). If you can provide a regex that does, then feel free to post it.

  // Capture the contents of each div; swap in "table" to match tables instead.
  preg_match_all('%<div[^>]*>(.*?)</div>%s', $web_page['FILE'], $match);
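For the nested tag problem, one way around it is to skip regex for that part and let PHP's DOM extension do the parsing. This is a different technique from the one above, and only a minimal sketch. The function name div_contents is made up for this example:

```php
<?php
// Hypothetical helper: return the text of every <div>, nested ones included.
function div_contents(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid, so collect libxml warnings quietly
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        $results[] = trim($div->textContent);
    }
    return $results;
}
```

Each div is returned in document order, and an outer div includes the text of the divs inside it. Note that loadHTML assumes ISO-8859-1 when the page does not declare a charset, so non-ASCII text may need converting first.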

6) Clean up that text for RSS feeds. Finally, after all that hard work, I often realize that the text I entered is not all clean UTF-8. That is a big uh-oh and will royally screw up RSS feeds. I’ve looked for hours for a way to make sure the text wouldn’t break the rules for displaying an RSS feed. Finally, I came across a regular expression that will come in handy for developers sooner or later. I would explain the character classes and stuff, but truthfully I don’t even understand them all myself. Dontcha love honesty?

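// Remove characters in the Unicode "Other" category (\p{C}: control, format,
// unassigned, and so on), except tab, carriage return, and newline.
// The /u flag needs valid UTF-8 input; otherwise preg_replace returns null.
function CleanRSS($data)
{
  return preg_replace('/[^\P{C}\t\r\n]/u', '', $data);
}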
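Since the /u flag makes preg_replace fail on bytes that are not valid UTF-8, it may help to convert the text before cleaning it. Here is a sketch, assuming the stray text is ISO-8859-1 (a guess you would need to check against the actual source site) and that the iconv extension is available. The name to_utf8 is made up for this example:

```php
<?php
// Hypothetical helper: leave valid UTF-8 alone, convert anything else.
// $from is an assumption about the source encoding, not something detected.
function to_utf8(string $data, string $from = 'ISO-8859-1'): string
{
    // An empty pattern with the /u flag matches only if the subject is valid UTF-8.
    if (preg_match('//u', $data) === 1) {
        return $data;
    }
    $converted = iconv($from, 'UTF-8', $data);
    return $converted === false ? '' : $converted;
}
```

Run the text through this first and then through CleanRSS, so the cleanup step always receives valid UTF-8.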

I fully expect some 10-year PHP vets to look at this list of tips and shake their heads in disgust. I have only been doing this for 2 years and am far from perfect. I would love to hear any comments, corrections, or criticism you might have for me. Hopefully I have helped a few people along the way as well.
