Friday, July 12, 2013

PHP Simple HTML DOM Parser

I used to be a staff member on playstationtrophies.org, helping out with all kinds of things there.  One thing I did during my time there was create a new BBCode Guide Template, using a Simple HTML DOM Parser for PHP.

You can find it here:

http://www.vyrastas.com/guide_template.htm

BBCode

First, a little background.  On playstationtrophies.org you post trophy guides in the forums.  To format text, include images, etc., you use BBCode.  It's like HTML in that you surround text with tags to provide functionality, except the tags use square brackets.  It's a standard for some forums, like vBulletin-based ones.
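To give a feel for what BBCode looks like, here's a minimal sketch of a helper that wraps text in a tag.  The function name bbcode_tag and the example strings are my own for illustration; they aren't part of the guide template or of any library.

```php
<?php
// Hypothetical helper: wrap $text in a BBCode tag, optionally with a value,
// e.g. [URL=http://www.example.com]text[/URL].
function bbcode_tag(string $tag, string $text, ?string $value = null): string
{
    $open = ($value === null)
        ? '[' . $tag . ']'
        : '[' . $tag . '=' . $value . ']';

    return $open . $text . '[/' . $tag . ']';
}

// Bold text, and a link with a value on the opening tag.
echo bbcode_tag('B', 'Platinum Trophy'), "\n";
echo bbcode_tag('URL', 'Trophy list', 'http://www.example.com'), "\n";
```

The forum software turns those bracketed tags into the matching HTML when the post is displayed.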

Since they require guides to be posted in a specific format, the user has to take data from a game's trophy page (like this) and turn that into the proper format on the forums (like this), as shown below:


This usually takes some work since you can't simply copy and paste from the trophy page to the forums - the formatting is different and depending on your browser it will copy differently.

Templates

Enter the template.  What it does is generate the exact BBCode you need to have the forum guide in the proper format, based on the trophy list for the game you choose.  All the user does is fill in their descriptions, etc.
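As a rough sketch of the idea (not the actual template code), the snippet below takes a list of trophies and prints a BBCode skeleton with a placeholder for each description.  The trophy data, the array keys and the exact tag layout are all made up for illustration; the real forum format is more involved.

```php
<?php
// Hypothetical trophy data; the real template reads this from the trophy page.
$trophies = [
    ['name' => 'First Steps', 'description' => 'Complete the tutorial.'],
    ['name' => 'Collector',   'description' => 'Find all hidden items.'],
];

// Build one BBCode entry per trophy, leaving a spot for the guide text.
function guide_skeleton(array $trophies): string
{
    $out = '';
    foreach ($trophies as $trophy) {
        $out .= '[B]' . $trophy['name'] . '[/B]' . "\n";
        $out .= '[I]' . $trophy['description'] . '[/I]' . "\n";
        $out .= "(your description here)\n\n";
    }
    return $out;
}

echo guide_skeleton($trophies);
```

The guide writer then only has to replace each placeholder with their own text.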

Now, this wasn't entirely my idea.  A different user had posted a template first, but it was taken down.  Since we had become quite dependent on it, and I was already versed in PHP and HTML, I decided to make a new version.

Simple HTML DOM Parser

The solution is to use a really easy PHP library called Simple HTML DOM Parser.  Essentially it takes the HTML structure of a page and lets you traverse it in code, extracting info that you can then output from PHP in a different format.  Simply include the PHP file in your code and you're good to go.  There's a FAQ / Manual here that explains the most common functionality, and you can also read through the code itself to figure it out.

So to use it, in the PHP section of your HTML, include the PHP file/library:
   include('simple_html_dom.php');

Then pull the webpage you desire into a variable:
   $html = file_get_html('http://www.urlhere.com');

From there you can traverse the HTML DOM (Document Object Model) in a variety of ways.  The most useful for me was something like:
   foreach($html->find('tr') as $x) { 
      //code here to handle every <tr> tag found
   }

Another way to move from a specific variable/tag (like the $x above) is to use the first_child() function:
   $e = $x->first_child();

This returns the first child tag of the element (<tr> in our case).  So this would most likely give you the first <td> tag (since <tr> is part of a table).  For the guide template, I used if statements on these child tags inside a foreach loop.

You can check the properties of $e (each <td> tag) by doing things like:
   if ($e->class == 'linkT') {
      // the class of the <td> tag is "linkT", so do something with it
      echo $e->plaintext;  // prints any text found within that <td> tag
   }

So for the BBCode template, it's a mix of the data from the DOM of the page we're traversing and the tags you need for that specific data.
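Here's a small end-to-end sketch of that mix.  To keep it runnable without downloading anything, it uses PHP's built-in DOM extension (DOMDocument) rather than Simple HTML DOM Parser, so the calls differ from the snippets above even though the idea is the same.  The HTML string is a made-up stand-in for a trophy page, and titles_to_bbcode is a hypothetical name; only the "linkT" class comes from the example above.

```php
<?php
// Stand-in HTML; the rows are invented, only the "linkT" class is from the post.
$page = '<table>'
    . '<tr><td class="linkT">First Steps</td><td>Complete the tutorial.</td></tr>'
    . '<tr><td class="other">ignored</td></tr>'
    . '</table>';

// Walk every <tr>, and wrap the text of each <td class="linkT"> in [B] tags.
function titles_to_bbcode(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $lines = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        foreach ($row->childNodes as $cell) {
            if ($cell instanceof DOMElement
                && $cell->tagName === 'td'
                && $cell->getAttribute('class') === 'linkT') {
                $lines[] = '[B]' . trim($cell->textContent) . '[/B]';
            }
        }
    }
    return $lines;
}

echo implode("\n", titles_to_bbcode($page)), "\n";
```

With Simple HTML DOM Parser the loop body would use find(), first_child() and plaintext as shown earlier; the part that matters is pairing each extracted value with the BBCode tags the forum expects.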

Inspecting Elements

Of course, to do this you need to be familiar with a page's structure in order to traverse it properly and get the data you want.  For that I use the Inspect Element feature in Google Chrome.  I use Chrome for pretty much everything now; it's by far the best browser.  Anyway, just right-click on any webpage and select the "Inspect Element" option.

In the window that comes up, you can expand sections and mouse over them, and it will highlight that particular spot on the web page itself.  It's extremely handy when working with a tool like Simple HTML DOM Parser.  Here's an example for the trophy page I mentioned earlier:


I dig down to the content I need, look at the HTML tags available, and figure out how to identify them.  For playstationtrophies.org, it's fairly easy since most of the elements have a consistent and unique CSS class.

From there, just use echo to print everything out, along with the BBCode tags.  Here's an example, for the trophy tiles:
echo '[IMG]http://www.ps3trophies.org/' . 
   $e->first_child()->first_child()->src . '[/IMG]';

See?  Simples.

Resources

You can find some more suggestions on how to use PHP Simple HTML DOM Parser here:  http://davidwalsh.name/php-notifications

Also, another DOM Parser worth checking out, called Ganon, can be found here: https://code.google.com/p/ganon/.  This one handles more complex HTML and allows for better modification of the HTML.
