HTML screen-scraping in Ruby

Posted by Scott Laird Sat, 01 Nov 2003 20:54:13 GMT

My little author reading project is written in Ruby, my current scripting-language-of-choice.

Here’s an example of what it takes to grab web pages and extract content from them:

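require 'http-access2'   # provides HTTPAccess2::Client
require 'html/xmltree'   # provides HTMLTree::XMLParser (htmltools gem)

# BookEvent and BookTime are this project's own classes, defined elsewhere.
books = []  # assumed to be a plain Array that collects the parsed events

client = HTTPAccess2::Client.new
url = "http://www.elliottbaybook.com/..."
parser = HTMLTree::XMLParser.new(false, false)
parser.feed(client.getContent(url))
xml = parser.document  # REXML document built from the HTML parse tree

xml.elements.each('//p[@class="small"]') do |node|
  event = BookEvent.new
  event.store = "Elliott Bay Book Company"
  event.location = "Elliott Bay Book Company"
  # Strip the tags to get the paragraph's plain text.
  event.time = node.to_s.gsub(/<[^>]+>/, '')
  event.author = node.elements['./a[1]/b[1]'].text rescue nil
  event.title = nil
  event.note = node.elements['.'].to_s rescue ''

  # Skip paragraphs that don't look like an event listing.
  next unless event.time and event.author

  # A second " at " after the time names an off-site venue.
  if event.note =~ / at [0-9].* at ([^<>]*)/
    event.location = $1
  end

  event.time = BookTime.new_from_string(event.time)
  next unless event.time

  books.push(event)
end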

The interesting bit is probably the parser.document call; that’s where the HTML parser hands its parse tree off to Ruby’s XML engine, REXML. This lets me use REXML’s XPath engine to search through the HTML mess that most bookstores use on their websites. In this case, all author reading events are inside <p class="small"> tags, so I iterate over the matching tags and try to create a BookEvent object from each. The author name comes from an <a><b> block inside the <p> block, and the time and location are extracted via regular expressions.
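To try the XPath part on its own, here’s a minimal sketch that uses only REXML. The markup in SAMPLE is made up for illustration and is already well-formed, so it skips the HTML-to-REXML conversion step entirely; real bookstore pages usually won’t parse this cleanly.

```ruby
require 'rexml/document'

# Hypothetical, well-formed stand-in for a bookstore events page.
SAMPLE = <<~HTML
  <html><body>
    <p class="small">Monday, November 3 at 7:30 pm <a href="/a"><b>Jane Author</b></a> reads from her new novel.</p>
    <p class="large">Store hours are posted at the door.</p>
    <p class="small">This paragraph has no linked author.</p>
  </body></html>
HTML

doc = REXML::Document.new(SAMPLE)

authors = []
texts = []
doc.elements.each('//p[@class="small"]') do |node|
  # The author is the first <b> inside the first <a> of the paragraph.
  bold = node.elements['./a[1]/b[1]']
  next unless bold  # not an event listing

  authors << bold.text
  # Strip the tags to recover the paragraph's plain text.
  texts << node.to_s.gsub(/<[^>]+>/, '')
end

puts authors.inspect
puts texts.inspect
```

Only the first paragraph yields an event here: the second has the wrong class, and the third matches the XPath but has no <a><b> pair, so it is skipped.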

If bookstores had decent web pages, this’d be really easy, but as it is, I had to apply a few heuristics and flat-out guess at times, and I’ll have to revisit the code every time they reformat their websites. Still, Ruby worked out really well this time.
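As an illustration of that kind of heuristic, here’s a rough sketch of the location guess pulled out into a helper. The name guess_location, the default argument, and the sample strings are all made up for this example; only the regular expression comes from the code above.

```ruby
DEFAULT_LOCATION = "Elliott Bay Book Company"

# Heuristic: if a second " at " follows the time ("at 7:30 pm at <venue>"),
# treat what comes after it as an off-site venue; otherwise use the store.
def guess_location(note, default = DEFAULT_LOCATION)
  match = note.match(/ at [0-9].* at ([^<>]*)/)
  match ? match[1].strip : default
end

# Hypothetical listing text, not taken from the real site.
off_site = guess_location("Monday at 7:30 pm at Town Hall</p>")
in_store = guess_location("Monday at 7:30 pm")

puts off_site
puts in_store
```

Keeping the guess in one small function at least means there’s only one place to fix when a site changes how it words its listings.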


Comments

  1. Jordan said over 2 years later:

    I tried borrowing some of your code, but it’s very hard to tell where you’re getting all of these classes from (like HTMLTree::XMLParser). Could you please list the files/gems involved?

  2. Benj said over 2 years later:

    It looks like you need the ‘htmltools’ gem.

  3. Good design! said over 2 years later:

    Good design!

  4. Great work! said over 2 years later:

    Great work!

  5. Michael D Smith said over 3 years later:

    require 'html/xmltree'

  6. http://www.dittmarkooperation.de said over 3 years later:

    very helpful
