HTML screen-scraping in Ruby
Posted by Scott Laird Sat, 01 Nov 2003 20:54:13 GMT
My little author reading project is written in Ruby, my current scripting-language-of-choice.
Here’s an example of what it takes to grab web pages and extract content from them:
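require 'http-access2'   # provides HTTPAccess2::Client
require 'html/xmltree'   # from htmltools; provides HTMLTree::XMLParser

client = HTTPAccess2::Client.new
url = "http://www.elliottbaybook.com/..."
# Parse the (messy) HTML, then hand the tree over to REXML.
parser = HTMLTree::XMLParser.new(false, false)
parser.feed(client.getContent(url))
xml = parser.document
# Each author reading sits in its own <p class="small"> element.
xml.elements.each('//p[@class="small"]') do |node|
  event = BookEvent.new               # BookEvent and BookTime are my own classes
  event.store = "Elliott Bay Book Company"
  event.location = "Elliott Bay Book Company"
  event.time = node.to_s.gsub(/<[^>]+>/, '')   # strip tags, keep the text
  event.author = node.elements['./a[1]/b[1]'].text rescue nil
  event.title = nil
  event.note = node.elements['.'].to_s rescue ''
  next unless event.time and event.author
  # Some events happen off-site: "... at 7:30 pm at <location>"
  if event.note =~ / at [0-9].* at ([^<>]*)/
    event.location = $1
  end
  event.time = BookTime.new_from_string(event.time)
  next unless event.time
  books.push(event)                   # books is an Array defined elsewhere
end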
The interesting bit is probably xml=parser.document; that’s where Ruby’s HTML parser hands its parse tree off to Ruby’s XML engine, REXML. This lets me use REXML’s XPath engine for searching through the HTML mess that most bookstores use on their web sites. In this case, all author reading events are inside <p class="small"> tags, so I iterate through all of the matching tags and try to create a BookEvent object from each. The author name comes from an <a><b> block inside the <p> block, and the time and location are extracted via regular expressions.
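To show the XPath-plus-regex step in isolation, here is a minimal sketch that uses only REXML from Ruby's standard distribution. It skips the HTML parser entirely and assumes the markup is already well-formed; the sample markup, the extract_events helper, and the hash keys are all hypothetical and stand in for the real page and the BookEvent class.

```ruby
require 'rexml/document'

# Hypothetical, well-formed sample standing in for a bookstore page.
SAMPLE_HTML = <<~HTML
  <html><body>
    <p class="small"><a href="/e/1"><b>Jane Author</b></a>, Monday, November 3 at 7:30 pm at Town Hall</p>
    <p class="other">Store hours: 10 to 6</p>
    <p class="small">No author link in this one</p>
  </body></html>
HTML

# Returns one hash per <p class="small"> that contains an <a><b> author.
def extract_events(markup)
  doc = REXML::Document.new(markup)
  events = []
  doc.elements.each('//p[@class="small"]') do |node|
    bold = node.elements['./a[1]/b[1]']
    next unless bold                              # no author, so skip it
    text = node.to_s.gsub(/<[^>]+>/, '').strip    # drop tags, keep the text
    # Same heuristic as above: "... at <time> at <location>"
    location = text[/ at [0-9].* at ([^<>]*)/, 1]
    events << { author: bold.text, text: text, location: location }
  end
  events
end

extract_events(SAMPLE_HTML).each do |event|
  puts "#{event[:author]} @ #{event[:location] || 'in store'}"
end
```

Returning plain hashes keeps the sketch free of any classes that aren't shown in the post; in the real script each hash would be a BookEvent.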
If bookstores had decent web pages, this’d be really easy, but as it is, I had to apply a few heuristics and flat out guess at times, and I’ll have to revisit the code every time they reformat their web sites. But Ruby worked out really well this time.

I tried borrowing some of your code, but it’s very hard to tell where you’re getting all of these classes from (like HTMLTree::XMLParser). Could you please list the files/gems involved?
It looks like you need the ‘htmltools’ gem.
Good design!
Great work!
require 'html/xmltree'
Very helpful.