Here's an example of what it takes to grab web pages and extract content from them:
    client = HTTPAccess2::Client.new
    url = "http://www.elliottbaybook.com/..."
    parser = HTMLTree::XMLParser.new(false, false)
    parser.feed(client.getContent(url))
    xml = parser.document
    xml.elements.each('//p[@class="small"]') do |node|
      event = BookEvent.new
      event.store = "Elliott Bay Book Company"
      event.location = "Elliott Bay Book Company"
      event.time = node.to_s.gsub(/<[^>]+>/, '')
      event.author = node.elements['./a/b'].text rescue nil
      event.title = nil
      event.note = node.elements['.'].to_s rescue ''
      next unless event.time and event.author
      if event.note =~ / at [0-9].* at ([^<>]*)/
        event.location = $1
      end
      event.time = BookTime.new_from_string(event.time)
      next unless event.time
      books.push(event)
    end
The interesting bit is probably xml=parser.document; that's where Ruby's HTML parser hands its parse tree off to Ruby's XML engine, REXML. This lets me use REXML's XPath engine to search through the HTML mess that most bookstores use on their web sites. In this case, all author reading events live inside <p class="small"> tags, so I iterate through the matching tags and try to create a BookEvent object from each. The author name comes from an <a><b> block inside the <p> block, and the time and location are extracted via regular expressions.
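To make that concrete, here's a minimal, self-contained sketch of the XPath and regex steps using REXML directly. The page snippet, author name, time, and store name below are invented for illustration, and the well-formed fragment stands in for the real (much messier) bookstore HTML:

```ruby
require 'rexml/document'

# Invented stand-in for a bookstore events page (the real HTML is far messier).
page = <<-EOF
<html><body>
  <p class="small"><a href="/e/1"><b>Jane Author</b></a> reads at 7:30 pm at Main Store</p>
  <p class="small">Store closed for inventory</p>
</body></html>
EOF

doc = REXML::Document.new(page)
events = []
doc.elements.each('//p[@class="small"]') do |node|
  author = node.elements['./a/b']        # the <a><b> block holds the author name
  next unless author                     # skip paragraphs that aren't events
  text = node.to_s.gsub(/<[^>]+>/, '')   # strip tags to get the plain text
  # " at <digit>... at <place>" pulls the location out of the stripped text
  location = text =~ / at [0-9].* at ([^<>]*)/ ? $1 : nil
  events << [author.text, location]
end
# events => [["Jane Author", "Main Store"]]
```

The second paragraph has no nested <a><b>, so it's skipped, which is the same filtering the real code does with its next unless guards.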
If bookstores had decent web pages, this would be really easy; as it is, I had to apply a few heuristics and flat-out guess at times, and I'll have to revisit the code every time they reformat their web sites. But Ruby worked out really well this time.