My little author reading project is written in Ruby, my current scripting-language-of-choice.

Here's an example of what it takes to grab web pages and extract content from them:
parser =, false)
parser.feed(html)        # html holds the page source, fetched earlier
xml = parser.document

events = []
xml.elements.each('//p[@class="small"]') do |node|
  event = = node.elements['./a[1]/b[1]'].text rescue nil
  event.note = node.elements['.'].to_s rescue ''
  event.location = "Elliott Bay Book Company"
  text = node.to_s.gsub(/<[^>]+>/, '')          # strip tags, keep the prose
  event.time = $1 if text =~ / at ([0-9][^,<]*)/
  event.location = $1 if event.note =~ / at [0-9].* at ([^<>]*)/
  next unless event.time and
  events << event
end


The interesting bit is probably xml=parser.document; that's where Ruby's HTML parser hands its parse tree off to Ruby's XML engine, REXML. This lets me use REXML's XPath engine for searching through the HTML mess that most bookstores use on their web sites. In this case, all author reading events are inside of <p class="small"> tags, so I iterate through all of the matching tags and try to create a BookEvent object from each. The author name comes from a <a><b> block inside of the <p> block, and the time and location are extracted via regular expressions.

If bookstores had decent web pages, this'd be really easy, but as it is, I had to apply a few heuristics and flat-out guess at times, and I'll have to revisit the code every time they reformat their web sites. But Ruby worked out really well this time.
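To give a flavor of the guessing involved (this isn't my actual code, just a sketch of the kind of heuristic I mean, and guess_time is a made-up name): bookstore pages write times as "7pm", "7:30 p.m.", "730 pm", and so on, so you end up normalizing whatever you find into a best-guess 24-hour time.

```ruby
# Guess a HH:MM time from free-form text, or return nil if nothing
# time-like turns up. Deliberately loose; wrong guesses are the price.
def guess_time(text)
  return nil unless text =~ /(\d{1,2})(?::?(\d{2}))?\s*([ap])\.?m?\.?/i
  hour, min, half = $1.to_i, ($2 || '00'), $3.downcase
  hour += 12 if half == 'p' && hour < 12   # 7pm -> 19
  hour = 0   if half == 'a' && hour == 12  # 12am -> 00
  format('%02d:%s', hour, min)
end
```

It happily misreads things like years followed by the word "at", which is exactly why the code needs revisiting whenever a site changes.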