My little author reading project is written in Ruby, my current scripting-language-of-choice.

Here's an example of what it takes to grab web pages and extract content from them:

# HTTPAccess2::Client comes from the http-access2 library;
# HTMLTree::XMLParser comes from html-parser.
client = HTTPAccess2::Client.new
url = "http://www.elliottbaybook.com/..."
parser = HTMLTree::XMLParser.new(false, false)
parser.feed(client.getContent(url))
xml = parser.document

books = []
xml.elements.each('//p[@class="small"]') do |node|
  event = BookEvent.new
  event.store = "Elliott Bay Book Company"
  event.location = "Elliott Bay Book Company"
  event.time = node.to_s.gsub(/<[^>]+>/, '')  # strip the tags, keep the text
  event.author = node.elements['./a[1]/b[1]'].text rescue nil
  event.title = nil
  event.note = node.elements['.'].to_s rescue ''

  next unless event.time and event.author

  if event.note =~ / at [0-9].* at ([^<>]*)/
    event.location = $1
  end

  event.time = BookTime.new_from_string(event.time)
  next unless event.time

  books.push(event)
end

The interesting bit is probably xml = parser.document; that's where Ruby's HTML parser hands its parse tree off to REXML, Ruby's XML engine. That lets me use REXML's XPath support to search through the HTML mess that most bookstores use on their web sites. In this case, all of the author reading events are inside of <p class="small"> tags, so I iterate through the matching tags and try to build a BookEvent object from each one. The author's name comes from an <a><b> block inside of the <p> block, and the time and location are extracted via regular expressions.
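Here's a minimal, self-contained sketch of that REXML/XPath step, run against an inline snippet of well-formed HTML instead of a live bookstore page (the markup, names, and URL here are made up for illustration):

```ruby
require 'rexml/document'

# Stand-in for the parse tree that HTMLTree would hand off to REXML.
html = <<-HTML
<html><body>
  <p class="small"><a href="/e/1"><b>Jane Author</b></a> reads at 7:30 pm at the Main Store.</p>
  <p class="large">Not an event listing.</p>
</body></html>
HTML

doc = REXML::Document.new(html)

authors   = []
locations = []
# Same tricks as the real script: iterate the matching <p> tags, pull the
# author out of the <a><b> block, and regexp the location out of the
# flattened text.
doc.elements.each('//p[@class="small"]') do |node|
  authors << node.elements['./a[1]/b[1]'].text
  text = node.to_s.gsub(/<[^>]+>/, '')  # strip the tags, keep the text
  locations << $1 if text =~ / at [0-9].* at ([^<>]*)/
end
```

The `<p class="large">` paragraph never matches the XPath, so only the real event listing gets processed.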

If bookstores had decent web pages, this'd be really easy, but as it is, I had to apply a few heuristics and flat-out guess at times, and I'll have to revisit the code every time they reformat their web sites. But Ruby worked out really well this time.
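To give a flavor of the guessing involved, here's a hypothetical sketch of the sort of heuristic BookTime.new_from_string might apply (the real implementation isn't shown in this post): pull an "H:MM am/pm" out of free-form text, and assume "pm" when no meridian is given, since readings are almost always evening events.

```ruby
# Hypothetical time-guessing helper; not the actual BookTime code.
def guess_event_time(text)
  return nil unless text =~ /\b(\d{1,2})(?::(\d{2}))?\s*(am|pm|a\.m\.|p\.m\.)?/i
  hour     = $1.to_i
  minute   = $2 || '00'            # "7" becomes 7:00
  meridian = ($3 || 'pm').delete('.').downcase  # no meridian? guess evening
  format('%d:%s %s', hour, minute, meridian)
end

guess_event_time("Reading at 7:30")  # guesses pm
```

A real version would also need to handle date lines, ranges like "7-9 pm", and listings with no time at all, which is where the flat-out guessing comes in.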