Problems with Rails and page caching

Posted by Scott Laird Tue, 09 Aug 2005 04:23:41 GMT

One of the biggest improvements in Typo 2.5 is page caching. By using Rails’s built-in page cache, we can get 100x the performance on many benchmarks without doing more than a few lines of work. This lets us serve high-volume weblogs (like weblog.rubyonrails.com) without requiring heroic measures like clustering.

Unfortunately, there are a number of hidden problems with Rails 0.13.1’s page cache implementation. We’ve had to work around several of them in order to get Typo 2.5 out the door.

Basic page cache usage

Enabling Rails’s page cache is amazingly simple–just add caches_page :actionname to the top of your controller class and the :actionname action will spit out page cache files automatically. A couple of small tweaks to Apache’s .htaccess file, and Apache will serve cached files all on its own without involving Rails. If a client asks for http://blog.example.com/articles/2005/08/08/foo, Apache will first check for an articles/2005/08/08/foo.html file in Typo’s public directory. If that file exists, then it’s sent off to the client without touching Rails at all.
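As a rough sketch of that lookup, the mapping from request path to cache file fits in a few lines of Ruby. The cache_file_for and serve helpers are hypothetical names used only for illustration; in a real setup this decision is made by Apache's rewrite rules, not by Ruby code.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: map a request path to the static file that the
# page cache would write and that the web server would look for.
def cache_file_for(request_path, public_dir)
  path = request_path[/\A[^?]*/].chomp("/")   # the file name ignores any query string
  path = "/index" if path.empty?              # the site root is stored as index.html
  File.join(public_dir, "#{path.delete_prefix("/")}.html")
end

# Serve the cached copy if it exists; otherwise fall through to Rails.
def serve(request_path, public_dir)
  file = cache_file_for(request_path, public_dir)
  File.file?(file) ? [:static, File.read(file)] : [:rails, nil]
end

Dir.mktmpdir do |public_dir|
  cached = File.join(public_dir, "articles/2005/08/08/foo.html")
  FileUtils.mkdir_p(File.dirname(cached))
  File.write(cached, "<html>cached</html>")

  p serve("/articles/2005/08/08/foo", public_dir).first  # a cached copy exists
  p serve("/articles/2005/08/09/bar", public_dir).first  # no file, so Rails handles it
end
```

The important property is that a hit never reaches Rails at all, which is where the speedup comes from and also why Rails gets no chance to veto a stale or wrong file.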

Sweeping

That part of caching is easy. It’s the other end that’s hard: sweeping the cache to remove stale cache entries. Rails provides a simple cache sweeper that can remove specified pages, but that’s not really good enough for us. With Typo, there are a number of events that end up touching a huge number of cached files. Adding a comment, for example, touches the cached article page, but it also changes the comment counter on the main index (if the article is still on the front page), the day, month, and year indexes, some number of category indexes, tag indexes, and potentially paginated versions of all of the above. The code to track these all down was trouble-prone and frequently missed one of the pages that needed to be changed; this led to stale caches. Even worse, some actions, like changing themes, need to invalidate all pages. Rails’s page cache doesn’t keep a list of cached pages, so there’s no clean way to sweep them all.

What we ended up doing was adding a page_caches table to the database and adding hooks to insert a new PageCache entry every time a page was cached. We also added a hook to remove entries from the page cache table whenever a page was manually swept, and then added a PageCache.sweep_all method to flush the entire page cache. For now, we’ve simply ripped out all of our old “smart” sweeping code and force a full sweep of the entire cache whenever anything substantial changes. Sooner or later we’ll start adding smart cache sweeping back in, but for now this works surprisingly well.
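The bookkeeping described above can be sketched without a database. This is a minimal illustration, assuming an in-memory set in place of the page_caches table; the PageCacheIndex class and its method names are hypothetical and are not Typo's actual code.

```ruby
require "set"
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the page_caches table: it remembers every file
# the page cache has written so that all of them can be removed later.
class PageCacheIndex
  def initialize
    @paths = Set.new
  end

  # Hook: call whenever a page is written to the cache.
  def record(path)
    @paths << path
  end

  # Hook: call whenever a single page is swept by hand.
  def sweep(path)
    FileUtils.rm_f(path)
    @paths.delete(path)
  end

  # Remove every cached page we know about, then forget them all.
  def sweep_all
    @paths.each { |path| FileUtils.rm_f(path) }
    @paths.clear
  end

  def size
    @paths.size
  end
end

Dir.mktmpdir do |public_dir|
  index = PageCacheIndex.new
  %w[index.html articles/foo.html].each do |name|
    file = File.join(public_dir, name)
    FileUtils.mkdir_p(File.dirname(file))
    File.write(file, "cached")
    index.record(file)
  end

  index.sweep_all   # for example, after a theme change
  puts Dir.glob(File.join(public_dir, "**", "*.html")).empty?
end
```

Keeping the list outside the filesystem is what makes a full sweep cheap: there is no need to guess which files under public/ were generated by the cache and which are real static content.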

Query Parameters and Aliasing

Another shortcoming of Rails’s page cache implementation shows up when you start using query strings. Asking for http://blog.example.com/articles?page=2 ends up handing the ?page=2 parameter to the static .html cache page if it exists instead of calling Rails to ask for page 2. Even worse–if this cached page doesn’t exist, then Rails will generate it and store it for future access, even though it’s the second page of the index, not the first.

Finally, and worst of all, in Typo http://blog.example.com/articles is actually equivalent to http://blog.example.com/, because the article index view is the default index page. This means that the cached page for http://blog.example.com/articles?page=2 is actually /index.html, so anyone visiting page 2 of the article index screws up the front page of the blog. There’s no easy way around this with Rails 0.13.1; for now we’ve had to do work to keep ?page= from paginating anything. There’s one point that we could interrupt the page cache process from inside of Typo, but it doesn’t have any way to see the @request object or any of the query strings.

Long-term, we’re going to need to patch Rails to add a cachable property to @request that gets set to false when there’s a query string present, and also tweak Apache’s rewrite rules to skip static files if a query string is present. That assumes that Apache is even able to do that–every time I read the mod_rewrite documentation I end up with a headache. Since Typo officially supports lighttpd as well as Apache, we’ll need to get both of them to do the right thing, which is far from trivial.
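To make the collision concrete, here is a small sketch of why a query string is invisible to the cache, along with one possible shape for the proposed flag. Both cache_name and cachable? are hypothetical helpers, not part of Rails 0.13.1, and the sketch does not model Typo's routing, which additionally aliases /articles to /.

```ruby
require "uri"

# Hypothetical: the cache file name is derived from the path alone,
# which is why the query string gets lost.
def cache_name(request_uri)
  path = URI.parse(request_uri).path.to_s.chomp("/")
  path = "/index" if path.empty?
  "#{path}.html"
end

# Hypothetical version of the proposed "cachable" flag: a response is only
# safe to page-cache when the request carried no query string.
def cachable?(request_uri)
  query = URI.parse(request_uri).query
  query.nil? || query.empty?
end

first_page  = "http://blog.example.com/articles"
second_page = "http://blog.example.com/articles?page=2"

# Both requests map to the same file, so whichever is cached first wins.
puts cache_name(first_page) == cache_name(second_page)
puts cachable?(first_page)
puts cachable?(second_page)
```

The same test would have to be mirrored in the web server's rewrite rules, otherwise a cached file would still be served for a request that carries a query string.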

Non 7-bit ASCII URLs and Caching

Finally, Rails screws up cached filenames when the URL has non-ASCII characters. So any URL with accented characters or any non-ASCII script is totally uncachable. At least with Apache and WEBrick, Rails sees non-ASCII characters in the URL encoded using the usual %XX URL-encoding scheme. Unfortunately, both servers actually look for unencoded filenames. So Rails writes out the cache file for /foö as public/fo%C3%B6.html (assuming UTF-8 encoding), but Apache actually looks for public/fo<C3><B6>.html (where <C3> is a byte with the value of C3 in hex). This is actually not all that hard to fix–just add a URI::Util.decode to the right place inside of Rails–but it’s not clear what the security implications of this are.
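A minimal sketch of the decoding step, assuming UTF-8 URLs. The decode_cache_path helper is hypothetical and is not the Rails fix itself. The traversal check is likewise an assumption: it is one way to address the security worry, since decoding can reveal ".." segments or NUL bytes that were hidden while escaped.

```ruby
# Hypothetical helper: turn %XX escapes back into raw bytes, so that the
# cache file name matches what the web server looks for on disk.
def decode_cache_path(encoded)
  raw = encoded.b.gsub(/%([0-9A-Fa-f]{2})/) { Regexp.last_match(1).hex.chr }

  # Decoding can reveal characters that were hidden while escaped, so
  # refuse anything that could step outside the public directory.
  if raw.include?("\0") || raw.split("/").include?("..")
    raise ArgumentError, "unsafe cache path: #{encoded}"
  end

  raw.force_encoding(Encoding::UTF_8)
end

puts decode_cache_path("/fo%C3%B6")   # the name now contains the raw UTF-8 bytes

begin
  decode_cache_path("/%2E%2E/%2E%2E/etc/passwd")
rescue ArgumentError => e
  puts e.message
end
```

Whether rejecting such paths is enough depends on where the decoding is done and what else consumes the decoded name, so treat the check as a starting point rather than a complete answer.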

Given all of these problems, I’ve been tempted to try using Rails’s action cache instead of the page cache–the action cache doesn’t let Apache serve the cached files directly, so Typo would have a brief chance to block the cache from handling specific files, and we could approach sweeping from the opposite direction. It’s not clear how big of a speedup the action cache would actually give us, though, compared to the massive win that we get from the page cache. We’d really like to keep using the page cache and fix all of its bugs so that it’s usable by other Rails users.

6 comments

Apache tuning for Rails and FastCGI

Posted by Scott Laird Wed, 20 Jul 2005 08:23:36 GMT

There’s a surprisingly small amount of documentation out there on tuning Apache for optimum Rails performance. Almost everyone mentions the first step (use FastCGI, not regular CGI), but that’s such a huge performance boost that it’s really obvious–waiting 2-3 seconds per hit for Rails to start up is an indicator that you’re doing something wrong.

Once you get past that, there’s not a lot of documentation. There are examples from place to place, but no one seems to discuss what they mean or why they should be used.

Ever since I switched to Typo, I’ve been seeing occasional HTTP 500 errors from Apache, suggesting that Apache was unable to talk to Typo. Looking in the logs shows that Apache was usually in the middle of restarting a FastCGI instance whenever the errors occurred. Digging through the mod_fastcgi documentation shows that FastCGI can work in three different modes with Apache:

  1. Static. FastCGI servers are started when Apache is reloaded and remain running.
  2. Dynamic. FastCGI server processes are started whenever a FastCGI URL is hit. Excess processes are killed off when there’s no traffic.
  3. External. Apache and your FastCGI app talk via TCP sockets.

Dynamic mode is the default, but that’s not a good fit for Rails, because of its slow startup time. Switching to static mode really helps. To do that, I added this line to /etc/apache2/apache.conf on my Debian server:

FastCgiServer /var/web/typo/public/dispatch.fcgi -idle-timeout 120 \
       -initial-env RAILS_ENV=production -processes 2

Notice that I had to list the full path to Rails’s dispatch.fcgi file; on some systems you may be able to get away with only listing public/dispatch.fcgi, but that will almost certainly not work if you’re using virtual hosting.

By default, FastCGI assumes that your server will respond to queries within 30 seconds. I added the -idle-timeout 120 parameter just so I can deal with really slow responses better. Typo’s article admin page currently tries to list all 466 articles on one page, and that can take over 30 seconds to process.

The -processes parameter tells Apache how many FastCGI processes should run for this application. For 95% of users, 1 or 2 will be best. If you get a lot of traffic, then raising this to 3-4x the number of CPUs in your system might get you slightly better performance.

Finally, the -initial-env bit makes sure that Rails runs in production mode, talking to my production DB and not returning error backtraces to the user.
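The rule of thumb for -processes can be turned into a small helper that prints a matching FastCgiServer line. This is only a sketch: the function names are hypothetical, the default of 2 and the multiplier of 3 are taken from the low end of the ranges given above, and the path is the one from this post.

```ruby
require "etc"

# Hypothetical helper applying the rule of thumb: 2 processes for most
# sites, a multiple of the CPU count for high-traffic ones.
def fastcgi_processes(cpus, high_traffic: false, per_cpu: 3)
  high_traffic ? cpus * per_cpu : 2
end

# Build the directive in the same form as the line shown earlier.
def fastcgi_server_line(dispatcher, processes, idle_timeout: 120, env: "production")
  "FastCgiServer #{dispatcher} -idle-timeout #{idle_timeout} " \
    "-initial-env RAILS_ENV=#{env} -processes #{processes}"
end

dispatcher = "/var/web/typo/public/dispatch.fcgi"
cpus = Etc.nprocessors

puts fastcgi_server_line(dispatcher, fastcgi_processes(cpus))
puts fastcgi_server_line(dispatcher, fastcgi_processes(cpus, high_traffic: true))
```

Each extra process is a full copy of Rails held in memory, so on a small server the lower number is usually the safer starting point.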

40 comments

Behavior: CSS-like application of Javascript

Posted by Scott Laird Sat, 02 Jul 2005 23:00:25 GMT

Lambda the Ultimate just pointed out a cool new Javascript tool that should make AJAX-ifying web sites much cleaner and more maintainable. By using Behavior, you can strip all of the ugly little <script> tags and onclick="" attributes out of your website and then specify all of the Javascript actions out of line using CSS selectors. Here’s the example from their website:

So, instead of this:

<li>
  <a onclick="this.parentNode.removeChild(this)" href="#">
    Click me to delete me
  </a>
</li>

You can use:

<ul id="example">
  <li>
    <a href="/someurl">Click me to delete me</a>
  </li>
</ul>

Then you feed something like this into Behavior:

var myrules = {
  '#example li' : function(el){
    el.onclick = function(){
      this.parentNode.removeChild(this);
    }
  }
};

Behaviour.register(myrules);

It’s a little too verbose to use in this example, but the basic mechanism is really cool. I’d love to see this extended one step further, with Behavior being able to parse a configuration more like this:

#example li:onclick {
  this.parentNode.removeChild(this);
}

You’d drop that into a file on your web server, say mylayout.jcss. Then you’d have a block like this at the top of the HTML file:

<script>Behavior.import("mylayout.jcss");</script>

I’m not exactly a Javascript wiz, but this looks vastly cleaner to me. I’d love to see something like this included into a future release of Rails.
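Since the wish is for something like this in Rails, here is a rough Ruby sketch of how the proposed syntax could be translated on the server into the rules object that Behaviour.register already accepts. Everything here is hypothetical: the .jcss format exists only in this post, parse_jcss and to_behaviour_js are made-up names, and the parser assumes that handler bodies contain no braces of their own.

```ruby
require "json"

# Hypothetical grammar for the proposed syntax:  selector:event { body }
# Assumes handler bodies contain no braces of their own.
JCSS_RULE = /([^{}]+?):(on\w+)\s*\{([^{}]*)\}/m

def parse_jcss(source)
  source.scan(JCSS_RULE).map do |selector, event, body|
    { selector: selector.strip, event: event, body: body.strip }
  end
end

# Emit the equivalent Behaviour rules object as JavaScript source.
def to_behaviour_js(rules)
  entries = rules.map do |rule|
    "  #{rule[:selector].to_json} : function(el){\n" \
    "    el.#{rule[:event]} = function(){ #{rule[:body]} }\n" \
    "  }"
  end
  "var myrules = {\n#{entries.join(",\n")}\n};\nBehaviour.register(myrules);"
end

jcss = <<~JCSS
  #example li:onclick {
    this.parentNode.removeChild(this);
  }
JCSS

puts to_behaviour_js(parse_jcss(jcss))
```

One limitation of this shape is that two rules with the same selector but different events would collide as keys in the generated object, so a fuller version would need to group rules by selector first.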

no comments

Wacko finance.yahoo.com/mediaplex.com BMW Ad

Posted by Scott Laird Mon, 09 May 2005 20:07:49 GMT

A co-worker alerted me to a possible spyware problem on his Mac this morning–anytime he went to finance.yahoo.com, all of the ‘e’s in the body text of the page were replaced with ‘3’s linked to one of mediaplex.com’s ad servers. He was concerned that some nasty bit of spyware was infesting his Mac; today’s big Firefox security issues made him a bit nervous.

I couldn’t easily reproduce this on my Mac, so we went through his Firefox configs and couldn’t find anything out of the ordinary. Then we took a look at the source code for the page and saw this (sorry about the long lines; they’re that way in the original):

    <style xmlns="" type="text/css">
    @import url("http://us.js1.yimg.com/us.yimg.com/lib/hdr/ygma.css");
  </style></head><body><!-- <script>function yfi_scraper(){var url='http://us.ard.yahoo.com/SIG=1247e20ui/M=342581.6409333.7385346.1829737/D=fin/S=7037371:FAD/EXP=1115674357/A=2709560/R=0/SIG=12auoe33c/*http://adfarm.mediaplex.com/ad/ck/1433-28823-1039-3?mpt=1115667157014528',tg=document.getElementsByTagName('b');for(var i=0;i<tg.length;i++){var el=tg[i];if(el.className=='e0'||el.className=='e1'||el.className=='e2'||el.className=='e3'){var st=el.innerHTML;var ct=new Array();for(var j=0;j<st.length;j++){var ch=st.substring(j,j+1);if(ch.toLowerCase()=='e'){ch='<a href="'+url+'">3</a>';}ct[ct.length]=ch;el.innerHTML=ct.join('');}}}}if(document.all&&document.getElementById)setTimeout(yfi_scraper,4500);</script>--><script xmlns="" type="text/javascript">

The long Javascript line is what causes the problem–it replaces all of the ‘e’s with ‘3’s linked to an advertising site, but not until a timeout has expired. So, either Yahoo put this there on purpose, or someone attached to one of their ad providers has the ability to stick random Javascript into their pages.

At this point, we finally decided to click on one of the ‘3’ links and found an ad for the new BMW 3-series cars. Suddenly the whole thing makes sense–it’s a weird advertising campaign for BMW.

I’m kind of amazed by this–Yahoo is willing to let advertisers deface Yahoo’s websites? I find this really repugnant.

Update: A lot of people have already noticed this, including The Motley Fool, Adjab.com, and Two Four One. I suspect that Technorati will have a lot of other comments shortly.

2 comments

Apache ToS marking?

Posted by Scott Laird Fri, 29 Apr 2005 00:15:59 GMT

I’ve spent a fair bit of effort getting QoS on my home DSL link working right, so VoIP isn’t overwhelmed by downloads or by people hitting my web server. At this point, I’m down to one remaining problem–when Google and friends fire up their web crawlers and find a new directory full of JPEGs, they can slow other HTTP traffic to a crawl.

If I could tell Apache (2.0) to change the IP ToS flags associated with HTTP web crawler traffic, then my network’s QoS config would do the right thing and send user-driven HTTP traffic ahead of web crawler traffic. Unfortunately, I don’t see any obvious way to do this. I’d rather filter based on HTTP User-Agent, not network block, and that means either using a really smart packet filter or having my web server do the work on its own. And, as far as I can see, Apache 2 doesn’t have a ToS-setting module available. Dean Gaudet wrote mod_iptos for Apache 1.3, but it hasn’t been ported to Apache 2, and I’m not very eager to do it myself.

Does anyone have any suggestions?
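For reference, the socket-level half of the problem is small. The sketch below is not an Apache module; it only shows, in Ruby, what "mark this connection based on User-Agent" amounts to once the server has read the request headers. The ToS value, the crawler pattern, and the helper names are all assumptions chosen for illustration.

```ruby
require "socket"

# DSCP class selector 1 (a low-priority class) in the ToS byte. This is an
# assumption; any value that the local QoS rules recognise would do.
LOW_PRIORITY_TOS = 0x20

# Hypothetical, deliberately crude User-Agent test.
CRAWLER_PATTERN = /bot|crawler|spider|slurp/i

def crawler?(user_agent)
  user_agent.to_s.match?(CRAWLER_PATTERN)
end

# Mark an accepted connection so that packets sent on it carry the
# low-priority ToS value. This has to happen after the request headers
# are read but before the response is written.
def mark_if_crawler(socket, user_agent)
  return false unless crawler?(user_agent)

  socket.setsockopt(Socket::IPPROTO_IP, Socket::IP_TOS, LOW_PRIORITY_TOS)
  true
end

puts crawler?("Googlebot/2.1 (+http://www.google.com/bot.html)")
puts crawler?("Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O) Firefox/1.0")
```

The hard part is the one described above: getting a hook at the right point inside the web server, where both the User-Agent header and the client socket are available.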

1 comment

Use Lynx, go to jail

Posted by Scott Laird Thu, 27 Jan 2005 15:25:09 GMT

This is almost too much to believe:

A Londoner made a tsunami-relief donation using lynx – a text-based browser used by the blind, Unix-users and others – on Sun’s Solaris operating system. The site-operator decided that this “unusual” event in the system log indicated a hack-attempt, and the police broke down the donor’s door and arrested him.

(From Boing Boing)

no comments

TV Pilots via BitTorrent?

Posted by Scott Laird Tue, 21 Dec 2004 16:40:49 GMT

I’ve noticed over the past few weeks that most popular TV shows are available via BitTorrent. They’re generally edited to remove commercials. They’re frequently downscaled from HDTV sources, which means that their quality is fantastic. Modern video codecs can compress a 45-minute show into around 350 MB, which BitTorrent can download in the background in a matter of hours. Better yet, the very nature of BitTorrent means that the more users downloading a given file, the better the available bandwidth, because each downloaded copy is also available for upload; it’s not uncommon to see BitTorrent clients start sharing pieces of downloaded files within seconds of the download starting. This means that large files can be widely shared without a massive investment in download bandwidth.

According to the news this week, Hollywood has finally noticed BitTorrent and is moving to stop the rampant sharing of their property. I’m amazed that it’s taken them so long to get involved.

However, in the midst of their attack, I think they may have missed an opportunity. Hollywood and the TV networks produce a lot of content annually, and quite a bit of that is really just advertising. Hollywood trailers are really just ads for the full-length movie. Most TV show pilots are ads for the rest of the series. They’re teasers, intended to hook viewers and get them to pay (either in movie tickets or eyeball time) for the full product. In both cases, the media companies have produced copyrighted works that they really want people to watch, even if they aren’t directly compensated for the experience. The more widely they’re distributed, the more effective they are. This should lead directly to profits on the “real” product–the movie or TV series involved.

So, logically, media companies could come out ahead by producing sharable versions of their trailers and pilots, and then going ahead and sharing them themselves. With BitTorrent, they’d even have decent download statistics–they’d know how many people had downloaded things.

Of course, I don’t see this happening anytime soon. First, the last thing that the media companies want to do is to tell people “go install a BitTorrent client.” Actually, that’s the second-to-last thing–the last thing they want to do is to legitimize P2P filesharing. Even if they can get past those two issues, and get over the conceptual hurdles that follow them (“Download TV? That’s what pirates do, not media companies. We don’t do that.”), they’d still be left with a relatively small market–I doubt that there are more than a million people out there downloading and watching TV shows.

It’s an interesting opportunity for someone, though. First, the first company to do this will get an enormous PR boost. Second, there’s no real limit on how many different versions of a show they can distribute–they could do full HDTV, 640x480, and smaller sizes, all the way down to versions for mobile devices. The mobile aspect is another PR opportunity, and possibly even a VC opportunity.

So, while I don’t see this happening soon, and I certainly don’t see widespread adoption of this sort of thing by Hollywood, I’d be amazed if someone doesn’t take it up within the next couple years, even if it’s just for the PR burst.

1 comment

Gmail invites, again

Posted by Scott Laird Mon, 20 Sep 2004 16:37:01 GMT

I have another half-dozen gmail invites. Leave a comment if you’re still looking for one.

21 comments

A very special hell

Posted by Scott Laird Wed, 08 Sep 2004 15:10:15 GMT

I was browsing meetup.com yesterday when I discovered something very, very scary:

slashdot.meetup.com

I have this mental image of a room full of your average knee-jerk slashdot posters, average age about 12, all yelling “first post” for hours on end. I doubt that the reality is that bad, but I’m not taking any chances.

no comments

Gmail invites

Posted by Scott Laird Thu, 26 Aug 2004 21:27:32 GMT

I have 4 spare gmail invites. Anyone want one?

12 comments

Apparently Google likes me too

Posted by Scott Laird Mon, 21 Jun 2004 23:51:42 GMT

A few months back, MSN’s search engine decided that this blog was a great source of information on Paris Hilton videos, and decided to feed me tons of traffic.

Today was Google’s turn. One of my SpaceShip One entries is showing up on the first page of search results for “spaceship one” on at least a couple of Google’s servers. Surprisingly, this is only generating a couple dozen hits per hour this afternoon.

no comments

Wow, it looks like a bad day to be Six Apart

Posted by Scott Laird Fri, 14 May 2004 16:14:25 GMT

The new pricing for Movable Type (the software that runs this blog, along with a zillion others) is out. The previous release was free for non-commercial use. The new release is free–if you only have one author and fewer than 3 blogs. Any more than that, and you need to pay. The cheapest license is $99, marked down to $69 for the moment, and that only covers 3 authors. For commercial users, pricing starts at $299 (on sale now–only $199) for 5 authors, and goes up to $699 ($599) for 20 authors/15 blogs.

Needless to say, this is causing a bit of an uproar, and a lot of people are looking at switching from MT to other systems.

I guess I’m probably one of them. I’ve been half-heartedly looking for a different system for a while, but my needs are kind of unusual (as usual :-). Here’s a short list of what I’m looking for:

  • Simple, customizable blog engine, supports RSS and Atom, as well as at least one API supported by Ecto and NetNewsWire’s editor. Atom API support would be nice, but not all that critical, since I don’t have an Atom-aware editor yet.
  • Trackback and comment support. Preferably threaded. I actually like the concept behind Six Apart’s TypeKey, but that’s too much to ask, probably.
  • Support for non-blog pages. Take a look at http://svn.scottstuff.net for an example. Most of the pages are auto-generated, but I’d like to be able to share the template with my blog, and it’d be nice to be able to use the same comment engine.
  • Support for the Markdown markup language. I’ve found it to be vastly easier to work with than writing raw XHTML. That’s not to say that HTML is hard, but Markdown really lowers the amount of effort required.
  • Decent comment-spam tools. Admittedly, most comment spam is keyed to MT’s comment system, but that’ll change.
  • Tools for converting from MT. I don’t mind spending a bit of time on this, and I only have 190-ish posts here, but I’m not throwing them away, and it’d be nice to save the comments and trackbacks, too.
  • A photo gallery system that doesn’t suck. Since I haven’t found one that doesn’t suck yet, this is a difficult requirement. My goal is to be able to maintain one big master index in iView MediaPro on my Mac, and then sync the pictures and metadata onto my server from time to time, mostly using rsync and XML. Then, I want an automated script to pre-render thumbnails (on-demand thumbnails of 6 MP images are too slow for my poor server) and lay everything out. I’m currently using Album, but I’m not particularly fond of it. It just works better than anything else I’ve used. Systems that require manual, non-scriptable uploading of individual images need not apply.
  • A semi-integrated Wiki’d be nice, but I doubt I’d use it any time soon.
  • It needs to be scriptable and easy to enhance. Ideally, it’d be written in a language that I’m comfortable with; Ruby’d be best, and Perl’s okay. I can cope with Python and PHP, but I don’t really like either. A decent XML RPC/SOAP/REST interface would be nice, too.

If anyone has any suggestions, please leave comments. I suspect I’ll hear at least one recommendation for Drupal, but it’d be nice to hear other suggestions too. Can Drupal handle semi-static non-blog pages easily?

1 comment

Sir Tim

Posted by Scott Laird Wed, 31 Dec 2003 02:04:41 GMT

My first real news item that I caught via Localfeeds:

Congratulations to Tim Berners-Lee, who on Thursday becomes a Knight Commander of the Order of the British Empire. (from bestkungfu)

I don’t know how I’d missed this earlier. It seems like real news to me.

no comments