There’s an interesting thread going on right now on the Linux netdev mailing list, speculating about the network accelerator technology that Intel’s been talking about recently. No one’s quite sure what Intel is planning on adding, but for the past several years “network accelerator” has usually meant TCP offload engines (ToE), and Linux’s core networking guys are almost famously anti-ToE. Even though no one really knows what Intel’s up to, there’s a feeling that it’s not just ToE this time.

Several people have pointed out other technologies that can make a huge difference without requiring the sorts of compromises that ToE needs to work. For instance, this post by Lennert Buytenhek suggests that PCI and memory-system latency is a big problem, and that fixing it can have huge payoffs:

The reason a 1.4GHz IXP2800 processes 15Mpps while a high-end PC hardly does 1Mpps is exactly because the PC spends all of its cycles stalling on memory and PCI reads (i.e. ‘latency’), and the IXP2800 has various ways of mitigating this cost that the PC doesn’t have. First of all, the IXP has 16 cores which are 8-way ‘hyperthreaded’ each (128 threads total.)

I haven’t paid much attention to Intel’s IXP network processor family in the past, and that may be a mistake. From the description here, the IXP2800 sounds like a cross between Tera’s multithreaded CPU and IBM’s new Cell processor. Tera’s CPU was designed to support tons of threads, and it automatically switched between them whenever one blocked on I/O or a memory access; the goal was to stay efficient even as the gap between CPU and memory speeds continued to grow. The IXP2800 isn’t as ambitious as the Tera, but the fundamental concept looks similar: support lots of threads in hardware, and switch whenever latency gets in the way. The IXP2800’s threaded engines aren’t full-blown processors, though. Like the Cell, the IXP2800 pairs one general-purpose CPU with a cluster of smaller processors, each specialized for a narrow task.
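
Just to make the latency-hiding idea concrete, here’s a toy model of per-core packet throughput as hardware threads hide a fixed memory stall. The compute and stall times are made-up round numbers for illustration, not real IXP2800 figures:

    /* Toy model of latency hiding with hardware threads.  The numbers
     * below are invented for illustration; they are not IXP2800 specs.
     * Each packet needs some compute time plus one memory/PCI stall.
     * A lone thread sits idle during the stall; with N threads, other
     * threads run during the stall until the core's compute capacity
     * is saturated. */
    #include <stdio.h>

    int main(void)
    {
        const double compute_ns = 50.0;   /* assumed per-packet CPU work          */
        const double stall_ns   = 300.0;  /* assumed memory/PCI stall, per packet */

        for (int threads = 1; threads <= 16; threads *= 2) {
            /* If N threads generate at least compute+stall worth of work,
             * the stall is fully hidden and compute time is the limit;
             * otherwise the threads just split the full stall+compute cost. */
            double per_packet_ns = (threads * compute_ns >= compute_ns + stall_ns)
                                       ? compute_ns
                                       : (compute_ns + stall_ns) / threads;
            printf("%2d thread(s): ~%.1f Mpps per core\n",
                   threads, 1000.0 / per_packet_ns);
        }
        return 0;
    }

With those made-up numbers, a single thread spends most of its time waiting on memory, while eight threads are already enough to keep the core busy through the entire stall.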

It’s unlikely that Intel will roll something like this into their Xeon CPUs anytime soon, though. It’s certainly not a quick fix: it’d require major changes in any OS that wanted to make use of it, and it would probably take 3-6 years before software really took full advantage of it.

Massively multithreaded CPUs aren’t the only approach that has paid off for dedicated network processors, though. Some of Freescale’s and Broadcom’s chips know how to pre-populate the CPU’s cache with the headers of recently received packets. That drastically cuts latency, but it seems to require that the CPU and the network interface be very tightly coupled. Reducing the overhead needed to talk to the NIC can help, too: apparently some of Intel’s 865- and 875-based motherboards use a version of their GigE chip that connects directly to the north bridge, bypassing the PCI bus entirely, and some benchmarks show substantial improvements.
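
A stock PC can’t pre-populate its cache from the NIC, but drivers can approximate the idea in software with prefetch hints, issuing the load for the next packet’s header while the current one is still being parsed. Here’s a minimal sketch using GCC’s __builtin_prefetch; the descriptor layout and the process_packet() helper are hypothetical, not taken from any real driver:

    /* Software approximation of header pre-population: ask the CPU to
     * start pulling the next packet's header into cache while the
     * current packet is being processed.  The rx_desc layout and
     * process_packet() are hypothetical stand-ins, not a real driver's
     * structures; __builtin_prefetch is the only real API used. */
    #include <stddef.h>
    #include <stdint.h>

    struct rx_desc {              /* hypothetical receive descriptor  */
        void    *data;            /* packet contents in host memory   */
        uint16_t len;
    };

    static void process_packet(const void *pkt, uint16_t len)
    {
        (void)pkt; (void)len;     /* stand-in for real protocol work  */
    }

    void rx_poll(struct rx_desc *ring, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            if (i + 1 < count) {
                /* Read-prefetch the next header with high temporal
                 * locality; by the next iteration it should already
                 * be sitting in L1 instead of costing a DRAM stall. */
                __builtin_prefetch(ring[i + 1].data, 0, 3);
            }
            process_packet(ring[i].data, ring[i].len);
        }
    }

It’s a much weaker version of what the Freescale and Broadcom parts do, since the prefetch still has to wait for the data to reach RAM in the first place, but it’s the same basic trick.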

Reading the thread suggests that most of the effort going into Linux network optimization over the next few years will be happening on the receive end of things. Over the past several years, most higher-end NICs have added limited support for checksum generation and TCP segmentation offload (TSO), where the CPU hands the NIC a block of data and a TCP header template and the NIC produces the stream of TCP packets on its own, without the CPU ever touching the data. Relatively little has happened on the receive side, but that seems to be changing. For example, Neterion’s newest card can separate headers from data, and it is nearly able to re-assemble TCP streams on its own, sort of the inverse of transmit-time TSO. It’s not clear how many streams the card can handle at a time, though. Even my little web server at home is currently maintaining 384 simultaneous TCP connections, and a busy system could easily have tens or hundreds of thousands of open streams. Odds are, throwing 100,000 streams at the card would run it out of RAM and completely negate any benefit that receive offloading would have, unless it’s bright enough to handle the 1,000 or so fastest streams and let the main CPU deal with the 99,000 that are dribbling in data at 28k modem speeds.
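
For reference, the transmit-side TSO mechanism described above boils down to something like the following sketch. The structures are simplified, made-up stand-ins (real hardware also has to patch up IP lengths, IDs, and checksums for every segment, which this skips):

    /* Rough sketch of transmit-side TSO: one header template plus one
     * large payload in, a stream of MSS-sized segments out, with the
     * sequence number advanced for each segment.  The types here are
     * simplified stand-ins, not the kernel's or any NIC's real
     * descriptors, and checksum/IP fixups are omitted. */
    #include <stdint.h>
    #include <stdio.h>

    struct tcp_tso_template {     /* hypothetical, heavily simplified */
        uint32_t seq;             /* starting TCP sequence number     */
        uint16_t mss;             /* maximum segment size             */
    };

    /* Stand-in for the NIC putting header + data on the wire. */
    static void emit_segment(uint32_t seq, const uint8_t *data, uint16_t len)
    {
        (void)data;
        printf("segment seq=%u len=%u\n", seq, len);
    }

    static void tso_segment(const struct tcp_tso_template *tmpl,
                            const uint8_t *payload, uint32_t total)
    {
        for (uint32_t off = 0; off < total; ) {
            uint16_t len = (total - off > tmpl->mss) ? tmpl->mss
                                                     : (uint16_t)(total - off);
            emit_segment(tmpl->seq + off, payload + off, len);
            off += len;
        }
    }

    int main(void)
    {
        static uint8_t buf[64 * 1024];                  /* one big 64KB write */
        struct tcp_tso_template tmpl = { .seq = 1000, .mss = 1460 };
        tso_segment(&tmpl, buf, sizeof buf);            /* ~45 wire packets   */
        return 0;
    }

The receive-side offload Neterion is hinting at is essentially this loop run backwards, which is exactly why per-stream state, and therefore on-card RAM, becomes the limiting factor.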

This is a fascinating topic, and I can’t wait to see how this will turn out.