As mentioned earlier, Intel has been making noises about improving network I/O on PC servers. Today, at IDF, they released a few details on their plans. Apparently the presentation itself was good, but their web documentation is slim on details. Lennert Buytenhek summarized the important details, centering on the threading improvements:
[…] Rather than providing multiple hardware contexts in a processor like Hyper-Threading (HT) Technology from Intel, a single hardware context contains the network stack with multiple software-controlled threads. When a packet thread triggers a memory event a scheduler within the network stack selects an alternate packet thread and loads the CPU execution pipeline. Porcessing continues in the shadow of a memory access. […] Stall conditions, triggered by requests to slow memory devices, are nearly eliminated.
This isn’t exactly like the IXP2800, but there are some distinct similarities. In essence, it looks like Intel wants to provide the OS with the ability to task-switch on cache misses. I’m not sure that current OSes can switch threads much faster then the CPU can handle a cache miss, so this will be interesting to follow. I suspect that you could switch fast enough if you don’t touch the TLB or most of the CPU mode bits.
Intel also points out that with 10 GbE, just mitigating the effect of cache misses by processing multiple packets in parallel isn’t enough–packets actually arrive faster then the computer can fetch data from main memory–with 64 byte packets at 10 Gbps, a new packet arrives every 51.2 ns, which isn’t even long enough for a single main-memory access. According to Intel, normal packet processing requires 5 main memory reads. Intel’s fix for this is to add the ability to DMA directly into the CPU’s cache, and then add support for offloading memory copies onto the memory controller itself.
While Intel is aiming at improving network performance, I suspect that other types of processing may see big improvements from the planned changes. Video compression, for instance, can have horrible cache performance; I saw a study a while back that showed P4s running a MPEG-2 codec were averaging one instruction every 5 cycles during part of the processing, or way under 10% of what the CPU is capable of. A video codec that could compress several macroblocks at once, switching between them on cache misses, could easily see big speed boosts.