10G and Direct Cache Access

As some of you might know, I currently work with a client doing 10G network stuff. 10G as in 10 gigabit/second Ethernet. That’s a lot of data. It’s actually so much data that it’s hard even to generate network loads of this magnitude for good tests: a typical server using SATA hard drives hardly fills a one gigabit pipe due to “slow” I/O, as ordinary SATA drives don’t even reach 100MB/sec. You need RAID solutions or to put the entire thing in RAM first. Generating 10 gigabit network loads thus requires some extraordinary solutions.

Having a server that tries to “eat” a line speed 10G is a big challenge, and in fact we can’t do it: 1.25 GB/sec is just too much, even though we run a quad-core 3.00GHz Xeon thing here, which is at least near the best “off-the-shelf” CPU/server you can get at the moment. Of course, our software also does a little bit more with the data than just receiving it.
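To put those numbers in perspective, here is a rough back-of-the-envelope sketch. The figures come from the text above; the function names are made up for illustration, and the cycles-per-byte number assumes the work spreads perfectly over all four cores, which real software won't achieve.

```python
# Back-of-the-envelope figures for a 10 gigabit/second line rate.
LINE_RATE_BITS_PER_SEC = 10e9   # 10G Ethernet
SATA_BYTES_PER_SEC = 100e6      # optimistic ceiling for an ordinary SATA drive
CORES = 4                       # quad-core Xeon
CLOCK_HZ = 3.0e9                # 3.00GHz

def line_rate_bytes_per_sec(bits_per_sec):
    """Convert a line rate in bits/second to bytes/second."""
    return bits_per_sec / 8

def drives_needed(byte_rate, drive_rate):
    """How many drives streaming at drive_rate it takes to reach byte_rate."""
    return byte_rate / drive_rate

def cycles_per_byte(cores, clock_hz, byte_rate):
    """Idealized CPU cycle budget per received byte (perfect scaling assumed)."""
    return cores * clock_hz / byte_rate

if __name__ == "__main__":
    rate = line_rate_bytes_per_sec(LINE_RATE_BITS_PER_SEC)
    print("line rate: %.2f GB/sec" % (rate / 1e9))
    print("SATA drives needed: %.1f" % drives_needed(rate, SATA_BYTES_PER_SEC))
    print("cycles per byte: %.1f" % cycles_per_byte(CORES, CLOCK_HZ, rate))
```

Since the drives don’t actually reach 100MB/sec, the drive count is a lower bound, and a budget of under ten cycles per byte leaves very little room for doing anything with the data.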

Anyway, recently I’ve been experimenting with 10G cards from Myricom and when trying to maximize our performance with these beauties, I stumbled over the three-letter acronym DCA. Direct Cache Access. A terribly overused acronym consisting of often-used words makes it hard to research and learn about! But here’s a great document describing some of the gory details:

Direct Cache Access for High Bandwidth Network I/O

Summary: it is an Intel technology for delivering data directly into the CPU’s cache, to reduce the bandwidth requirement to memory (note: it only decreases the bandwidth requirement at that moment, not the total requirement as it still needs to be read from memory into the cache, as noted in a comment below). Using this technique it should be possible to drastically reduce the time the CPU spends waiting for received traffic to arrive in its cache. Support for this tech was added to the Linux kernel a while back.

It seems DCA is (only?) implemented in Intel’s 7300 chipset family which seems to only exist for Xeon 7300 and 7400. Too bad we don’t have one of these monsters so I haven’t been able to try this out for real yet…

Currently we can generate 10G network loads using two different approaches: one is uploading a specially crafted binary blob embedded with the FPGA image to a Xilinx-equipped board with a 10G MAC that then can do some fiddling with the packets (like increasing a counter) so that they aren’t all 100% identical. It makes a pretty good load test, even if the traffic isn’t at all shaped like the “real” traffic our product will receive. Our other approach has been even less good: upload a custom firmware to the network card and have that send the same Ethernet frame… This latter approach never got any better because making a really good generator out of it was a bit too complicated and badly documented. Even if I liked being able to upload custom code to my network card! ;-)

Allow me to also mention that the problem with generating 10G lies with small packet sizes, like 100 bytes or so, as the main bottleneck in the hardware seems to be the number of packets, not the payload part. Thus it is easier to do full line speed with 9000-byte packets (jumbo frames) than with the tiny ones we are likely to get when this product is in use by customers in the wild.
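A quick sketch of why the packet count is the problem. It assumes the sizes above are full Ethernet frame sizes and adds the standard 20 bytes each frame costs on the wire (7-byte preamble, 1-byte start delimiter, 12-byte inter-frame gap); the function name is just for illustration.

```python
# Packets per second needed to fill a 10G link at a given frame size.
LINE_RATE_BITS_PER_SEC = 10e9
# Per-frame cost on the wire beyond the frame itself:
# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap.
WIRE_OVERHEAD_BYTES = 20

def packets_per_second(frame_bytes, line_rate=LINE_RATE_BITS_PER_SEC):
    """Frames per second at full line rate, assuming frame_bytes is the
    whole Ethernet frame (headers and checksum included)."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return line_rate / bits_on_wire

if __name__ == "__main__":
    for size in (100, 1500, 9000):
        print("%5d bytes: %12.0f packets/sec" % (size, packets_per_second(size)))
```

Under these assumptions, 100-byte frames mean more than ten million packets per second, roughly 75 times as many as with 9000-byte jumbo frames, for the same number of bits on the wire.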

Update: this article was written in 2008. Please note that many things may have changed since then.

6 Responses to “10G and Direct Cache Access”

  1. bgoglin Says:

    DCA doesn’t really reduce the memory bandwidth requirements since the data still has to be fetched by the cache from the main memory (the device doesn’t write into the cache, it just tells the cache that data should be fetched). The whole point of the approach is that this fetch is done in advance, so you don’t have to wait for it when the host starts processing the packet.

    DCA is at least also supported on Xeon 5300 and 5400 based hosts with Intel 5000 chipset. I actually think most/all modern Xeon hosts support I/OAT including DCA. So if you can get the DMA engine in lspci, you should get DCA as well.

    However it is not always enabled in the BIOS. For instance, on all models of Dell Poweredge 2950 that we have, all I/OAT features are disabled by default. You have to enter the BIOS to enable I/OAT DMA engine, but you can’t even enable DCA from there. So you end up modifying the chipset registers manually to enable DCA. It works then, but it is annoying.

  2. daniel Says:

    Many thanks for these additional details/corrections, I’ll see if I can take advantage of this…

  3. sakamura Says:

    Quick question, if possible. I have searched, but could not find any information on how to modify these chipset registers to enable DCA. I got the exact same problem, and am running Linux. Any starting point to do this task?

  4. Dawid Says:

    Hi,

    I’d also appreciate info about enabling DCA, ’cause just by:
    1. enabling I/OAT DMA Engine in DELL PowerEdge 2950 BIOS
    2. installing appropriate drivers for Intel 5000 Series Chipset Integrated Device – 1A38 (http://downloadcenter.intel.com/detail_desc.aspx?agr=Y&DwnldID=12193)

    it seems to work. But according to page 4 of the Intel Processor-based Server Selection Guide (http://download.intel.com/products/processor/xeon/ssguide.pdf), enhanced features like Direct Cache Access (DCA) apply only to the Xeon 7400 Series.

    So, I’m a little bit lost here. How can I be sure that the server is actually using I/OAT or DCA with the PCI-E Intel PRO/1000 PT Dual Port Server Adapter (http://support.intel.com/support/network/adapter/1000ptdual)?
    Can you help me?

    Thanx in advance

  5. dlanthier72 Says:

    I have the same problem with a PowerEdge 1950 !

    How can we enable the DCA modifying the chipset registers manually ?

    Thanks !

    Dominique

    P.S. Please bgoglin respond …

  6. bgoglin Says:

    Please send me a email at my above login _at_ free.fr, I’ll send you some details and code.