Older blog entries for dreier (starting at number 48)

8 Dec 2010 »

Transition to Linode complete

I recently moved the VPS that hosts this blog from Slicehost to Linode. Both are very nice hosting providers that give you full control over a Xen virtual machine, including root access to the distribution of your choice and a slick web control panel, but right now at least, Linode gives you roughly twice the RAM as well as substantially more storage and bandwidth for the same price as Slicehost.

The main point of this post is really just to include my Linode referral link — if you’re going to sign up for Linode anyway, why not use my link and save me a few bucks on hosting?

Syndicated 2010-12-08 05:36:52 from Roland's Blog

7 Dec 2010 »

Two notes on IBoE

I want to mention two things about IBoE. (I’m using the term InfiniBand-over-Ethernet, or IBoE for short, for what the IBTA calls RoCE for reasons already discussed)

First, we merged IBoE support on mlx4 devices into the upstream kernel in 2.6.37-rc1, so IBoE will be in upstream kernel for the 2.6.37 release — one fewer reason to use OFED. (And by the way, we used the term IBoE in the kernel) The requisite libibverbs and libmlx4 patches are not merged yet, but I hope to get to that soon and release new versions of the userspace libraries with IBoE support.

Second, a while ago I promised to detail some of my specific critiques of the IBoE spec (more formally, “Annex A16: RDMA over Converged Ethernet (RoCE)” to the “InfiniBand Architecture Specification Volume 1 Release 1.2.1″; if you want to follow along at home, you can download a copy from the IBTA). So here are two places where I think it’s really obvious that the spec is a half-assed rush job, to the detriment of trying to create interoperable implementations. (Fortunately everyone will just copy what the Linux stack does if they don’t actually just reuse the code, but still it would have been nice if the people writing the standards had thought things through instead of letting us just make something up and hope it there are no corner cases that will bite us later)

The annex has this to say about address resolution in A16.5.1, “ADDRESS ASSIGNMENT AND RESOLUTION”:
The means for resolving a GID to a local port address (i.e. SMAC or DMAC) are outside the scope of this annex. It is assumed that standard Ethernet mechanisms, such as ARP or Neighbor Discovery are used to maintain an appropriate address cache for RoCE ports.

It’s easy to say that something is “outside the scope” but, uh, who else is going to specify how to turn an IB GID into an Ethernet address, if not the spec about how to run IB over Ethernet packets? And how could ARP conceivably be used, given that GIDs are 128-bit IPv6 addresses? If we’re supposed to use neighbor discovery, a little more guidance about how to coordinate the IPv6 stack and the IB stack might be helpful. In the current Linux code, we finesse all this by assuming that (unicast) GIDs are always local-scope IPv6 addresses with the Ethernet address encoded in them, so converting a GID to a MAC is trivial (cf rdma_get_ll_mac()).
This leads to the second glaring omission from the spec: nowhere are we told how to send multicast packets. The spec explicitly says that multicast should work in IBoE, but nowhere does it say how to map a multicast GID to the Ethernet address to use when sending to that MGID. In Linux we just used the standard mapping from multicast IPv6 addresses to multicast Ethernet addresses, but this is a completely arbitrary choice not supported by the spec at all.

You may hear people defending these omissions from the IBoE spec by saying that these things should be specified elsewhere or are out of scope for the IBTA. This is nonsense: who else is going to specify these things? In my opinion, what happened is simply that (for non-technical reasons) some members of the IBTA wanted to get a spec out very quickly, and this led to a process that was too short to produce a complete spec.

Syndicated 2010-12-07 06:33:48 from Roland's Blog

3 Jun 2010 »

Was it something I said?

I saw that OpenBSD 4.7 was released a couple of weeks ago. I tried to help, I really did.

I used to have a fanless 600MHz VIA system with a cheapie Airlink 101 Wi-Fi card that I used as a home wireless router. I ran OpenBSD on it for a few reasons — at the time I started, the OpenBSD wireless stack was ahead of Linux; their security obsession appealed to me; and not using Linux everywhere seemed like a fun thing to do. It all worked pretty well, except that the wireless interface sometimes got stuck while forwarding heavy traffic. For quite a while, I survived with hacks similar to this nutty crontab entry.

Eventually, though, I said to myself, “Self, you’re a kernel hacker. You should be able to fix this driver.” And indeed, after a couple of evenings of hacking, I figured out what was wrong and came up with a patch that improved things immensely for me. The problem was that the driver was not written with a system as slow as mine in mind, and it got confused if more than one interrupt happened before it got a chance to service the first interrupt — you can read the patch description for full details. Of course, being a good free software citizen, I sent my patch to the OpenBSD mailing lists so that it could be applied upstream.

Here’s where things went wrong. I never heard from the author of this driver — I got no reply when I reported the original bug, and no replies to any mail I sent about my patch. I did get several reports from other users who had the same problem and found that my patch fixed things for them as well, and finally another OpenBSD committer wrote, “Then if no one objects I’ll commit it tomorrow.“ Unfortunately, at this point the original driver author did seem to get interested — he sent private email to this committer (not copying the mailing list or me) objecting, and so we ended up with, “Objections were made. Apparently this patch only works for AP and does funky stuff to the hardware. So back to the drawing board on this one.“ As I said, all of my attempts to work directly with the driver author to find out what those objections were or how to improve the patch were ignored.

At this point I gave up on getting my patch upstream (and when I upgraded my wireless network to 802.11n, I chose a MIPS box running OpenWrt).

Syndicated 2010-06-03 21:42:27 from Roland's Blog

20 Apr 2010 »

Rocky roads

I saw that the InfiniBand Trade Association announced the “RDMA over Converged Ethernet (RoCE)” specification today. I’ve already discussed my thoughts on the underlying technology (although I have a bit more to say), so for now I just want to say that I really, truly hate the name they chose. There are at least two things that suck about the name:

Calling the technology “RDMA over” instead of “InfiniBand over” is overly vague and intentionally deceptive. We already have “RDMA over Ethernet” — except we’ve been calling it iWARP. Choosing “RoCE” is somewhat like talking about “Storage over Ethernet” instead of “Fibre Channel over Ethernet.” Sure, FCoE is storage over ethernet, but so is iSCSI. As for the intentionally deceptive part: I’ve been told that “InfiniBand” was left out of the name because the InfiniBand Trade Association felt that InfiniBand is viewed negatively in some of the markets they’re going after. What does that say about your marketing when you are running away from your own main trademark?
The term “Converged Ethernet” is also pretty meaningless. The actual technology has nothing to do with “converged” ethernet (whatever that is, exactly); the annex that was just release simply describes how to stick InfiniBand packets inside a MAC header and Ethernet FCS, so simply “Ethernet” would be more accurate. At least the “CE” part is an improvement over the previous try, “Converged Enhanced Ethernet” or “CEE”; not only does the technology have nothing to do with CEE either, “CEE” was an IBM-specific marketing term for what eventually became Data Center Bridging or “DCB.” (At Cisco we used to use the term “Data Center Ethernet” or “DCE”)

So both the “R” and the “CE” of “RoCE” aren’t very good choices. It would be a lot clearer and more intellectually honest if we could just call InfiniBand over Ethernet by its proper name: IBoE. And explaining the technology would be a bit simpler too, since the analogy with FCoE becomes a lot more explicit.

Syndicated 2010-04-19 23:16:42 from Roland's Blog

20 Nov 2009 »

First they laugh at you…

I found this article in “Network Computing” pretty interesting, although not exactly for the content. Just the framing of the whole article, with Microsoft is touting the fact that they’ve managed to achieve performance parity with Linux on some HPC benchmarks as an achievement (and putting up a graph that shows they are still at least a few percent behind), shows how dominant Linux is in HPC. Also, the article says:

The beta also reportedly includes optimizations for new processors and can deploy and manage up to 1,000 nodes.

So in other words Microsoft is stuck at the low end of the HPC market, only usable on small clusters.

Syndicated 2009-11-20 20:44:27 from Roland's Blog

12 Aug 2009 »

Poor choice of words

I just got a marketing email from drugstore.com with the subject “Diaper blowout: save on Pampers, Huggies, Seventh Generation and more.” If you’re not a parent and you don’t know why that’s unintentionally hilarious, just do a web search for “diaper blowout.” Suffice it to say that when placed next to the word “diaper,” the word “blowout” does not usually connote slashed prices.

Syndicated 2009-08-12 00:22:21 from Roland's Blog

10 Jun 2009 »

Lazyweb: best Verizon data card?

I currently have Verizon mobile data service with a Kyocera PC card, and it works well with recent distros using NetworkManager. However, my venerable laptop is being replaced with a Lenovo X200, which has no PC card slot, so I’ll have to replace my Verizon data card as well. According to the Verizon Wireless web site, my choices seem to be the NovaTel V740 for ExpressCard, or for USB the UTStarcom UM175 or the Novatel USB760.

My question for the lazyweb is: which data card/EV-DO modem should I get (assume that I’ll be running Linux 99.9% of the time when I use it)? The ExpressCard is substantially more expensive and less flexible (since I may want to use this card on a system without an ExpressCard slot someday), so I’d probably go with one of the USB cards if it’s left up to me. The USB760 doubles as a micro SD reader, which is not useful to me, and confounds things with a mass storage interface that probably just causes confusion, so my first choice would be the UM175 probably. However if someone with first-hand knowledge knows why that’s a bad decision, I’d love to hear about it in the comments.

(And I put a very high value in not having to boot into Windows periodically to update cell tower locations or anything like that, for what it’s worth)

Syndicated 2009-06-10 00:15:30 from Roland's Blog

26 Mar 2009 »

RDMA on Converged Ethernet

I recently read Andy Grover’s post about converged fabrics, and since I particupated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.

The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.” I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people. The IEEE has been developing a few standards they collectively refer to as “Data Center Bridging” (DCB) and that are also sometimes referred to as “Converged Enhanced Ethernet” (CEE). This refers to high speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec), plus new features. The main new features are:

Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
Enhanced Transmission Selection (802.1Qaz)
Congestion Notification (802.1Qau)

The first two features let an Ethernet link be split into multiple “virtual links” that operate pretty independently — bandwidth can be reserved for a given virtual link so that it can’t be starved, and by having per-virtual-link flow control, we can make sure certain traffic classes don’t overrun their buffers and avoid dropping packets. Then congestion notification means that we can tell senders to slow down to avoid congestion spreading caused by that flow control.

The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE). FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link. However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).

The real motivation for FCoE is to give a way for users to continue using all the FC storage they already have, while not requiring every server that wants to talk to the storage to have both a NIC and an FC HBA. With a gateway that’s easy to build an scale, legacy FC storage can be connected to an FCoE fabric, and now servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can talk to network and storage over one (Ethernet) wire.

Now, of course for servers that want to do RDMA, it makes sense that they want a triple-thread converged adapter that does Ethernet NIC, FCoE HBA, and RDMA. They way that people are running RDMA over Ethernet today is via iWARP, which runs an RDMA protocol layered on top of TCP. The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow. I see a number of problems with this idea.

First, one of the big reasons given for wanting to use InfiniBand on Ethernet instead of iWARP is that it’s the fastest path forward. The argument is, “we just scribble down a spec, and everyone can ship it easily.” That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in all 10 gigE NICs, iWARP and non-iWARP alike). And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE; FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?) it turns out that the standards work has taken at least 3 years, and a final spec is still not done. I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.

Andy points at a TOE page to say why running TCP on an iWARP NIC sucks. But when I look at that page, pretty much all the issues are still there with running the IB transport on a NIC. Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):

Security updates: yup, still there for IB
Point-in-time solution: yup, same for IB
Different network behavior: a hundred times worse if you’re running IB instead of TCP
Performance: yup
Hardware-specific limits: yup

And so on…

Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features. But worse is better and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforseen problems to solve.

Syndicated 2009-03-26 04:11:08 from Roland's Blog

20 Feb 2009 »

Know anyone at Coverity?

The recent mention of scan.coverity.com at lwn.net reminded me that the Coverity results for the kernel (what they call “linux-2.6″) have become pretty useless lately. The number of “results” that their checker produce jumped by a factor of 10 a month or so ago, with all of the new results apparently warning about nonsensical things. For example, CID 8429 is a warning about a resource leak, where the code is:

      req = kzalloc(sizeof *req, GFP_KERNEL);
      if (!req)
              return -ENOMEM;

and the checker thinks that req can be leaked here if we hit the return statement.

The reason for this seems to be that the checker is run with all config options enabled (which is sensible to get maximum code coverage), and in particular it seems to be because the config variable CONFIG_PROFILE_ALL_BRANCHES is enabled, which leads to a complex C macro redefininition of if() that fatally confuses the scanner.

I’ve sent email to scan-admin about this but not gotten any reply (or had any effect on the scan). So I’m appealing to the lazyweb to find someone at Coverity who can fix this and make the scanner useful for the kernel again; having nine-tenths or more of the results be false positives makes it really hard to use the current scans. What needs to be done to fix this is simple to make sure CONFIG_PROFILE_ALL_BRANCHES is not set; in fact it may be a good idea to set CONFIG_TRACE_BRANCH_PROFILING to n as well, since enabling that option causes all if statements annotated with likely() or unlikely to be obfuscated by a complex macro, which will probably lead to a similar level of false positives.

Syndicated 2009-02-20 06:38:54 from Roland's Blog

1 Dec 2008 »

Signs you may not be dealing with a straight shooter

A play in two scenes.

DRAMATIS PERSONAE