
14 Jan 2011 (updated 10 Feb 2011 at 03:09 UTC)

bufferbloat vs. wireless networks, and other stories

It seems my previous article on bufferbloat struck a chord with some people. Most people, quite rightly, didn't know what to do with it and thought it was super confusing. (It was. It was rambly and poorly edited, and reading the responses mostly reminded me how disorganized all my thoughts on the topic were. Oops. It turns out disappearing into a hole and studying something for a month straight can quickly make you lose perspective on what the average person does/doesn't know about TCP/IP and traffic shaping. Sorry about that.)

On the other hand, some people actually think there's useful information hidden in there, which there is. Other people thought I was just wrong, which I'm not; that article, confusing as it was, isn't *conjecture*, it's a disjointed, disorganized, overly-summarized report of my actual careful experiments and observations.

Now, vague inexact reports of my careful observations are just as useful to you as totally wrong conjectures, so I apologize for that. Perhaps I'll try again later.

Meanwhile, I do want to correct a couple of misconceptions in particular, just in case people *did* believe me, because there are some things I *wasn't* claiming, since they aren't true and/or I don't know if they're true.

Wireless networks have problems, but they are not these problems

My article was specifically about a very simple network topology that looks like this:

  computers -> wires -> router -> DSL/cable/etc -> internet.

I didn't talk about wireless anywhere in the article, somewhat on purpose, because wireless has its own problems and those new problems confound your ability to do straightforward experiments to see the original router problems I was talking about.

Now, it's safe to say that if your wireless speed is much much more than your router speed, and all your wireless traffic is internet traffic (ie. not talking to other wireless devices on your LAN), then everything in my article remains true. Any wireless problems are dwarfed by problems caused by your router's crappy queue, and can be solved by doing traffic shaping correctly.

TCP is smart, so it knows not to send data any faster than your router can handle. Since that's much slower than your local wireless network, the only queue that ever fills up is the one at the bottleneck: at your router or the ISP's router on the other end. All the boxes on your WLAN will have essentially empty queues, so it doesn't matter how their maximum queue length is configured.

However, let's imagine that this isn't the case on your network; nowadays you might have 20 megabit cable, or 50 megabit FIOS, or whatever, which means your Internet connection is actually faster than your wireless network. Or maybe your Internet connection *is* a long range shared wireless network, in which case your wireless speed is obviously not much much greater than your router's speed, as we had been assuming. Or maybe you're just connecting two computers together on your LAN, with no router involved at all. What happens then?

What happens then is, well, a whole different story than I had been talking about. And so of course the solution will have to be totally different.

Now, as it happens, I've been thinking about creating a new startup to build the ultimate wireless router - one with built in traffic shaping *that works out of the box*. So while I don't have as much concrete test data for wireless as I do for wired routers, I do have quite a bit of information that comes from studying the situation in theory and a bit of practice.

Here are some things you need to know:

First, various operating systems ship oversized transmit buffers in their wireless card drivers (and in some cases in their wired ethernet drivers too). Since your internet router is no longer the bottleneck, suddenly the *client machine's* buffers are what matter, and those buffers are oversized. So yeah, if you start a big transfer, your interactive performance will suck. Solution: reduce the buffer size. Or install traffic shaping on your client machine, but that's a bit silly since it's your own LAN, so you should just fix the buffer size.
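
On Linux, for example, the transmit queue length is just a per-interface setting you can shrink at runtime. A sketch of what I mean - wlan0 and the number 16 are illustrative, and the right value depends on your link speed:

  # shrink the driver's transmit queue on the wireless interface (Linux).
  # 16 packets is illustrative, not a recommendation; tune it for your link.
  ip link set dev wlan0 txqueuelen 16

  # or, with the old-style tools:
  ifconfig wlan0 txqueuelen 16

  # check the current value (look for "qlen"):
  ip link show dev wlan0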

If you're downloading from a really dumb local server that has oversized transmit queues, same problem. You should get the server admin to lower the queue size. I guess your admin might be mean and refuse to do that, in which case you can improve your personal interactive performance when communicating with that server by installing traffic *policing* (downstream) on your client, but that's gross and fiddly.

Incidentally, traffic shaping/policing on a wireless network is totally disgusting because the available bandwidth is *variable*. How do you set the rates to 99%/80% of the bandwidth when the bandwidth keeps changing? You can't, at least not without having a direct line to the network driver to find out the current bitrate. And anyway, it's all stupid, because the right thing to do is simply lower the driver's tx queue size. Do that.

Next, rumour has it that if one machine starts blasting data all over your wireless network at top speed, this will hurt performance of other machines on the same WLAN. I personally have never experienced this, so warning: I'm now talking theory, not experimental evidence. But if it were to happen, I claim that this can't *possibly* have anything to do with the queue size of any of your devices. Why? Because no matter how big its queue, the wireless device on any computer can't send faster than its maximum transmit rate. A shorter or longer queue affects latency *on the machine with the queue*, and *on any machine forwarding through that machine*, but to everyone else, it just looks like the same packets coming out at the same speed.

(If your problem is that one machine on your wireless network starts blasting tons of data and overwhelms your router, well, yeah. Your router queue is too big. But if it overwhelms your router, we're in the case we talked about last time, not the current case where your router is too fast to ever be overwhelmed.)

Now, just because the problem isn't caused by your queue length doesn't mean I'm claiming it doesn't exist. There are lots of reasons the problem might exist, *especially* on a wireless network.

First of all, 802.11 wireless networks retransmit packets when they get dropped. This sounds obvious - doesn't everybody just retransmit stuff? - but no, it's *very strange*. In fact, ethernet *doesn't* retransmit stuff. If it sends a packet and the packet is lost due to noise or because the receiver has no buffer space, then tough. The packet is lost. It's TCP's job to retransmit it.

802.11 wireless networks don't work the same way. The reason is this: TCP *depends* on information about how many packets are lost and when. TCP assumes it's on a wired network, which is generally reliable, so when a packet is dropped, it's almost always because the receiver (or actually, any router along the path) has no more buffers. If that's true, it means the router is overwhelmed, so TCP needs to slow down... which it does. And that's how the whole internet works! That's why you don't need to know the speed of *my* DSL modem in order to send data to *my* DSL modem at exactly the speed my DSL modem can tolerate; no more, and no less. It's very cool.

Anyway, wireless networks are way less reliable than wired ones, because wireless networks are much more susceptible to random noise. So packets get dropped all the time, and it's not because you're sending too fast, it's because you're unlucky. Thus, the 802.11 standard calls for devices to retransmit packets that were lost at the physical layer. They can still be dropped by the receiving device, but only *after* they've been acknowledged by 802.11, which means the transmitter knows you got them. So in 802.11, there are two kinds of packet loss: the kind you're supposed to fix (noise), and the kind you aren't (buffer full).

Generally it's the driver's job to do the 802.11 retransmits. This raises some awkward questions: how many times do you retry transmitting? How long between retransmits? What if the noise was caused by someone else's retransmits, and we get into lockstep, constantly ruining each other's packets? On the receiving end, how and when do we report which kind of receive failure is which?

Answer: it's complicated. But it's obviously easy to do wrong. I heard a rumour that some Linux drivers will retry the same packet up to 256 times in a row if it doesn't work; that seems rather desperate to me. I can also imagine bugs where a driver, finding the kernel's receive queue is full, reports the 802.11 packet as corrupted ("please retransmit") rather than as delivered-but-then-dropped. If it did that, it would utterly destroy TCP's behaviour and potentially clog the network really badly. Maybe this bug exists; I have no idea, I've only heard rumours. If it does, then it would surely be a problem. It would also have nothing to do with queue length, and could not be solved by any form of traffic shaping.

Which brings up another really important point: the whole Internet actually only works at all because of a truly amazing level of voluntary cooperation. Anybody anywhere could deliberately mis-tune their TCP layer and get faster downloads and uploads than anybody else, at the cost of messing up things for everybody else. Similarly, a WLAN driver that mis-reported dropped packets could get much higher thruput than anybody else on the WLAN, at the cost of screwing over everybody else.

TCP is kind of like the "prisoner's dilemma" - as long as we all cooperate, we all win, but if any of us defects, we might as well all defect, because the party is over. And as with experimental tests of the "iterated prisoner's dilemma" in real life (where you play the game with your partners over and over, rather than just once), we have mostly converged on not cheating. Because if we didn't, someone else would cheat too, and the party would be over.

Overly long transmit queues are not defection, by the way; they're just stupid. You don't get better performance at other people's expense by using overly long queues. You just give yourself worse performance at your own expense. Nice work, genius.

So that's not this. And to repeat myself, queue length is also not what would cause one wireless device to steal bandwidth from another wireless device. 802.11 retransmit problems? Yeah, that could cause problems all right.

Now, even if we assume our 802.11 drivers don't have bugs (other than queue length, which doesn't screw anybody but the perpetrator), which might even be the case, there are *still* some obscure reasons that one station's WLAN traffic might stomp on another's. Here are a couple that I know about; there may be more.

1) Ethernet-style collision detection (which is used, sort of, sometimes, in 802.11) is prone to timing problems; the more people you have on your network, and the faster the network, the more collisions you'll have. Background: in original 10MBit ethernet, wire lengths were limited so that the first byte of a packet A would reach all nodes by the time the last byte had been sent out. It was still possible that a second node would start transmitting its own packet B between the first node starting to send A and the second node starting to receive A; however, the timing was arranged so that both A and B would know by the end that they had stomped on each other, so they could wait for a random delay and then retransmit.

Unfortunately, with 100MBit and faster networks, this collision detection got a lot worse, because the same packet was 1/10th the "length" (if you think of a packet travelling through a wire as a wave moving at the speed of light, the distance from start to end is the "length" of the packet), and that length was so short that it wasn't really reasonable to require nodes to be physically so close together. That's why hubs were sensible in the 10MBit days, but rapidly fell into disuse in the 100MBit days in favour of switches, which don't need collision detection because they just give every node its own wire.

With wireless, you have no choice: everybody is on the same "wire", ie. the airwaves. You can sort of help by using multiple channels, which 802.11 does, half-heartedly. (There are only 3 non-overlapping channels in 802.11's North American standard, at least, which means three packets at a time in an optimistically configured topology.) There are interesting mathematical improvements on this (eg. frequency hopping and ultrawideband transmission), but 802.11 doesn't use them; cell phone networks do, nowadays, because they have so darned many people using them at once that nothing else would suffice.

*Anyway*, with 802.11, you're back to old-fashioned collision detection, only now you have outside noise, uncontrolled distances, random signal reflection, *and* fast transmit rates. It's awful, and frankly, I'm amazed it works at all.

The collision detection algorithms have improved somewhat since the ethernet CSMA/CD (carrier sense multiple access with collision detection) days, but not much. And the various 802.11 standards - I haven't read them all yet - seem to define multiple different ways around the problem. One of them is actually, ha ha, a token-ring-like scheme: the wireless access point assigns timeslots to each node, and each node is only allowed to transmit during its timeslot, thus avoiding the possibility of collisions (while also bringing back all the bad stuff about token ring that killed it in the first place). I'm a bit fuzzy on when this mode is enabled and when it's not; I think it's mostly not, so this is more academic than useful.

So anyway, collision detection is hard. And the more you transmit, the more collisions you'll have. The more different nodes you have transmitting, the more collisions you have. The more your driver's auto-retransmit algorithm is broken, the *vastly* more collisions you'll have. And so on. One busy station with a wrong retransmit algorithm could really ruin it for everyone. Of course, like I said, I have no evidence that such a thing exists; I've just heard rumours.

2) This one is even more insidious. It's called the "hidden node problem." It works like this: station A can talk to the access point, and station B can talk to the access point, but A and B are on opposite ends of the building, so they're too far away to see *each other*. So let's say they both decide to start blasting data at the access point at the same time at full speed.

No matter what collision detection method they try to use, they'll never know that every single one of their packets is colliding with the other guy! As far as the access point is concerned, it's seeing a nonsense combination of signals from A and B. As far as A and B are concerned, all is well,1 except for some reason the access point isn't acknowledging any of their packets, so they have to retransmit.

If you're running a long-range wireless network, your access point probably has a really great well-positioned antenna, and the other stations probably don't. That means you probably have the hidden node problem in a *big* way. That means tons of horrible unexplained packet loss, which means tons of retransmits. And, rumour has it, those retransmits might be compounded by buggy retransmission algorithms. As the operator of such a network, your life, in short, sucks big time.

I don't have any experience running long-range multi-user wireless networks, although I have fantasized about doing so, in what now appears to be a rather obsessive-compulsive level of detail. And the hidden node problem is where my fantasies turn into nightmares.

I don't know the solution to this one. I do know three things that *might* be potential solutions, but they'd need to be thoroughly tested:

a) That 802.11 timeslot mechanism I mentioned above, if it actually was turned on and supported. (I read about it in the spec, and got the distinct impression that it was a great idea that wouldn't do me any good because I couldn't use it. But I now don't remember why. I think it's part of the 802.11 QoS spec, and even if routers support that, none of the standard workstation drivers do.)

b) Some projects (frottle, WiCCP) that actually do timeslicing at the user level. At least frottle has some user reports that say it dramatically improved results. However, it requires *everyone* on your network to run special software; if anybody isn't, they'll probably stomp on all the packets from everybody else.

c) A slightly insane specialized application of traffic policing at the access point. This requires that everybody talks only to the access point directly, never to other nodes on the WLAN, which is a pretty safe assumption since the hidden node problem means they can't see each other anyhow. Basically, if you do things exactly right, you should be able to slow down everyone's traffic so they all get 1/n of the total bandwidth, and you should be able to reduce traffic burstiness so those packets mostly statistically end up not colliding. If it worked, it would be a totally amazing hack, because you wouldn't have to fix the software on any of the stations, and you wouldn't have to use explicit timeslots; the timeslots would be implicit based on your traffic policing algorithm.
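
For concreteness, here's roughly how you might prototype that kind of per-station policing with Linux tc on the access point. This is a sketch, not a recipe: the interface name, station address, and rates are all made up, and as I keep saying, picking the right numbers is the entire problem.

  # on the access point: attach an ingress qdisc to the WLAN interface
  tc qdisc add dev wlan0 handle ffff: ingress

  # police traffic arriving from one station down to its 1/n share
  # (repeat a filter like this per station; address and rate illustrative)
  tc filter add dev wlan0 parent ffff: protocol ip u32 \
      match ip src 192.168.1.10/32 \
      police rate 2mbit burst 20k drop flowid :1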

And no, the transmit queue length has no bearing on this problem. In fact, somewhat entertainingly, activating policing on the access point's WLAN link could actually force the transmit queues to be shorter (since the TCP window size would be shorter since policing is dropping so many packets), thus solving that problem too.

Assorted other clarifications about my last article

While we're here, note that:

  • the claims in my previous article don't depend on bandwidth; they are true at fast and slow speeds, you just use different constants when doing the math.
  • 256 kbits (with excessive buffers) is a fun example because it makes all the problems worse; you can't half-ass your traffic shaping or it just won't work. That means experiments are easier on that link than on a *good* link, which is why I took advantage of it during my sublet back in September.
  • My claim of 250ms of latency under high load needs explanation: that 250ms is configurable and asymmetric. Basically, all the latency was *downstream*, because incoming traffic (policing) is much harder to control than outgoing traffic (shaping). And you can lower the latency by lowering your downstream bandwidth limit, or raise the latency by raising the downstream bandwidth limit; it's a tradeoff. I talked about 250ms and 80% downstream because that was the tradeoff that worked for *me* with 10 sessions and Skype on a 256k link.
  • I "dialed in" 250ms because that's about the limit of bearability for interactive ssh traffic. No other reason. I could have picked basically any number.

I doubt all that is very helpful. But that's the best I can do for now.

Footnote

1 You can simulate the hidden node problem at home while doing the dishes. Go stand by the kitchen sink and turn on the tap. Now talk at a normal volume to someone across the room, and have them talk back to you at a normal volume. To them, you'll be audible; what they hear is about a 1:1 ratio of tap noise to voice, which is good enough to understand. But to you, they'll be inaudible; by the time their voice reaches you, it has decayed a lot (ie. by the square of the distance between you and them), but the water noise is full blast, totally overwhelming their voice.

Syndicated 2011-01-14 04:33:37 (Updated 2011-02-10 03:09:25) from apenwarr - Business is Programming

bufferbloat (and a Net Neutrality rant in the footnotes)

Jim Gettys writes about "bufferbloat", the tendency of operating systems to keep increasing network buffer sizes so that now, with Internet links faster than ever, browsing often seems slower than ever.

His series of articles is a great introduction to the concept that I've been ranting to people about for years. Usually their eyes glaze over. What can you do? But his article is a bit more beginner-level, so maybe you'll have some idea what he's talking about.

This article, on the other hand, you probably won't.

Back in September, I was "lucky" enough to live for a month in someone else's apartment - and that someone was a cheapskate, so they subscribed to the cheapest possible Telus DSL plan, 256kbits/sec (symmetrical, for a change!).

On that connection, which is at least 4x as fast as single-channel ISDN or my old 56k modem (on which I used to multitask like crazy), it was totally impossible to do two things at a time. Loading a single web page totally killed interactive performance, to the point where we would have 8+ second ping times if someone had Google Reader open in a web browser tab. That same Google Reader that killed everything didn't even *work*, because it opened multiple sessions at once, and I think something in there somewhere must have had a 5-second timeout. Hilarity ensued.

Obviously, I concluded, our ISP has hard coded their DSL buffering based on the idea that we have a fast modem, even though we don't. At 6 megabits (about 23 times faster), that maximum 8-second delay would have been only 348 milliseconds... a mostly acceptable latency.

(Even if you have a fast DSL link, you still have the problem; most ISPs configure their DSL connections with symmetrical buffering, ie. the same sized buffer at each end. But even a 6 megabit downstream DSL link usually has a much lower upstream bandwidth - a megabit if you're lucky, or much less if you're not - so the same buffer size means a much higher delay on upload than on download. That's why on a normal link, *uploading* stuff kills performance much faster than *downloading* stuff. It's just misconfigured buffers, pure and simple.)

The bad news is that by the time you get into Jim's third article in this series, he makes analyzing it all seem like rocket science. Bufferbloat is not rocket science at all. It's one single misconfigured number on virtually every DSL router (and now most cable modems too; technology apparently gets worse over time) everywhere. And unfortunately, it's often at both ends, yours and the ISP's. And even when it's your end, you often can't reconfigure your DSL modem because it's locked down.

I heard a rumour once that ISPs do this on purpose: using a bigger buffer lets you squeeze a couple of percentage points more bandwidth in stupid reviewers' benchmarks, and reviewers "of course" only ever download one thing at a time. You can evaluate bandwidth objectively, but lagginess during a download is mostly subjective. So perversely, an ISP that lowered their buffer size to make their connections suck less would lose out in third-party comparisons. Sigh. Having had my own products reviewed by morons in the past, I can't even fault them if that's the case.

On the other hand, the ISP here had no incentive to tune our cheapest-you-can-go 256kbit DSL link. In fact, they had an incentive to make it suck as much as possible, so we'd stop being such incredible cheapskates and actually buy one of their bigger, profitable plans.

So anyway, with all that suck out of the way, I have a message for the world:

You can fix it, and you don't need your ISP to not screw it up.

The bad news is that what you do need is a miracle :) Just kidding. Sort of. You don't technically need a miracle, but you need to figure out Linux's tc (traffic control) tool, which is about as close as you can get.

Let me jump ahead and say that, with a lot of fiddling back in September - let's just say I spent *most* of my September learning about this - I managed to get my 256 kbit connection to upload/download 10 files simultaneously while having a flawless Skype call, and my ping latency to Google never exceeded 250 ms. Were the downloads fast? Heck no, 256 kbits still sucks. But they were each about (my_bandwidth-skype)/10, which is to say, as good as it gets.

In case you don't already know, what you need is to insert another router (if you don't already have one) in between your home network and your DSL modem, and enable something called traffic shaping. On a typical DSL modem, you mostly only need to care about upstream traffic shaping, since that's the one that normally kills your performance (especially with BitTorrent); if you're picky, like me, you might also play with downstream shaping (called "policing" since it works totally differently), though the Linux code for that is not nearly as complete as the upstream stuff.

Here are some things I learned:

First of all, about 90% of traffic shaping setups that people try are absolute disasters. They make things much, much worse than nothing at all. It's a bit of a black art. It actually *shouldn't* be a black art, but there's something a bit funny about the whole department; everybody doing it seems to have taken user interface lessons from Cisco. The Linux 'tc' tools (kernel and userspace) were written by some kind of super genius in Russia, which is great as far as performance goes, but absolute death in terms of usability and documentation. I mean, the help message from the thing is literally a dump of its BNF grammar. If you're a Russian super genius, that's probably all you need. If you're me, you cry.

Given that, the terrible state of the documentation - all written by other people, Russian super geniuses don't need no stinking documentation - is actually kind of forgivable. Except in the cases where they obviously haven't even tried their own advice. Oh, sure, the sample scripts all *run* without errors, but they just plain don't do the things they claim to do. I'll get to that in a minute.

So, ok. Super complex UI, lots of fiddly magic numbers, and no documentation. You're sort of doomed. I know, I'm sorry. That's why this article is more of a rant than a howto. I won't keep you in suspense: you're only halfway through this article, but it doesn't get any better. Feel free to stop now before you get as depressed as I am.

BTW, all the "user friendly traffic shaping" scripts I could find make horrible mistakes in configuration and do lots of weird things. If anyone remembers the Nitix "Active Queue Management" feature and that it would do insane things sometimes... now I know why. Because we didn't spend enough time figuring out what the heck those scripts were doing, and realizing that they're doing it wrong.

Actual experimental observations

With upstream shaping, you can set it to about 99% of your upstream bandwidth and life is good. This makes sense, of course, because some traffic can jump straight to the head of the queue, so you don't really have latency concerns. (By the way, with an idealized perfect queue, there's no reason to ever limit the queue size. But such a queue doesn't exist.)
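
For the record, the upstream side really can be as dumb as a single plain TBF (which, as you'll see below, is fine on its own). Something along these lines, where eth0 faces the DSL modem and 253kbit is roughly 99% of my 256kbit uplink - your numbers will differ:

  # shape outgoing traffic to just under the uplink speed, so the queue
  # builds up here (where we control it) instead of inside the DSL modem.
  # rates and burst are illustrative, not a recommendation.
  tc qdisc add dev eth0 root tbf rate 253kbit burst 1540 latency 50ms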

With downstream policing, you have to cut bandwidth usage to 80% or less, or the jitter will kill your latency "sometimes." (This might not matter as much on faster-than-256k links. At 256k, even three full-sized packets enqueued at once can cause 140ms of momentary delay. If you have ten downloads at a time, the *average* case is about 10 packets at once, or 458ms, and it gets worse from there.) So this 80% number isn't really fixed; it varies based on how many connections you have going at once.
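
Downstream policing looks something like this - again a sketch, where eth0 is the interface facing the modem and 200kbit is roughly 80% of a 256kbit link:

  # police incoming traffic to ~80% of the downstream rate by dropping
  # whatever arrives faster than that (rates illustrative)
  tc qdisc add dev eth0 handle ffff: ingress
  tc filter add dev eth0 parent ffff: protocol ip u32 \
      match u32 0 0 \
      police rate 200kbit burst 10k drop flowid :1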

If you're massively exceeding your allotted bandwidth (that is, always, if you have a super-slow 256k link like mine, or if you have a normal link and you're doing a lot of torrenting/etc) then enabling ECN on all your client machines, and using an active queue like RED on your router, can make a huge difference versus no ECN.

ECN's biggest effect is in improving interactive traffic when there are high-bandwidth connections around (interactive TCP responds *very very badly* to packet drops, since it doesn't have any other packets outstanding, so it doesn't get the instant re-ACKs caused by later packets... long story, but losing a couple of packets can cause multi-second delays on interactive traffic, and only millisecond delays on busy sessions). The good news is that this means you can enable ECN on *just* your own client that you use for interactive traffic, and your results are improved, even if you don't want to ECN-enable every single computer on your network. (This isn't cheating; you aren't hurting anybody else's performance by turning on ECN for only one machine. Like magic, ECN is just always an improvement, and always fair. Nowadays, chances are that the reasons ECN is off by default don't apply to you anymore; try turning it on and feel the love.)
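
Turning on ECN on a Linux client is a one-liner, by the way. (I'm deliberately not giving you RED parameters for the router side; as I say below, a mis-tuned RED queue is easy to produce and very destructive, so treat that part as its own project.)

  # enable ECN for TCP on a Linux client (affects new connections)
  sysctl -w net.ipv4.tcp_ecn=1

  # or make it permanent:
  echo 'net.ipv4.tcp_ecn = 1' >>/etc/sysctl.conf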

If you do traffic shaping/policing properly, there is little need to actually *prioritize* traffic; just letting all the streams share fairly is usually enough. I was able to have a Skype call just fine on my 256k link with 5 downloads going at once with traffic speed limiting but no prioritization. With 10 downloads, 1/11th of my bandwidth was just too small to get good sound quality, so prioritization was the only way. But come on, if you have a 10 megabit link, do you *really* think that's your problem? Plus, you can't do prioritization of your downstream - that's something only the ISP can do, and they won't, for very good reasons.1 So you're lucky it turns out not to matter.

So the rule of thumb? Get your basic queuing right *first*, and worry about prioritization *second*. A lot of people get those two mixed up. Prioritization is worthless when your queue is doing stupid things.

Next, beyond simple idiotic oversized DSL buffers, the *real* problem with BitTorrent is that it uses so many simultaneous TCP connections. If you have 50 of them at once going at full speed, everything else will get 1/50 of the bandwidth at most. As far as I know, nobody has ever found a really good trick for protecting against this. (You could group traffic by IP instead of by IP:port, but that would penalize anybody behind a NAT, so it's no good.)

Linux's traffic policing (ingress) layer can't do ECN. This is a terrible, terrible oversight. Someday I may try to fix it, if someone doesn't beat me to it.

Some people suggest shaping the *outgoing* queue on your *local* interface as an alternative to policing the *incoming* queue on the *remote* interface. Same thing, right? Packets still get dropped when they exceed the bandwidth, but now you can use all the fancy queue types, including RED, which allows you to use ECN for incoming traffic! Except this setup results in total crap; I wonder sometimes if the people writing these howtos even try their own advice. When receiving packets from a slow link, you've already paid the latency costs; if you now create a queue before sending the packets onward locally, you're just adding a bunch more latency. And the quality of (for example) your RED tagging varies proportionately with the added latency; if you keep the useless queue short, the policing quality drops because the algorithms aren't made for that. A much better solution would be a policing ingress queue that would "virtually" drop packets using ECN when possible.

I think you could make a "virtual RED queue" that doesn't actually delay or even store packets - RED isn't about delaying packets anyhow, it's about dropping them before the queue is full, and in fact it's better to ECN tag them instead of dropping them, so why not virtually drop them when your virtual queue is virtually full? It would be about as good as really dropping them from a real queue that's really full, but with no added latency and no actual packet loss. Someone could probably write a Ph.D. thesis about this, which I'm guessing is also why nobody has implemented it.

Note that regardless of the above, an *outgoing* queue used for traffic shaping does *not* increase latency. It always releases the packets (just slightly less than) as fast as they can go. So even though the queue usually has a bunch of stuff in it, there would have been a queue with stuff waiting in it anyhow; traffic shaping just makes it smarter. This fundamental difference is why using your outgoing queue algorithms for ingress traffic totally doesn't work.

TBF vs. SFQ

Whatever you do, don't try to combine TBF (token bucket filter) with SFQ (stochastic fair queuing), because the end result is dramatically bad. Unfortunately, some of the Linux howtos actually show you how to set this up as one of their examples! But it's fatally flawed.

A basic non-prioritizing TBF will simply drop the next packet if there's been too much data lately; statistically speaking, the busiest connections have the most packets, so they're the most likely ones to have a packet dropped. Interactive sessions, with just one packet in a blue moon, have a very low probability of dropping a packet. Pretty good, right?

That's TBF by itself. I mean, it's not perfect: even my 1/100 speed interactive session will get screwed for 1/100 of the packet drops, even if my busy connection is 99/100 packets. What we really want is for *all* the packet drops to punish the high-bandwidth guy. Still, it's pretty good considering how simple it is, and being wrong 1% of the time isn't so bad.

Now, let's look at SFQ, which is a very clever invention. SFQ works by sorting each session into one of, say, 100 "bins." Then it feeds the first packet in each bin to the output queue, looping through the bins in sequence. When a new packet arrives, it sorts it into a bin, and drops the packet only if *that* bin is full, but other packets, in other bins, don't get dropped. The trick is that packets are sorted into bins based on the hash of their (srcip,srcport,dstip,dstport) tuple, so on average, each connection gets its own bin. Assuming there are a lot more bins than connections, the end result is that you *never* sacrifice your 1/100 interactive session by accident; the only packets that ever drop are the 99/100 busy session, exactly the guy you want to slow down. Sweet! And it really works, too... if your network device never drops packets from its own queue, which is a safe assumption for physical devices.
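
(For reference, using SFQ by itself on a physical interface is a one-liner; "perturb 10" just rehashes the bins every 10 seconds so two sessions don't stay stuck sharing a bin forever. The interface name is illustrative.)

  tc qdisc add dev eth0 root sfq perturb 10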

Now let's hook SFQ to a TBF instead of a physical device. SFQ cleans things up, sorting the packets in the order A,B,A,B,A,B (where A is your interactive session, and B is your high-bandwidth session). It feeds them into TBF, which queues, say, 10 packets at a time. When the tokens run out, it drops a random one of the 10 packets, and lets in a new one from the SFQ. Which one will it choose randomly? Well, given that SFQ has inserted a whole bunch of fairness into the queue and the packets are no longer bursty... your interactive session has about a 50% chance of having its packet dropped by TBF.

...blink...

OH MY GOD IT'S SO TERRIBLE.

And I'm not just making this up. I tried it. I really did. I gathered statistics. I analyzed packet dumps. I learned what all the crazy magic numbers did (and tc has a *lot* of crazy magic numbers). And the conclusion was: it works just like I said just now. And it sure does *completely* tank your interactive performance. TBF by itself is actually okay (1% error), and SFQ is okay (~0% error) if that's your problem, but TBF+SFQ is never, never okay (50% error).

Two rights make a wrong? That's a new one.

Now, you could probably fix this by writing a different sort of TBF: one that accepts packets into its queue on a particular schedule, but assumes the guy feeding it packets will be the one doing the dropping; my alternate-universe-TBF would never drop packets itself. This parallels my "virtual RED queue" suggestion up above, but beware that SFQ+RED is probably just as bad as SFQ+TBF.

By the way, just putting things in the opposite order - TBF then SFQ - doesn't work either. The SFQ always drains instantly, and without packets in the queue, it can't add any fairness.

Oh, and also, if you think about connecting PRIO (basic prioritization, ie. VoIP vs. bulk) to TBF... you get exactly the same problem. TBF will end up penalizing high-priority packets even more than it would without PRIO! Which is why some people find that enabling packet prioritization makes their Internet performance worse.

No, HTB is not the answer!

And before you say something, yes, I've read all the notes that HTB (hierarchical token bucket) is the newfangled big thing that will solve all your problems.

No it won't.

HTB is probably better than CBQ, the thing it was designed to replace, but neither of them actually solves any of the problems I'm writing about here. As soon as people start talking about traffic shaping, they always want to start talking about restricting one stream to 1 megabit, while the other one gets 5 megabits, but then we allow short bursts of xyz for not longer than abc every pqr, and then we subdivide it like this...

It's all wankery. You're sitting at home, your Internet sucks. You do not need to do any of those things. Moreover, even if you had the problems those people are talking about, you would still have the latency problems *I'm* talking about, and HTB gets you 0% closer to solving them. HTB is good if you need it. But chances are, you absolutely don't.

The real problem with traffic shaping...

...is that it's just too fiddly. The truth of the matter is that it's really just a few little numbers - really simple numbers, integers, like the maximum number of packets in a queue - that make a huge difference. Coming up with this small set of numbers requires a surprising amount of math, and if you don't get your math exactly right, you don't just get substandard results - you get horrifically bad results. A mis-tuned RED queue, as I learned in my tests, is both very easy to produce, and very destructive to use. Combining TBF, which is great, with SFQ, which is great, results in disaster. It's really mathematically hard.

The funny thing is that, if you could just reliably get the math right, the actual code behind traffic shaping is delightfully simple. Any tiny little routing box with a super crappy CPU and almost no memory can do it, and the results would be great.

It's one of those really frustrating computer science problems: most programmers just like to get down to coding stuff. If you work a little harder, your solution gets a little better. The world of traffic shaping isn't like that; it's not a lot of work, but the outputs also aren't proportional to the inputs. It's a step function from crappy to awesome, with nothing in between to give you a hint that you're on the right track.

Footnote: net neutrality

1 I said up above that ISPs won't do packet prioritization at their end of the link, for "good reason." That reason comes down to the whole famous "net neutrality" debate; should ISPs prioritize traffic or is every stream equal to every other stream? You might have heard of the net neutrality debate, but every time I hear it, I always think: wait, which side of this debate am I supposed to be on, anyway? I can never tell.

So far my answer is, it all sucks. In short, everyone can agree that their phone calls should take precedence over their bittorrent downloads. They can probably even agree that their phone calls should take precedence over their web browsing. And you know, most people can even agree that *other* people's phone calls and web browsing should take precedence over their own bittorrent downloads. Most of it is easy; most of the people spamming you about net neutrality only tell you about the easy bits.

But the thing nobody can agree on is... what *is* a phone call? And what *is* web browsing? And how can your ISP tell the difference so they can prioritize it the way we all agree is the right way?

Packet prioritization on your local network is no big deal. (It's useless, but easy; by the way, if your wireless router has a "wireless QoS" option, just turn it off. I guarantee you don't need it, and it definitely doesn't do what you think it does.)

Conversely, packet prioritization on the open internet is a giant security hole. If you can tag a skype session as "high priority", then what's to stop people from tagging their browser sessions as high priority? What's to stop some annoying hacker from blasting a million "high priority" packets at your IP address, completely filling your queue, which prevents any low priority packets from getting through *at all*, thus destroying your ability to access email or the web? Nothing can prevent this but your ISP. And so your ISP is stuck: do they honour those packet priority tags, or not? If they do, giant security hole. If they don't, skype calls are a bit less clear. On the open Internet, they pick the latter because the former is simply not acceptable to anybody.

Now, at least in Canada, all the ISPs are *also* in the VoIP business nowadays. They offer special boxes that do phone calls over their Internet link, in parallel with your cable/DSL modem that does Internet stuff. And they prioritize *that* traffic, right? Which is totally unfair to Skype, right? Totally unfair? Yes, absolutely!

But unfortunately, also justifiable for technical reasons. I'm not talking about business here; that's another question. But the idea is that you *obviously want* your phone connection to take precedence over the Internet connection; your ISP controls their network, the cable modem, and the phone box; and so the ISP can safely tag the packets - in a way your cable modem can be configured not to do - so that you and anybody else can't fake high-priority packets and cause trouble.

In short, barring insanely complicated inventions (ie. intserv, "integrated services", the failed opposite of diffserv) you or anyone can only securely implement prioritization on a network you completely control.

Unfortunately this gets the net neutrality police up in arms, because now BigCo's VoIP service has more reliable performance than Skype. And oh boy, does it ever; the sound quality is pretty much perfect, all the time. And yeah, that helps BigCo reinforce their monopoly position, and that totally sucks for Skype and competition and consumers. But make no mistake; if you pass a law requiring Net Neutrality, then you'll just make the VoIP service suck; you won't make Skype better. Or if you pass a law requiring ISPs to implement diffserv, then the whole Internet will collapse because a few idiots will abuse it by making their torrents look like phone calls.

And how about those special deals for Youtube and Netflix? Are those fair? Nope. But because those companies have special arrangements with the ISPs, the ISPs *can* prioritize those live video streams over random Internet crap without creating a giant gaping security hole. And let's face it, average consumers *do* want their Youtube and Netflix to be prioritized over their random web browsing, because glitchy video sucks, and glitchy web browsing is mostly unnoticeable. But is it fair to J. Random Bob's video sharing site? Of course not. Is there even any way, technically, for ISPs to offer the same service to J. Random Bob? No, and that's because the universe is fundamentally unfair.

With net neutrality, your choice is between suck and unfairness. Both options are terrible. And that's why neither side has won the debate and governments can't figure out what to do.

Oddly, it doesn't even matter that ISPs stand to make a lot of money by auctioning off priority to the highest bidder. We could deal with that; that would just be plain simple corruption. Take a bribe, or throw someone in jail, or whatever, but it would be over with by now.

Epilogue

There is some good news, though: I read somewhere (I can't find the link right now) that in Internet backbone-scale tests, simply overprovisioning bandwidth by around 1.6x (ie. 60% more bandwidth than you need) reduces jitter and latency just as much as packet prioritization. So if available bandwidth keeps going up at the rate it has been, unless some accursed person invents something even more bandwidth-wasteful than video (what could possibly do that?!) then this whole debate should end itself in a few years. When in doubt, use brute force.

Of course, even that does you no good if your router's queue is too big.

Syndicated 2011-01-10 07:39:17 from apenwarr - Business is Programming

Notes on "The Yishan Tenets of Engineering Management"

Yishan Wong from Facebook posted about some of his most important ideas on Engineering Management.

I largely agree with his ideas, although there were a few things I obviously haven't thought enough about. I selfishly enjoy that kind of article, because it makes me feel simultaneously like I'm smart and like I'm learning something.

For example, on hiring managers:

    ...the primary source of experienced managers is larger companies. Small companies cannot produce many managers (there just aren't enough people to manage, or if there are, it becomes a large company), so the vast majority of managers come from large companies. There is a low probability that external managers will understand, preserve, and strengthen the internal culture. They will tend to bring their own outside culture, which is typically that of the larger company. This is the single greatest potential force that can slow down your internal operations, usually by introducing process and methodology significantly before it is optimal to do so (the reason is usually "preparing the organization to scale").

I found that quote really striking because I have literally heard and accepted (and to my regret, even repeated) the "preparing the organization to scale" claim many times. Anything justified by this claim really has always been a cause of failure rather than success, but I hadn't noticed the correlation until he pointed it out.

(Ironically, I've reached the point in my career where various people who want to hire me think I should parachute in to be a manager for their engineering groups. Ouch.)

Relatedly, he has some specific comments on "process" as used in growing companies:

    At Facebook, there was a cultural resistance to process, to the point where the pattern around introducing process typically went "new process is reluctantly introduced only right before the point where things tip into chaos." Push this point as far as humanly possible, and then some, because what you receive in return is high organizational speed. If your organization has less process than another one of equivalent size, you will innovate and execute faster, taking ideas from conception to market more rapidly.

I have always hoped that the above is true, but never been able to prove it. I've certainly noticed a correlation between "high process" and "slow," but correlation doesn't imply causality, and slower isn't always worse... right? At least this provides one more vote for my preferred viewpoint, which is that if you have the right kind of culture, your company doesn't need as much "process" as you think you do.

Of course, I don't know much about the inner workings of the Facebook team; I don't even *use* Facebook. (Yeah, I know, I'm a luddite.) Based on my prejudices, running a company "that works like Facebook" - for example, in terms of revenue model and ethics - is not very high on my priority list. But oh well, I'll take wisdom where it comes.

Though in fairness, I now have to note that this creates (or maybe strengthens) a correlation between wisdom I like and companies I don't.

Finally, here's a quote that really got me thinking:

    Managers may need to psychologically contend with more chaos than they are comfortable with, but there is a huge difference between chaos that makes one uncomfortable and chaos that actually threatens the business.

I have observed that different people have different levels of tolerance for chaos. But I'd never considered that even when someone doesn't *like* chaos, their company might be more productive by taking on the chaos anyhow.

That idea isn't obvious, but think of the consequences! A group of excellent people successfully doing what they're good at, in a well-oiled smooth-running company that doesn't make mistakes, can still be outcompeted by a bunch of goofballs who are literally screwing up a hundred times a day. All because the latter people can work with chaos - even when they don't like it.

(Read Yishan's whole article)

Syndicated 2011-01-08 11:15:31 from apenwarr - Business is Programming

The first testimonial for djb redo

...other than from me, that is.

    It gets better - by "really wants" I mean "SCons requires sophisticated Python code to create new build rules". If you look at the "scons" branch of my bsnes-qt repository, it includes a "qt4" plugin[1] that adds special build rules for Qt4-based projects. It's nearly a thousand lines long, and I still have to hack around it in the SConstruct file because this particular project doesn't conform to the usual Qt conventions.

    By comparison the "build-scripts/util" file in my "redo" branch is about 60 lines long, half of which is the non-Qt-specific "compile" shell function. Granted, my solution would be a lot bigger if it was as portable and comprehensive as the SCons plugin, but at least this approach was feasible for me to create in an afternoon.

    -- Tim Allen on the redo mailing list

Take that, scons.

(More about my redo project)

Syndicated 2011-01-04 11:09:07 from apenwarr - Business is Programming

Happy New Year

    If, for some reason, we make some giant mistakes and IBM wins, my personal feeling is that we are going to enter sort of a computer Dark Ages for about 20 years. Once IBM gains control of a market sector, they almost always stop innovation. They prevent innovation from happening.

    -- Steve Jobs in 1985

Ouch.

Syndicated 2011-01-02 03:05:25 from apenwarr - Business is Programming

git vs. professionalism

I followed a random link today that led me to the complete email archive of the discussion, following Theo de Raadt's expulsion from NetBSD in 1994, that led to the creation of OpenBSD.

Check out this quote:

    I want _EVERYONE_ working on NetBSD to behave in a professional manner, in public _AND_ in private. Obviously, the latter can't be enforced, but if people complain about "unprofessional" behaviour, then it doesn't rightly matter _where_ it occurred.

    -- a NetBSD core team member

The entire email chain - from both sides - is totally pointless politics. But it's educational, because it reminds us of a few things programmers have learned since then:

  • enforced professionalism sounds scarily draconian;
  • the best programmers are not the best company representatives, and expecting them to be is silly;
  • withholding technical stuff (CVS access) in exchange for political stuff (an apology) just results in angst and wasted time;
  • accepting/refusing code based on anything other than technical merits (and valid licensing, I suppose) is stupid;
  • when you give a "core team" special powers, they abuse those powers;
  • you can hold emotional power over a programmer by controlling whether their work is used or not used, and that power needs to be used responsibly.

When I say we've learned these things, maybe I'm just being optimistic. Probably there are plenty of teams with equally stupid politics going on. I hear the Java Community Process is similarly a waste of time.

What would happen if you tried to kick Linus Torvalds off the "Linux core team?" He would laugh at you. There is no core team. There is nothing to kick him off of. There is no CVS commit access to revoke. He can be as much of a bastard as he wants, and a bunch of self-appointed "project representatives" can't stop him from giving us all the awesome code he wants to.

Of course, Linus is synonymous with Linux; if anyone was going to be "kicked out", it would be anyone or everyone but him. So it's even more funny that he's the guy who wrote git, the tool that makes it impossible to block anybody. Now he can be even more of a bastard, because even if people hate him and quit, they'll still share their code with the world, including him.

As far as I know, all the BSDs still use CVS. Of course they do. Switching away from that would be like admitting their whole world view is wrong.

    that is what the above is. if it has anger in it, then that is what is in it. it is an honest and open sharing of ideas and feelings. those feelings include anger, hurt, apprehension, loss. but i'm trying to find a way to forget about that past stuff, and you're not making it easy.

    -- Theo de Raadt

Syndicated 2010-12-22 07:53:04 from apenwarr - Business is Programming

The only build system that might someday replace make...

...is djb redo.

There are only two problems. In order of increasing difficulty:

1. you've never heard of it.

2. it doesn't exist.

Well, I just solved problem #1. Progress!

Problem #2 is a lot tougher. You see, the page linked above talks about a theoretically great build system, which perhaps D. J. Bernstein (djb), maker of qmail and djbdns, has implemented part of. But he doesn't link to an implementation, probably because his own code isn't up to his (rather exacting) standards yet, or he's gotten busy with other things and hasn't finished thinking through all the details.

I would like to tell you, in a concise way, exactly why redo is literally the most amazingly groundbreaking build system since the original invention of 'make', but I don't know how. So lacking conciseness, I'll just make a few grandiose claims:

  • it can do everything make can do;
  • with no baked-in assumptions about what you're building;
  • with much less code;
  • with much greater parallelism;
  • with finer-grained dependencies;
  • with much less syntax (actually nothing but /bin/sh);
  • while supporting recursion and full dependency information simultaneously (no Recursive Make Considered Harmful crap);
  • yet build scripts are highly modular and readable;
  • and you can checksum your targets instead of using timestamps;
  • and your build scripts run linearly instead of an orderless "ruleset";
  • with no implicit rules required;
  • and implementing C header autodependencies is completely sane;
  • and dependency checks involve no forking or parsing so it's crazy fast;
  • and you can incrementally convert parts of your project;
  • because it can play well with other build systems;
  • including jobserver compatibility with make -j;
  • oh, and you can write a plug-compatible toy
    implementation in 100 lines of shell.

Sounds awesome? You bet it is. Well, except for the not existing part.
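
To make the "nothing but /bin/sh" and "sane C header autodependencies" claims concrete: a build rule is just a shell script. Here's roughly what a default.o.do looks like - a sketch in the spirit of the examples in the README, not copied from any particular project, where $1 is the target, $2 is the target minus its extension, and $3 is the temporary output file:

  # default.o.do: builds any foo.o from the matching foo.c
  redo-ifchange "$2.c"
  gcc -MD -MF "$2.d" -c -o "$3" "$2.c"
  read DEPS <"$2.d"
  redo-ifchange ${DEPS#*:}   # every #included header becomes a dependency

Run "redo foo.o" and foo.c, plus every header gcc saw along the way, is now a tracked dependency; touch any one of them and only the affected objects get rebuilt.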

So I wrote it

Yes. djb's design is so simple that I thought I could do it in a couple of days. Actually I did have a working version in a couple of days. And normally, I'm a big fan of "release early, release often." But when it comes to build systems, well, you know, it's kind of a crowded market. I figured I'm only going to get one chance to convince you that this is the most fabulous thing ever.

That's why I've spent more than a month - days, evenings, and weekends - of my so-called vacation1 working on this stupid thing. It has man pages. It has a ridiculously huge README+FAQ. It has an installer. It has parallelism. It has unit tests, oh god, the unit tests. It has locking. In fact, it has so much parallelism and locking that it uncovered a race condition in MacOS X's fcntl(). It has a GNU make compatible jobserver, so that 'redo -j5' and 'make -j5' can call each other recursively and "just work." It has a few extensions that djb didn't talk about, like checksum-based dependencies. For testing, I converted a pretty huge build system (one that builds the Linux kernel and a bunch of libraries for two different target platforms), with some of my targets depending on more than 12000 files. redo can check every single fine-grained dependency in the whole project in about 4 seconds.

Version 0.01 actually delivers on every one of those grandiose claims.

As I'm writing this, it's 1574 lines of python. That's a lot smaller than the source for GNU make.

And for fun, I also wrote a super-stripped-down "forwards compatible" implementation (without support for actual dependencies; it just assumes everything is always dirty) in 100 lines of shell. That's less than 2 kbytes, and it works with any input file that works with full-sized redo (modulo any as-yet-undiscovered bugs), because djb's design is that awesome.

And you know what? I almost certainly still, after all that effort, screwed up and it probably still won't work for you. Dammit. That's why I paranoidly called it version 0.01 instead of 1.00. But please give it a try:

apenwarr's redo on github

(The super oversized README is visible at that link. Or you can read the man pages.)

Disclaimer

Because redo is currently written in python (and it does a lot of fork-exec calls), it runs a little slower than I would like. Someday rewriting it in C would make it faster - probably at least 10x faster. But it's reasonably fast already, and the increased parallelism and faster dependency checking often make up for the slowness.

The non-optimal performance is the only significant flaw I'm aware of at the moment... but I'm sure you'll try it out and tell me otherwise, right?

You should join the redo mailing list

You can find the mailing list on Google Groups at http://groups.google.com/group/redo-list.

Of course the mailing list archives are currently empty, because this is the first time I've announced it.

It might not look like it, but yes, you can subscribe without having a Google Account. Just send a message here: redo-list+subscribe@googlegroups.com.

Footnote

1 I told the guard at the U.S./Canada border that I was "between jobs." He offered his sympathies. I said, "No, I mean, literally." He said, "Oh, normally people use that as a euphemism." Anyway, it's now painfully clear that I suck at vacations.

Syndicated 2010-12-14 11:34:07 from apenwarr - Business is Programming

13 Dec 2010 (updated 14 Dec 2010 at 04:06 UTC) »

Everything you never wanted to know about file locking

(Foreshadowing: I found a bug in MacOS X 10.6's fcntl(F_SETLK) locking that could cause corruption of sqlite databases. To see if your system has the bug, compile and run locky.c.)

I've never previously had the (mis)fortune of writing a program that relied on file locking. Well, I've used databases and gdbm and whatnot, and they "use" file locking, but they've always hidden it behind an API, so I've never had to actually lock my own files.

I heard file locking is terribly messy and arcane and varies wildly between operating systems, even between different Unix systems; even between different versions of the same system (like Linux). After some experimentation, I can confirm: it really is that arcane, and it really is that variable between systems, and it really is untrustworthy. I'm normally a pessimist when it comes to computers, but Unix file locking APIs have actually outdone even my own pessimism: they're worse than I ever imagined.

Other than simple lockfiles, which I won't go into (but which you might just want to use from now on, after reading this article :)), there are three Unix file locking APIs: flock(), fcntl(), and lockf().

flock() locking

flock() is the simplest sort of lock. According to my Linux man page, it dates back to BSD. It is *not* standardized by POSIX, which means that some Unix systems (probably SysV-related ones, I guess) don't support flock().

flock() locks an entire file at a time. It supports shared locks (LOCK_SH: multiple people can have the file locked for read at the same time) and exclusive locks (LOCK_EX: only one person can make an exclusive lock on the file; shared and exclusive locks are not allowed to coexist). If you learned about concurrency in textbooks, flock() locks are "reader/writer" locks. A shared lock is a reader lock, and an exclusive lock is a writer lock.
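
As a rough sketch in C (made-up file name, minimal error handling), whole-file locking with flock() looks like this:

  /* Take an exclusive (writer) lock on an entire file with flock(). */
  #include <sys/file.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("some.data", O_RDWR | O_CREAT, 0666);
      if (fd < 0) { perror("open"); return 1; }

      /* LOCK_SH would be a shared (reader) lock; OR in LOCK_NB to get an
         immediate EWOULDBLOCK error instead of waiting for the lock. */
      if (flock(fd, LOCK_EX) < 0) { perror("flock"); return 1; }

      /* ... read and write the file ... */

      flock(fd, LOCK_UN);   /* or just close(fd) */
      close(fd);
      return 0;
  }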

According to the Linux man page for flock(), flock() does not work over NFS.

Upgrading from a shared flock() lock to an exclusive one is racy. If you own a shared lock, then trying to upgrade it to an exclusive lock, behind the scenes, actually involves releasing your lock and acquiring a new one. Thus, you can't be guaranteed that someone else hasn't acquired the exclusive lock, written to the file, and unlocked it before your attempt at upgrading the lock returns. Moreover, if you try to upgrade from shared to exclusive in a non-blocking way (LOCK_NB), you might lose your shared lock entirely.

Supposedly, flock() locks persist across fork() and (unlike fcntl locks, see below) won't disappear if you close unrelated files. HOWEVER, you can't depend on this, because some systems - notably earlier versions of Linux - emulated flock() using fcntl(), which has totally different semantics. If you fork() or if you open the same file more than once, you should assume the results with flock() are undefined.

fcntl() locking

POSIX standardized the fcntl(F_SETLK) style of locks, so you should theoretically find them on just about any Unix system. The Linux man page claims that they've been around since 4.3BSD and SVr4.

fcntl() locks are much more powerful than flock() locks. Like flock(), there are both shared and exclusive locks. However, unlike flock(), each lock has a byte range associated with it: different byte ranges are completely independent. One process can have an exclusive lock on byte 7, and another process can have an exclusive lock on byte 8, and several processes can have a shared lock on byte 9, and that's all okay.

Note that calling them "byte ranges" is a bit meaningless; they're really just numbers. You can lock bytes past the end of the file, for example. You could lock bytes 9-12 and that might mean "records 9-12" if you want, even if records have variable length. The meaning of a fcntl() byte range is up to whoever defines the file format you're locking.
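
In C, a byte-range lock looks something like this sketch (made-up file name, abbreviated error handling): exclusively lock "bytes" 10 through 19, waiting if somebody else already has them.

  /* Exclusively lock bytes 10..19 of a file with fcntl(). */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("some.data", O_RDWR);
      if (fd < 0) { perror("open"); return 1; }

      struct flock fl = {
          .l_type   = F_WRLCK,   /* exclusive; F_RDLCK would be shared */
          .l_whence = SEEK_SET,
          .l_start  = 10,        /* start of the range */
          .l_len    = 10,        /* length; 0 means "from here to the end, forever" */
      };
      /* F_SETLKW waits for the lock; F_SETLK fails immediately instead. */
      if (fcntl(fd, F_SETLKW, &fl) < 0) { perror("fcntl"); return 1; }

      /* ... work on that part of the file ... */

      fl.l_type = F_UNLCK;
      fcntl(fd, F_SETLK, &fl);
      close(fd);
      return 0;
  }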

As with all the locks discussed in this article, these byterange locks are "advisory" - that is, you can read and write the file all day long even if someone other than you has an exclusive lock. You're just not supposed to do that. A properly written program will try to acquire the lock first, at which time it will be "advised" by the kernel whether everything is good or not.

The locks are advisory, which is why the byte ranges don't have to refer to actual bytes. The person acquiring the lock can interpret it however they want.

fcntl() locks are supposedly supported by NFSv3 and higher, but different kernels do different random things with them. Some kernels just lock the file locally, and don't notify the server. Some notify the server, but do it wrong. So you probably can't trust them.

According to various pages on the web that I've seen, fcntl() locks don't work on SMB (Windows file sharing) filesystems mounted on MacOS X. I don't know if this is still true in the latest versions of MacOS; I also don't know if it's true on Linux. Note that since flock() doesn't work on NFS, and fcntl() doesn't work on SMB filesystems, there is no locking method that works reliably on all remote filesystems.

It doesn't seem to be explicitly stated anywhere, but it seems that fcntl() shared locks can be upgraded atomically to fcntl() exclusive locks. That is, if you have a shared lock and you try to upgrade it to an exclusive lock, you can do that without first releasing your shared lock. If you request a non-blocking upgrade and it fails, you still have your shared lock.

(Trivia: sqlite3 uses fcntl() locks, but it never uses *shared* fcntl() locks. Instead, it exclusively locks a random byte inside a byterange. This is apparently because some versions of Windows don't understand shared locks. As a bonus, it also doesn't have to care whether upgrading a lock from shared to exclusive is atomic or not. Update 2010/12/13: specifically, pre-NT versions of Windows had LockFile, but not LockFileEx.)

fcntl() locks have the handy feature of being able to tell you which pid owns a lock, using F_GETLK. That's pretty cool - and potentially useful for debugging - although it might be meaningless on NFS, where the pid could be on another computer. I don't know what would happen in that case.
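
A sketch of what that query looks like (it assumes fd is a file descriptor you've already opened; the byte range is made up):

  #include <fcntl.h>
  #include <stdio.h>

  /* Ask which pid, if any, would stop us from write-locking bytes 10..19. */
  static void who_has_it_locked(int fd)
  {
      struct flock probe = {
          .l_type = F_WRLCK, .l_whence = SEEK_SET,
          .l_start = 10, .l_len = 10,
      };
      if (fcntl(fd, F_GETLK, &probe) < 0)
          perror("F_GETLK");
      else if (probe.l_type == F_UNLCK)
          printf("nobody is in our way\n");
      else
          printf("pid %ld holds a conflicting lock\n", (long)probe.l_pid);
  }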

fcntl() locks have two very strange behaviours. The first is merely an inconvenience: unlike nearly everything else about a process, fcntl() locks are not shared across fork(). That means if you open a file, lock some byte ranges, and fork(), the child process will still have the file, but it won't own any locks. The parent will still own all the locks. This is weird, but manageable, once you know about it. It also makes sense, in a perverse sort of way: this makes sure that no two processes have an exclusive lock on the same byterange of the same file. If you think about it, exclusively locking a byte range, then doing fork(), would mean that *two* processes have the same exclusive lock, so it's not all that exclusive any more, is it?

Maybe you don't care about these word games, but one advantage of this absolute exclusivity guarantee is that fcntl() locks can detect deadlocks. If process A has a lock on byte 5 and tries to lock byte 6, and process B has a lock on byte 6 and tries to lock byte 5, the kernel can give you EDEADLK, which is kind of cool. If it were possible for more than one process to own the same exclusive locks, the algorithm for this would be much harder, which is probably why flock() locks can't do it.

The second strange behaviour of fcntl() locks is this: the lock doesn't belong to a file descriptor, it belongs to a (pid,inode) pair. Worse, if you close *any* fd referring to that inode, it releases all your locks on that inode. For example, let's say you open /etc/passwd and lock some bytes, and then call getpwent() in libc, which usually opens /etc/passwd, reads some stuff, and closes it again. That process of opening /etc/passwd and closing it - which has nothing to do with your existing fd open on /etc/passwd - will cause you to lose your locks!

That behaviour is certifiably insane, and there's no possible justification for why it should work that way. But it's how it's always worked, and POSIX standardized it, and now you're stuck with it.

An even worse example: let's say you have two sqlite databases, db1 and db2. But let's say you're being mean, and you actually make db1 a hardlink to db2, so they're actually the same inode. If you open both databases in sqlite at the same time, then close the second one, all your open sqlite locks on the first one will be lost! Oops! Except, actually, the sqlite guys have already thought of this, and it does the right thing. But if you're writing your own file locking routines, beware.

So anyway, beware of that insane behaviour. Also beware of flock(), which on some systems is implemented as a call to fcntl(), and thus inherits the same insane behaviour.

Bonus insanity feature: the struct you use to talk to fcntl() locks is called 'struct flock', even though it has nothing to do with flock(). Ha ha!

lockf() locking

lockf() is also standardized by POSIX. The Linux man page also mentions SVr4, but it doesn't mention BSD, which presumably means that some versions of BSD don't do lockf().

POSIX is also, apparently, unclear on whether lockf() locks are the same thing as fcntl() locks or not. On Linux and MacOS, they are documented to be the same thing. (In fact, lockf() is a libc call on Linux, not a system call, so we can assume it makes the same system calls as fcntl().)

The API of lockf() is a little easier than fcntl(), because you don't have to mess around with a struct. However, there is no way to query a lock to find out who owns it.

Moreover, lockf() may not be supported by pre-POSIX BSD systems, it seems, so this little bit of convenience also costs you in portability. I recommend you avoid lockf().

Interaction between different lock types

...is basically undefined. Don't use multiple types of locks - flock(), fcntl(), lockf() - on the same file.

The MacOS man pages for the three functions proudly proclaim that on MacOS (and maybe on whatever BSD MacOS is derived from), the three types of locks are handled by a unified locking implementation, so in fact, you *can* mix and match different lock types on the same file. That's great, but on other systems, they *aren't* unified, so doing so will just make your program fail strangely on other systems. It's non-portable, and furthermore, there's no reason to do it. So don't.

When you define a new file format that uses locking, be sure to document exactly which kind of locking you mean: flock(), fcntl(), or lockf(). And don't use lockf().

Mandatory locking

Stay far, far away, for total insanity lies in wait.

Seriously, don't do it. Advisory locks are the only thing that makes any sense whatsoever. In any application. I mean it.

Need another reason? The docs say that mandatory locking in Linux is "unreliable." In other words, they're not as mandatory as they're documented to be. "Almost mandatory" locking? Look. Just stay away.

Still not convinced? Man, you really must like punishment. Look, imagine someone is holding a mandatory lock on a file, so you try to read() from it and get blocked. Then he releases his lock, and your read() finishes, but some other guy reacquires the lock. You fiddle with your block, modify it, and try to write() it back, but you get held up for a bit, because the guy holding the lock isn't done yet. He does his own write() to that section of the file, and releases his lock, so your write() promptly resumes and overwrites what he just did.

What good does that do anyone? Come on. If you want locking to do you any good whatsoever, you're just going to have to acquire and release your own locks. Just do it. If you don't, you might as well not be using locks at all, because your program will be prone to horrible race conditions and they'll be extra hard to find, because mandatory locks will make it *mostly* seem to work right. If there's one thing I've learned about programming, it's that "mostly right" programs are *way* worse than "entirely wrong" programs. You don't want to be mostly right. Don't use mandatory locks.

Bonus feature: file locking in python

python has a module called "fcntl" that actually includes - or rather, seems to include - all three kinds of locks: flock(), fcntl(), and lockf(). If you like, follow along in the python source code to see how it works.

However, all is not as it seems. First of all, flock() doesn't exist on all systems, apparently. If you're on a system without flock(), python will still provide a fcntl.flock() function... by calling fcntl() for you. So you have no idea if you're actually getting fcntl() locks or flock() locks. Bad idea. Don't do it.

Next is fcntl.fcntl(). Although it pains me to say it, you can't use this one either. That's because it takes a binary data structure as a parameter. You have to create that data structure using struct.pack(), and parse it using struct.unpack(). No problem, right? Wrong. How do you know what the data structure looks like? The python fcntl module docs outright lie to you here, by providing an example of how to build the struct flock... but they just made assumptions about what it looks like. Those assumptions are definitely wrong if your system has 64-bit file offset support, which most of them do nowadays, so trying to use the example will just give an EINVAL. Moreover, POSIX doesn't guarantee that struct flock won't have other fields before/after the documented ones, or that the fields will be in a particular order, so even without 64-bit file offsets, your program is completely non-portable. And python doesn't offer any other option for generating that struct flock, so the whole function is useless. Don't do it. (You can still safely use fcntl.fcntl() for non-locking-related features, like F_SETFD.)

The only one left is fcntl.lockf(). This is the one you should use. Yeah, I know, up above I said you should avoid lockf(), because BSD systems might not have it, right? Well yeah, but that's C lockf(), not python's fcntl.lockf(). The python fcntl module documentation says of fcntl.lockf(), "This is essentially a wrapper around the fcntl() locking calls." But looking at the source, that's not quite true: in fact, it is *exactly* a wrapper around the fcntl() locking calls. fcntl.lockf() doesn't call C lockf() at all! It just fills in a struct flock and then calls fcntl()!

And that's exactly what you want. In short:

  • in C, use fcntl(), but avoid lockf(), which is not necessarily the same thing.
  • in python, use fcntl.lockf(), which is the same thing as fcntl() in C.
(Unfortunately, although calling fcntl.lockf() actually uses fcntl() locks, there is no way to run F_GETLK, so you can't find out which pid owns the lock.)

Bonus insanity feature: instead of using the C lockf() constants (F_LOCK, F_TLOCK, F_ULOCK, F_TEST), fcntl.lockf() actually uses the C flock() constants (LOCK_SH, LOCK_EX, LOCK_UN, LOCK_NB). There is no conceivable reason for this; it literally just takes in the wrong constants, and converts them to the right ones before calling fcntl().

So that means python gives you three locks in one! The constants from flock(), the functionality of fcntl(), and the name lockf(). Thanks, python, for making my programming world so simple and easy to unravel.
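
Putting all that together, the least-insane python incantation looks something like this sketch (the file name and byte range are made up for illustration):

  import fcntl, os

  fd = os.open("mydata", os.O_RDWR | os.O_CREAT)
  try:
      # Exclusive, non-blocking lock on bytes 0..9.  Raises IOError/OSError
      # (EACCES or EAGAIN) if someone else already holds a conflicting lock.
      fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 10, 0, os.SEEK_SET)
      # ... read/modify/write the locked region here ...
  finally:
      fcntl.lockf(fd, fcntl.LOCK_UN, 10, 0, os.SEEK_SET)
      os.close(fd)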

Epilogue

I learned all this while writing a program (in python, did you guess?) that uses file locking to control concurrent access to files. Naturally, I wanted to pick exactly the right kind of locks to solve my problem. Using the logic above, I settled on fcntl() locks, which in my python program means calling fcntl.lockf().

So far, so good. After several days of work - darn it, I really hate concurrent programming - I got it all working perfectly. On Linux.

Then I tried to port my program to MacOS. It's python, so porting was pretty easy - ie. I didn't change anything - but the tests failed. Huh? Digging deeper, it seems that some subprocesses were acquiring a lock, and sometime later, they just didn't own that lock anymore. I thought it might be one of the well-known fcntl() weirdnesses - maybe I fork()ed, or maybe I opened/closed the file without realizing it - but no. It only happens when *other* processes are locking byteranges on the same file. It appears the MacOS X (10.6.5 in my test) kernel is missing a mutex somewhere.

I wrote a minimal test case and filed a bug with Apple. If you work at Apple, you can find my bug report as number 8760769.

Dear Apple: please fix it. As far as I know, with this bug in place, any program that uses fcntl() locks is prone to silent file corruption. That includes anything using sqlite.

Super Short Summary

  • don't use flock() (python: fcntl.flock()) because it's not in POSIX and it doesn't work over NFS.
  • don't use lockf() (python: does not exist) because it's not in BSD, and probably just wraps fcntl().
  • don't use fcntl() (python: fcntl.lockf()) because it doesn't work over SMB on MacOS, and actually, on MacOS it doesn't work right at all.
Oh, and by the way, none of this applies on win32.

Are we having fun yet? I guess lockfiles are the answer after all.
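
For what it's worth, the simple-lockfile trick mentioned at the top is just atomic file creation. Here's a sketch in C with a made-up name; cleaning up stale lockfiles after a crash is its own can of worms, which this ignores:

  /* Whoever creates the lockfile first wins; everyone else backs off. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("mydata.lock", O_CREAT | O_EXCL | O_WRONLY, 0666);
      if (fd < 0) {
          perror("lock is busy (or something worse)");
          return 1;
      }
      /* ... do the critical work on mydata ... */
      close(fd);
      unlink("mydata.lock");
      return 0;
  }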

I bet you're really glad you read this all the way to the end.

Syndicated 2010-12-13 11:03:10 (Updated 2010-12-14 04:06:09) from apenwarr - Business is Programming

12 Dec 2010 (updated 14 Dec 2010 at 04:06 UTC) »

Two things about programming

    1. Nobody really knows how to do it

    2. If you think you have a reliable system for doing it, you're probably doing the computer's job

       -- "extension" on news.ycombinator.org

Syndicated 2010-12-12 13:01:15 (Updated 2010-12-14 04:06:09) from apenwarr - Business is Programming

Canada and the Guaranteed Annual Income idea

The Globe and Mail recently published an article titled To end poverty, guarantee everyone in Canada $20,000 a year. (Note: I think the $20,000 is a made-up number but I'll keep using it here for illustrative purposes.) It was then picked up in the comments at YCombinator news.

Unfortunately, both the article and the comments seem to misrepresent the actual proposal. A better reference is Income Security for All Canadians (pdf) that explains in more detail.1

But you don't necessarily need more detail to understand why this idea is a good one. It's simple: the proposal doesn't change anything except the administrative overhead.

In Canada, true homelessness - sleeping on the street involuntarily - equals death. It simply gets too cold here. If you don't want people to die, you provide social services, whether through private organizations or through the government. And so we do, in a complex combination of both.

Whether or not you agree with this idea - keeping people alive and healthy and, to some extent, comfortable even if they can't earn a living - can be debated, and it often is. Every country does it differently, and the rules and amounts are constantly changing.

But in any case, in Canada, the overall outcome of our complex set of rules is clear: there is a minimum guaranteed annual income. If you're not earning it, some government agency or another is willing to give it to you, if you jump through the required hoops.

The "guaranted annual income" proposal, then, is not a proposal to start guaranteeing a minimum annual income. We already have that. It's just a renaming and simplification of what we already have. Instead of a network of complicated government agencies to guarantee your annual income, and a bunch of complicated proof requirements, we just have one agency and the only proof required is that you're a Canadian.

Where does the money come from? That's easy: it comes from where it already came from. It's the same money as before, going to the same people.

What if people abuse the system? That's easy: the same thing that already happens. It's the same money and the same people as before.

Won't it cause inflation? No, because it's the same money as before.

Won't it make people lazy and not bother to get a job? Maybe, but so does the current system. And the income we're talking about is really not a comfortable income; it's just barely enough to survive. People complain regularly, and probably quite accurately, that the current welfare situation in Canada still leaves many people below the poverty line. That won't change.

Won't people just take the money and spend it all on drugs and alcohol? Probably, yes. But they would do that anyway. Such problems can't be solved with money alone.

Why do we give it to every citizen, instead of just the citizens who need it? Good question. The answer is: because the math works better that way. Here's why:

Right now, if you're taking unemployment insurance or welfare or receive a disability pension, the short version of a complex story is that every $1 you earn - by taking a job - reduces your free government income by about $1. At a really basic level, that sounds fair; we guaranteed you would get a survival income, so if you're earning money yourself, we'll supplement it, but only up to the survival amount. Why shouldn't we give more charity to someone with no job than someone with a job?

Thinking about it more deeply, though, the answer becomes clear: because if you take away money from people when they earn money, they have no incentive to earn money.2 A 20 hour a week job and $20k/year is worse than no job and $20k/year. Sure, maybe with a 40 hour a week job, you could earn $40k/year, which is more than the basic welfare amount; but to get there, you have to get started. A lot of the people who need welfare couldn't just go get a job that, on day 1, would pay more than $20k/year. They have to gain experience, get promoted, and so on. Sure, in the long term it might pay off, but if everyone was good at long term planning, we wouldn't be in this mess.

What about the rich? Are we going to give them $20k/year too? Technically yes. But just as technically, we'd obviously raise the higher tax brackets so that their effective income would be the same as before. Progressive taxation makes sense; if you earn $1, the government takes away $0.40 or whatever, but you still have $0.60 more than if you didn't earn that $1. So you still have an incentive to work more.

In summary, the main advantages of the guaranteed annual income proposal are:

  • it doesn't actually cost money;
  • it massively simplifies administration (at least in theory);
  • it reduces the dehumanization of proving you're needy before we'll help you (note to Ayn Rand lovers: her main complaint about welfare was this part - "to each according to their needs" - not the part where we keep people from starving);
  • it means that people who are too far gone (drugs, etc) to currently file for welfare could potentially get the help they're entitled to because accessing it becomes easier;
  • it "sounds fairer" than just giving a lot of rich people's money to poor people; we're giving money equally to everyone, rich or poor;
  • it combines the advantages of socialism (less starving dying homeless people means rich people can be happier and guilt-free) with the advantages of capitalism (more work leads to more income, as it should).
I am not aware of any major disadvantages of such a plan. Let me know if you think of any realistic ones and I can add them here.

It's very interesting to me that it's the conservatively-minded politicians in both Canada and the United States that seem to be working on this idea; I would have expected it to be a more liberal suggestion. Perhaps the reduced administrative overhead ("smaller government") is appealing.

A government that managed to pass such an overhaul would certainly be remembered forever in Canadian history. Um, in a good way. I think.

Footnotes

1 If you really want a lot of detail, I recommend A Fair Country, by John Ralston Saul. It's a really good book if you want to understand where Canadian culture comes from. (Spoiler: it comes largely from our aboriginal population. Bet you didn't see that coming. Neither did I. But he convinced me. Quiz question: which of our supposed primary influences, the British or the French, is the reason we're so polite? So accepting of other cultures? So biased toward peacekeeping instead of empire-building? Answer: neither of them, obviously.) That book was where I first heard of the guaranteed annual income proposal in a well-presented way.

2 To be honest, I'm a little shocked and/or disbelieving that it works this way, because it's so obviously stupid. But I've talked to people on the receiving ends of these programmes, and more than once I've heard that there's no point for them to get a job, because if they did, they'd be cheating themselves out of free money. A rather obvious halfway implementation of the guaranteed annual income scheme would be to simply deduct only a fraction of your earnings from your government supplements; it's not as elegant as the whole deal, but at least you aren't outright discouraging people from getting a job. (If that's already what happens and I'm misinformed, well, maybe that's why this GAI idea is necessary; because it's so complicated that even the people receiving these services don't understand them. I admit to never having collected welfare, disability, or unemployment insurance, so I have no real experience here.)

Syndicated 2010-11-21 07:30:01 from apenwarr - Business is Programming

