Yep, that's been a problem for decades, not just in distributed systems but perhaps especially in distributed systems. I'm practically overjoyed to see things like "Formal methods at Amazon Web Services", because it means at least some kind of bridge between the two exists.
My suspicion, setting aside johnparkerg's point about polarization, is that distributed systems are particularly messy from a formal standpoint compared to non-distributed systems. In practice you can live with that messiness as long as you can reduce it to a sufficient number of 9's, which seems to be anathema to theoretical approaches.
For example, the proof of the CAP theorem is irrelevant in practice, simply because no system I'm aware of satisfies the strong consistency assumptions it requires. On the other hand, CAP behavior is definitely a problem once you reach a certain size.
There is a parallel divide in AI: "neats" vs "scruffies". Neats like systems that are provably correct and formal, while scruffies do things like superposing multiple algorithms with fudge factors in between, set by a genetic algorithm. Formalists run away screaming, but it works "sometimes", where "sometimes" is the problem domain they are working within. A throwaway sketch of that recipe follows.
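To make the "scruffy" recipe concrete, here's a toy sketch (every function, target, and number in it is made up for illustration): blend two ad-hoc heuristics with a fudge factor w, then let a crude genetic algorithm pick w empirically rather than deriving it.

    import random

    def heuristic_a(x):
        return 2.0 * x                     # made-up scoring rule

    def heuristic_b(x):
        return x * x - 3.0                 # made-up scoring rule

    def blended(x, w):
        # Superpose the two heuristics with a single fudge factor w.
        return w * heuristic_a(x) + (1.0 - w) * heuristic_b(x)

    def fitness(w, samples):
        # How well the blend tracks some observed target on sample inputs;
        # the target (x + 1) is arbitrary and stands in for domain data.
        return -sum((blended(x, w) - (x + 1.0)) ** 2 for x in samples)

    def evolve(generations=50, pop_size=20):
        samples = [random.uniform(-2.0, 2.0) for _ in range(100)]
        pop = [random.random() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda w: fitness(w, samples), reverse=True)
            survivors = pop[:pop_size // 2]
            # Refill the population by mutating the survivors.
            pop = survivors + [min(1.0, max(0.0, w + random.gauss(0, 0.05)))
                               for w in survivors]
        return pop[0]

    random.seed(1)
    print("evolved fudge factor:", round(evolve(), 3))

Nothing here is provable about the result; it just tends to work on the data you fed it, which is exactly the scruffy bargain.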
If FLP says consensus is impossible with one faulty process, and faults happen all the time in practice, how are real systems built with consensus?
Good question. I think the answer is that you don't build large systems using consensus; you bootstrap large systems with very small systems using consensus. The very small systems are reasonably assumed to have no Byzantine faults (i.e. just like you more or less rely on a single database server not to have faults).
All the systems I know of that use consensus are meant to be small, e.g. run on 5 machines or so (and of course the membership is fixed). Google's Chubby, Yahoo's ZooKeeper, and similar systems like doozer and etcd all work like this.
Consensus doesn't "scale" anyway (the latency isn't bearable). If you only have 5 machines, the likelihood of Byzantine faults over a reasonably long time period is low. The main problem you will see is your own software bugs (i.e. not bugs due to faulty CPU, memory, disks, switches, etc.).
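Here's a sketch of the shape of that bootstrap pattern. Everything below is a made-up, in-memory stand-in (not a real ZooKeeper/etcd client API): a tiny consensus service holds only small, critical coordination state, and the large fleet touches it rarely.

    import time

    class CoordClient:
        # Stand-in for a client of a small (~5 node) consensus cluster.
        def __init__(self):
            self._leases = {}                       # key -> (holder, expiry time)

        def try_acquire(self, key, holder, ttl):
            # In a real deployment this is one consensus round: slow (a few ms)
            # but infrequent, so it stays off the high-volume data path.
            now = time.time()
            owner = self._leases.get(key)
            if owner is None or owner[1] < now or owner[0] == holder:
                self._leases[key] = (holder, now + ttl)
                return True
            return False

    class Worker:
        # One of many workers in the large system; it only talks to the
        # consensus service for leadership leases, never for ordinary requests.
        def __init__(self, name, coord):
            self.name, self.coord = name, coord

        def maybe_lead(self):
            if self.coord.try_acquire("shard-7-leader", self.name, ttl=10):
                print(self.name, "is leader for shard 7")
            else:
                print(self.name, "follows")

    coord = CoordClient()
    for w in [Worker("worker-a", coord), Worker("worker-b", coord)]:
        w.maybe_lead()

The point is the ratio: the consensus cluster sees a handful of lease operations, while the workers handle all the real traffic without ever waiting on a consensus round.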
Apparently Marc is right about the need for more of this - the answer is that the kind of nasty timing of message losses that can render consensus impossible doesn't, by and large, persist for very long in practice. You don't need Byzantine failures to prevent consensus; the impossibility follows from the arbitrary network behavior allowed in an asynchronous communication model.
I should add this question to one of my exams and see if my students get it right. I'll be a little grumpy if they don't, but it's a great question. :)
What's an example of "timing of message losses" that would make consensus impossible? I didn't think say Paxos depended on any such assumptions.
I suppose you can mathematically construct some kind of non-total message loss such that Paxos would never make progress. But that kind of message loss won't persist forever in a real system... it would basically be message loss with knowledge of the algorithm?
The classic scenario: proposer A prepares round n, proposer B prepares round n+1 before A's accept lands, A retries with round n+2, which kills B's round, and so on. In practice randomized backoffs when this happens cause it to converge extremely fast, but exactly the wrong timing COULD cause it to delay forever. Don't lose sleep over it though.
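To make that concrete, here's a throwaway simulation of two dueling proposers (all the timings and delays are invented for illustration, not taken from any real Paxos implementation). With a fixed, symmetric retry delay the two proposers preempt each other forever; with randomized backoff one of them soon gets a clear window.

    import random

    PHASE_GAP = 1.0    # time between a proposer's prepare and its accept arriving

    def run(randomized, max_attempts=1000):
        random.seed(42)
        # Next prepare times; stagger them so the duel starts immediately.
        prepare_at = [0.0, 0.5]
        attempts = 0
        while attempts < max_attempts:
            attempts += 1
            p = 0 if prepare_at[0] <= prepare_at[1] else 1
            other = 1 - p
            accept_at = prepare_at[p] + PHASE_GAP
            if prepare_at[other] >= accept_at:
                # The other proposer stayed quiet long enough: p's accept succeeds.
                return "proposer %d decided after %d prepares" % (p, attempts)
            # Preempted by the other proposer's higher-numbered prepare;
            # retry after a fixed or randomized delay.
            delay = random.uniform(0.0, 2.0) if randomized else 0.3
            prepare_at[p] = accept_at + delay
        return "still dueling (livelock)"

    print("fixed retry delay: ", run(randomized=False))
    print("randomized backoff:", run(randomized=True))

The fixed-delay case is exactly the "message loss with knowledge of the algorithm" flavor: the schedule has to keep landing each prepare inside the other proposer's window, which randomness breaks almost immediately.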
Yeah that is basically what I was imagining, but there's no failure condition that matches that in reality. You might get unlucky and fail to converge for 10 iterations, but not for say 10,000.
It is sort of Byzantine in that the "system" is inserting very specific message losses with knowledge of the algorithm.
I think this is just a reflection of the gap between academia and the real world. One could argue that the problem is the lack of a 'trial and error' philosophy on the theory side, or that scientists could never have developed so much if they were bothered with the real world.
I believe the true problem lies in our need to categorize people as either scientists and theoretical engineers on the one hand or down-to-earth engineers on the other, which only polarizes the spectrum.
FLP/CAP merely illustrates the fundamental trade-off between "safety/consistency" and "liveness/availability" in distributed systems.
There is a spectrum of design choices between strong consistency with best-effort availability and best-effort everything, and many points on that spectrum make sense for practical use cases.
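For one concrete point on that spectrum, here's a quick illustration using Dynamo-style N/R/W quorums (the replica counts below are just example values, not from any particular system). R + W > N means every read quorum overlaps every write quorum, trading availability for consistency; smaller quorums trade the other way.

    def quorum_properties(n, r, w):
        strongly_consistent = r + w > n        # read and write quorums always overlap
        read_failures_tolerated = n - r        # replicas that can be down, reads still succeed
        write_failures_tolerated = n - w
        return strongly_consistent, read_failures_tolerated, write_failures_tolerated

    for n, r, w in [(3, 2, 2),    # consistent reads, modest availability
                    (3, 1, 3),    # fast reads, writes need every replica up
                    (3, 1, 1)]:   # best-effort everything: available but stale reads possible
        sc, fr, fw = quorum_properties(n, r, w)
        print("N=%d R=%d W=%d: consistent=%s, reads survive %d failures, "
              "writes survive %d failures" % (n, r, w, sc, fr, fw))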
I think the middle ground between theory and practice has gaps in most areas of expertise. The problem may be that to create good content in those gaps, one needs a certain level of skill in both theory and practice. Very few people ever get around to stepping aside from their daily work (which is usually pure theory or pure practice) to achieve something on the other end, though.
I would add a ninth fallacy to the eight: CPU is free. This is effectively true on desktops, and possibly on servers if task priorities are used, but on mobile devices, including laptops, CPU time eats battery life.