/dev/spout: Availability in the CAP Theorem is Not What You Think

I just realized that "Availability" in the CAP Theorem is not what I thought it was. My expectation is that if a service has an explicit latency specification (e.g., return within 2 seconds), meeting it is part of the "agreed function" the service has to perform to count as "up". Availability in the CAP theorem means something different.

I've been taking another look at the CAP theorem because I have to build a RESTful web service that supports US and European users, each with very low latency and a pretty robust availability specification. The ITIL definition of availability is phrased as the percentage of the time a service can perform its agreed function. As I said, part of its "agreed function" is to return quickly.

In the real world, you measure availability with monitoring tools that poll at some frequency and verify that the service responds appropriately within some period of time. The monitoring tool opens a thread, issues its network request(s), and starts a timer; if the timer expires first, it declares the service "down". No monitoring tool will ever leak a thread by waiting indefinitely. Anyway, back to the CAP theorem.
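To make the monitoring-tool idea concrete, here is a minimal sketch of a probe with a deadline. The names `probe`, `fast_service`, and `slow_service` are made up for illustration, and the "services" are plain local functions standing in for a network call; a real monitor would issue an HTTP request instead.

```python
import concurrent.futures
import time


def probe(service, deadline_s):
    """Return True ("up") if service() answers within deadline_s, else False ("down")."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(service)
    try:
        future.result(timeout=deadline_s)
        return True
    except concurrent.futures.TimeoutError:
        return False  # too slow counts as down
    except Exception:
        return False  # an error from the service also counts as down
    finally:
        pool.shutdown(wait=False)  # do not block the monitor on a slow call


def fast_service():
    # Hypothetical stand-in for a healthy endpoint.
    return "ok"


def slow_service():
    # Hypothetical stand-in for an endpoint that answers, but too late.
    time.sleep(0.5)
    return "ok"


if __name__ == "__main__":
    print(probe(fast_service, deadline_s=0.2))
    print(probe(slow_service, deadline_s=0.05))
```

Note that `slow_service` does eventually return, so it would count as available in the CAP sense; the monitor still reports it as down because it missed the deadline.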

The definitive source of information about the CAP theorem is its proof, by Gilbert and Lynch. It is a short paper, approachable and useful if you aren't scared of proofs. In the section on Availability (2.2), they define it like so:

"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. That is, any algorithm used by the service must eventually terminate. In some ways this is a weak definition of availability: it puts no bound on how long the algorithm may run before terminating, and therefore allows unbounded computation."

So taking 10 hours or 3 days to return the answer is "available" under this definition. It really is more of a halting condition.

This realization changes the way I think about the CAP theorem a little. The definition of "partition tolerance" contemplates working correctly in spite of message loss: "No set of failures less than total network failure is allowed to cause the system to respond incorrectly." But if there is no bound on how long the system can take, it can deal with network partitions by retrying each message exchange until it succeeds. I suspect the theorem still holds because retrying allows arbitrary delay in processing, and that this delay breaks consistency, which requires a total ordering of changes, each appearing to complete at a single point in time across the whole system.
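As a rough illustration of "retry each message exchange until it succeeds", here is a small sketch. The lossy link is simulated and the names `send_until_delivered` and `make_lossy_link` are hypothetical; nothing here comes from the Gilbert and Lynch paper.

```python
import itertools


def send_until_delivered(send, message):
    """Retry send(message) until it reports success; return the attempt count.

    There is deliberately no upper bound on attempts, so this always
    "eventually responds" once the link heals, but says nothing about when.
    """
    for attempt in itertools.count(1):
        if send(message):
            return attempt


def make_lossy_link(drops):
    """Hypothetical link that loses the first `drops` messages, then delivers."""
    remaining = [drops]

    def send(message):
        if remaining[0] > 0:
            remaining[0] -= 1
            return False  # message lost
        return True  # message delivered

    return send


if __name__ == "__main__":
    link = make_lossy_link(drops=3)
    print(send_until_delivered(link, "write x=1"))
```

The longer the link stays broken, the larger the attempt count grows, which is exactly the unbounded delay the weak definition of availability permits.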

The upshot is that I don't think anybody really cares whether a real-world system has the availability property. We generally want something both stronger and weaker: "most" of the time (e.g., 99.9%), we want some kind of fixed bound on latency. It's stronger because of the fixed bound, and weaker because we are allowed to fail some small percentage of the time. A real system probably only misses such a response-time bound when there is a failure, so we are really specifying how rare a partition must be.
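The two notions can be put side by side with a small sketch. The function names and the sample latencies below are invented for illustration; `None` stands for a request that never got a response.

```python
def bounded_availability(latencies_s, bound_s):
    """Fraction of requests answered within bound_s (the real-world notion)."""
    if not latencies_s:
        raise ValueError("need at least one sample")
    ok = sum(1 for t in latencies_s if t is not None and t <= bound_s)
    return ok / len(latencies_s)


def cap_available(latencies_s):
    """True if every request eventually got a response, however late (the CAP notion)."""
    return all(t is not None for t in latencies_s)


# Made-up latency samples in seconds; one request took ten hours.
samples = [0.12, 0.30, 2.50, 36000.0, 0.08, 0.40, 0.22, 0.15, 0.09, 0.11]

if __name__ == "__main__":
    print(cap_available(samples))              # every request was answered
    print(bounded_availability(samples, 2.0))  # 8 of 10 met the 2-second bound
```

With these samples the system is fully available in the CAP sense, yet it would miss a 99.9% target against a 2-second bound by a wide margin.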

This makes Abadi's PACELC approach all the more useful. A real-world availability criterion effectively tells us two things: what the latency spec is when we are not partitioned, and how reliable the system must be in terms of not having partitions. The PACELC approach has you trade latency against consistency in the happy-path (unpartitioned) case, and trade availability against consistency when packet loss happens.


Sunday, August 21, 2011
