Chapter 73. The Implications of the CAP Theorem

Paul Doran

The CAP theorem forces us to compromise between consistency, availability, and partition tolerance in distributed data systems:

  • Consistency means all clients see the same response to a query.

  • Availability means every query from a client gets a response.

  • Partition tolerance means the system works if the system loses messages.

As data engineers, we must accept that distributed data systems will have partitions, so we need to understand the compromise between consistency and availability. Constructing robust data pipelines requires us to understand what could go wrong. By definition, a data pipeline needs to move data from one place to another.

Here are three things to note about the CAP theorem’s impact on system design:

  • Attempting to classify data systems as being either CP or AP is futile. The classification could depend on the operation or the configuration. The system might not meet the theorem’s definitions of consistency and availability.

  • While the CAP theorem seems limiting in practice, it cuts off only a tiny part of the design space. It disallows those systems that have perfect availability and consistency in the presence of network partitions.

  • The only fault considered is a network partition. We know from practice that we need to think about many more failure modes in real-world systems.

Having said that, the ...

Get 97 Things Every Data Engineer Should Know now with the O’Reilly learning platform.

O’Reilly members experience books, live events, courses curated by job role, and more from O’Reilly and nearly 200 top publishers.