ServerlessBase Blog
  • CAP Theorem Explained: Consistency, Availability, Partition Tolerance

    Understanding the fundamental trade-offs in distributed systems and how CAP theorem applies to database design

    CAP Theorem Explained: Consistency, Availability, Partition Tolerance

    You've probably heard developers talk about "eventual consistency" or "strong consistency" when discussing databases. They're usually referring to the CAP theorem, a fundamental concept in distributed systems that every engineer should understand. The theorem describes the trade-offs you make when designing distributed systems, and it directly impacts how your application behaves under failure conditions.

    What the CAP Theorem Actually Says

    The CAP theorem states that in a distributed data store, you can only guarantee two out of three properties at any given time:

    • Consistency: Every read receives the most recent write or an error
    • Availability: Every request receives a (non-error) response, without the guarantee that it contains the most recent write
    • Partition Tolerance: The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

    The third property, partition tolerance, is unavoidable in distributed systems. If you have multiple nodes spread across different machines or data centers, the network between them will eventually fail. Therefore, the real choice is between consistency and availability.

    The Three Pillars Explained

    Consistency

    Consistency means that all nodes see the same data at the same time. When you write data to one node, all other nodes must be updated before any read can return that data. This is what you expect from a traditional relational database like PostgreSQL or MySQL.

    -- Strong consistency example
    BEGIN TRANSACTION;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    COMMIT;
    -- After COMMIT, all reads will see the updated balances

    Availability

    Availability means that every request receives a response, without the guarantee that it contains the most recent write. If a node fails, the system continues to serve requests from other nodes. This is what you get with many NoSQL databases and distributed caches like Redis.

    # Availability example - reads succeed even if some nodes are down
    redis-cli SET key value
    redis-cli GET key  # Returns value even if some replicas are lagging

    Partition Tolerance

    Partition tolerance means the system continues to operate despite network partitions. When nodes can't communicate, the system must decide how to handle data. This is unavoidable in distributed systems because networks are inherently unreliable.

    The CAP Trade-off: CP vs AP

    When a network partition occurs, you must choose between consistency and availability. This creates two main system types:

    CP Systems (Consistency + Partition Tolerance)

    CP systems prioritize consistency over availability. When a partition occurs, they refuse to answer requests until the partition is resolved. This ensures that all reads return the same data, but it means some requests will fail during partitions.

    Examples: PostgreSQL with synchronous replication, MongoDB with majority reads, etcd, ZooKeeper

    # MongoDB CP configuration example
    replication:
      replSetName: "rs0"
      priority:
        - 1
      settings:
        heartbeatIntervalMillis: 2000
        electionTimeoutMillis: 10000
        catchUpTimeoutMillis: -1

    Pros:

    • Guarantees data correctness
    • No stale reads
    • Predictable behavior

    Cons:

    • Unavailable during partitions
    • Slower writes (synchronous replication)
    • More complex to implement

    AP Systems (Availability + Partition Tolerance)

    AP systems prioritize availability over consistency. When a partition occurs, they continue to serve requests, even if it means returning stale data. This ensures that all requests succeed, but some reads may return outdated values.

    Examples: Cassandra, DynamoDB, Couchbase, Redis Cluster

    # Cassandra AP configuration example
    CREATE TABLE users (
      user_id uuid PRIMARY KEY,
      name text,
      email text
    ) WITH replication = {
      'class': 'SimpleStrategy',
      'replication_factor': 3
    };

    Pros:

    • Always available
    • Faster writes (asynchronous replication)
    • Better user experience during failures

    Cons:

    • Stale reads possible
    • Data conflicts
    • Requires conflict resolution

    When to Choose CP vs AP

    Choose CP When:

    • Data correctness is critical: Financial systems, inventory management, medical records
    • You need strong guarantees: No stale reads, no lost updates
    • Your application can tolerate downtime: Batch processing, background jobs
    • You have strong consistency requirements: ACID transactions

    Example: An e-commerce platform that must never sell the same item twice. If a partition prevents updating inventory across all nodes, the system should refuse the sale rather than sell it twice.

    Choose AP When:

    • Availability is critical: Social media feeds, real-time analytics, gaming
    • You can tolerate some stale data: User profiles, search results, caching layers
    • High write throughput is needed: Logging systems, time-series data
    • You need low latency: Real-time applications, chat applications

    Example: A social media feed that shows posts from your friends. If a partition prevents updating the feed, it's better to show the latest available posts than to show nothing at all.

    The BASE Alternative

    Many developers find the CAP theorem too binary. The BASE alternative provides a more nuanced view:

    • Basically Available: The system is available during network partitions
    • Asynchronous consistency: Data consistency is eventually achieved
    • Soft state: Data can change over time
    • Eventual consistency: The system will converge to a consistent state

    BASE systems are essentially AP systems that eventually become consistent. They're common in distributed caches and NoSQL databases.

    Real-World Trade-offs

    Database Replication Strategies

    StrategyConsistencyAvailabilityUse Case
    Synchronous replicationStrongLowFinancial systems, critical data
    Asynchronous replicationEventualHighCaching, social media feeds
    Multi-leader replicationWeakHighMulti-region deployments
    Leader-follower replicationStrongMediumMost relational databases

    Network Partition Scenarios

    Scenario 1: E-commerce checkout

    • Choice: CP
    • Reason: Cannot sell the same item twice
    • Behavior: Refuse checkout during partition, show error message

    Scenario 2: Social media feed

    • Choice: AP
    • Reason: Users expect to see posts even if some servers are down
    • Behavior: Serve stale data during partition, update when partition resolves

    Scenario 3: Real-time analytics

    • Choice: AP
    • Reason: Analytics can tolerate some data loss
    • Behavior: Continue collecting data, batch process later

    Practical Implementation

    Choosing the Right Database

    When selecting a database for your application, consider these questions:

    1. Is data correctness more important than availability?

      • Yes → Consider CP databases (PostgreSQL, MongoDB)
      • No → Consider AP databases (Cassandra, DynamoDB)
    2. Can your application tolerate stale reads?

      • No → Choose CP
      • Yes → Choose AP
    3. What happens during network partitions?

      • Refuse requests → CP
      • Serve stale data → AP

    Hybrid Approaches

    Many systems use hybrid approaches to balance consistency and availability:

    Read replicas with eventual consistency: Primary database is CP, read replicas are AP. Reads can be served from replicas for performance.

    Multi-region deployments: Use AP for global reads, CP for local writes. Implement conflict resolution strategies.

    Caching layers: Use AP caches (Redis) with cache invalidation strategies to balance performance and consistency.

    Common Misconceptions

    "CAP theorem means you can only have two properties"

    The theorem states that you can only guarantee two properties at any given time. The third property may still exist, but it's not guaranteed. For example, an AP system may eventually become consistent after a partition resolves.

    "CP systems are always consistent"

    CP systems guarantee consistency during normal operation and during partitions. However, they can still have consistency issues due to bugs, misconfigurations, or application-level logic.

    "AP systems are always available"

    AP systems prioritize availability, but they can still become unavailable due to other failure modes like node crashes, resource exhaustion, or configuration errors.

    Conclusion

    The CAP theorem isn't about choosing one system type over another. It's about understanding the trade-offs you're making and choosing the right approach for your use case. Every distributed system makes these trade-offs, whether explicitly or implicitly.

    When designing your system, ask yourself: "What happens when the network fails?" Your answer will guide you toward the right consistency and availability strategy. Remember that there's no perfect choice—only trade-offs that make sense for your specific requirements.

    Platforms like ServerlessBase make it easier to deploy and manage distributed databases with built-in replication and failover, so you can focus on implementing the right consistency model for your application without worrying about the underlying infrastructure.


    Next Steps:

    • Understand database indexing and query optimization
    • Learn about database transactions and isolation levels
    • Explore read replicas and write scaling strategies

    Leave comment