ServerlessBase Blog

    A comprehensive guide to database high availability patterns, redundancy strategies, and failover mechanisms for mission-critical applications

    Understanding Database High Availability Patterns

    You've deployed your application to production, and everything looks great. Then the inevitable happens: your database goes down. Maybe it's a hardware failure, a network partition, or a misconfigured update. Suddenly, your entire application grinds to a halt. Users can't log in, transactions fail, and your team is scrambling to restore service.

    This is where database high availability (HA) patterns become critical. HA isn't just about preventing downtime—it's about designing systems that can survive failures without human intervention. In this article, you'll learn the core HA patterns used in production systems, when to apply each one, and the trade-offs involved.

    What Is High Availability in Databases?

    High availability means your database remains accessible and operational as much as possible. In practice, this translates to uptime targets like 99.9% (3 nines), 99.99% (4 nines), or even 99.999% (5 nines). Achieving these targets requires eliminating single points of failure, implementing redundancy, and designing graceful failure modes.
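    Those percentages translate into hard downtime budgets, and the arithmetic is worth internalizing. A short script makes the budgets concrete:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)

# 3 nines is roughly 8.8 hours/year; 4 nines ~53 minutes; 5 nines ~5.3 minutes
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):.1f} min/year")
```

    Five nines leaves about five minutes of downtime per year, which is why it is rarely achievable without fully automated failover.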

    The core principle is simple: if one component fails, another takes over seamlessly. This requires careful planning, but the effort pays off in reliability and user trust.

    Core HA Patterns

    1. Active-Passive (Primary-Standby) Replication

    Active-passive replication is the most common HA pattern. You maintain two database instances: one active (the primary) that handles all writes, and one passive (the standby) that replicates changes from the primary. Older documentation calls this master-slave replication.

    When the primary fails, the standby takes over. Failover can be automatic or manual, depending on your configuration.

    # Example: PostgreSQL primary-standby configuration
    primary:
      host: db-primary.example.com
      port: 5432
    standby:
      host: db-standby.example.com
      port: 5432
      trigger_file: /var/lib/postgresql/trigger

    How it works:

    1. The primary accepts all write operations
    2. Changes are asynchronously replicated to the standby
    3. The standby applies the same changes
    4. On primary failure, the standby is promoted to primary
    5. New connections are redirected to the new primary
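    The last step is often the client's job: the driver walks an ordered host list and uses the first node that accepts a connection. A minimal sketch; the host names and the `connect` callable are stand-ins, not a specific driver's API:

```python
def connect_with_failover(hosts, connect):
    """Try each host in order; return the first successful connection.

    `connect` is any callable that opens a connection or raises on failure.
    """
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except ConnectionError as exc:
            last_error = exc  # node is down or not yet promoted; try the next
    raise last_error or ConnectionError("no hosts available")
```

    After failover the old primary refuses connections, so the client transparently lands on the promoted standby without a configuration change.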

    Pros:

    • Simple to implement
    • Good read scaling (standby can serve reads)
    • Low operational overhead

    Cons:

    • Data loss possible during failover (asynchronous replication lag)
    • Brief downtime during failover (longer if promotion is manual)
    • Write scalability limited to a single primary

    Best for: Most applications where 99.9% uptime is sufficient and read scaling is needed.

    2. Active-Active Replication

    Active-active replication goes a step further by having multiple masters that can accept writes simultaneously. This provides both HA and write scalability.

    # Example: Multi-master PostgreSQL with BDR
    database:
      nodes:
        - host: db-us-east-1.example.com
          region: us-east-1
        - host: db-us-west-2.example.com
          region: us-west-2
        - host: db-eu-central-1.example.com
          region: eu-central-1
      conflict_resolution: last_write_wins

    How it works:

    1. Multiple masters accept writes independently
    2. Changes are asynchronously replicated to other masters
    3. Applications connect to the nearest master
    4. Conflict resolution handles write collisions
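    Connecting to the nearest master can be as simple as a region lookup. A minimal sketch using the node list from the config above; the `nearest_master` helper and the fallback choice are illustrative:

```python
# Region -> master mapping, mirroring the multi-master config above
NODES = {
    "us-east-1": "db-us-east-1.example.com",
    "us-west-2": "db-us-west-2.example.com",
    "eu-central-1": "db-eu-central-1.example.com",
}

def nearest_master(client_region: str, fallback: str = "us-east-1") -> str:
    """Route a client to its regional master, falling back to a default."""
    return NODES.get(client_region, NODES[fallback])
```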

    Pros:

    • Maximum availability (no single point of failure)
    • Write scalability across multiple regions
    • Geographic distribution for compliance

    Cons:

    • Complex conflict resolution
    • Higher operational overhead
    • Increased data loss risk
    • More expensive (multiple database instances)

    Best for: Global applications with distributed users, where write throughput and geographic compliance are critical.

    3. Multi-Master with Conflict Resolution

    Active-active replication introduces write conflicts when the same data is modified simultaneously on different masters. You need a conflict resolution strategy.

    Resolution Strategy | Description                          | Use Case
    --------------------|--------------------------------------|--------------------------------------------
    Last Write Wins     | The most recent write wins           | Simple applications, low conflict frequency
    Timestamp-based     | Uses timestamps to determine order   | Applications with consistent clocks
    Application-level   | Custom logic resolves conflicts      | Complex business logic
    Vector Clocks       | Tracks version vectors for conflicts | Distributed systems, eventual consistency

    # Example: Last-write-wins conflict resolution
    def resolve_conflict(record1, record2):
        if record1.updated_at > record2.updated_at:
            return record1
        return record2

    Best for: Applications with low write conflict frequency or custom conflict resolution logic.
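    The vector-clock row in the table is the least obvious strategy, so here is a minimal sketch: each node keeps a per-node counter, one version supersedes another only if every counter is at least as large, and a genuine conflict is when neither side dominates. Node names are illustrative:

```python
def compare_clocks(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'a_newer', 'b_newer', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "conflict"  # concurrent writes: the application must merge them
```

    The "conflict" branch is exactly where the application-level strategy from the table takes over.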

    4. Cluster-Based HA (Galera Cluster, etc.)

    Cluster-based HA uses synchronous replication across multiple nodes. All nodes participate in writes, ensuring data consistency.

    # Example: Galera Cluster configuration
    [galera]
    wsrep_provider=/usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name=db_cluster
    wsrep_cluster_address=gcomm://node1,node2,node3
    wsrep_sst_method=rsync

    How it works:

    1. All nodes are equal participants
    2. Writes are certified and applied on all nodes before commit
    3. A quorum (majority) of nodes must be reachable for the cluster to accept writes
    4. Any node can handle writes
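    The quorum rule is plain majority arithmetic, which is also why production clusters use an odd number of nodes: a two-node cluster loses quorum the moment either node fails. A minimal sketch:

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition may accept writes only if it holds a strict majority of nodes."""
    return reachable > cluster_size // 2
```

    In a three-node cluster, any two nodes keep quorum; in a four-node cluster you still need three, so the fourth node adds cost without adding failure tolerance.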

    Pros:

    • Strong consistency guarantees
    • No single point of failure
    • Automatic node reconfiguration

    Cons:

    • Synchronous replication reduces write performance
    • Higher operational complexity
    • More expensive (multiple full database instances)

    Best for: Applications requiring strong consistency and high availability, like financial systems.

    5. Read Replicas for Read Scaling

    Read replicas are passive copies that serve read queries while the master handles writes. This pattern doesn't provide HA but improves read performance.

    -- Example: Querying a read replica
    SELECT * FROM users WHERE id = 123;
    -- This query goes to a read replica, offloading the master

    How it works:

    1. Master accepts all writes
    2. Writes are asynchronously replicated to read replicas
    3. Read queries are routed to replicas
    4. Write queries always go to the master
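    The routing in steps 3 and 4 is typically a thin layer that inspects each statement. A minimal sketch with round-robin replica selection; the class and host names are illustrative, not a specific proxy's API:

```python
import itertools

class QueryRouter:
    """Route read statements to replicas round-robin; everything else to the primary."""
    READ_PREFIXES = ("SELECT", "SHOW", "EXPLAIN")

    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        if sql.lstrip().upper().startswith(self.READ_PREFIXES):
            return next(self._replicas)
        return self.primary
```

    A real router also needs to pin reads that follow a write to the primary, otherwise replication lag can make a user's own write invisible to them.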

    Pros:

    • Simple to implement
    • Significant read scaling
    • Low operational overhead

    Cons:

    • No HA (master is still a single point of failure)
    • Data staleness (replication lag)

    Best for: Applications with high read-to-write ratios where HA isn't the primary concern.

    Comparison of HA Patterns

    Pattern                               | Availability | Consistency | Complexity | Cost   | Use Case
    --------------------------------------|--------------|-------------|------------|--------|-------------------------
    Active-Passive                        | High         | Strong      | Low        | Medium | Most applications
    Active-Active                         | Very High    | Eventual    | High       | High   | Global applications
    Multi-Master with Conflict Resolution | Very High    | Eventual    | Very High  | High   | Distributed systems
    Cluster-Based (Synchronous)           | Very High    | Strong      | High       | High   | Financial systems
    Read Replicas                         | Low          | Strong      | Low        | Low    | Read-heavy applications

    Implementing Failover

    Failover is the process of transitioning from a failed master to a standby. A well-designed failover is automatic, fast, and transparent to applications.

    Automatic Failover with Patroni

    Patroni is a popular tool for managing PostgreSQL high availability with automatic failover.

    # patroni.yml
    scope: postgres-cluster
    name: db1
    restapi:
      listen: 0.0.0.0:8008
      connect_address: db1.example.com:8008
    etcd3:
      hosts:
        - etcd1:2379
        - etcd2:2379
        - etcd3:2379
    postgresql:
      listen: 0.0.0.0:5432
      connect_address: db1.example.com:5432
      data_dir: /var/lib/postgresql/data
      bin_dir: /usr/lib/postgresql/14/bin
      authentication:
        replication:
          username: replicator
          password: secret
      parameters:
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: "on"

    Patroni provides:

    • Automatic leader election
    • Automatic failover
    • Automatic reconfiguration
    • Health checks
    • Automatic recovery

    Health Checks and Circuit Breakers

    Health checks determine when a master has failed. Circuit breakers prevent cascading failures by temporarily blocking traffic to unhealthy nodes.

    # Example: Circuit breaker pattern
    import time

    class CircuitBreakerOpenError(Exception):
        """Raised when the breaker rejects a call because it is open."""

    class DatabaseCircuitBreaker:
        def __init__(self, failure_threshold=5, timeout=60):
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.timeout = timeout
            self.last_failure_time = None
            self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

        def call(self, func, *args, **kwargs):
            if self.state == "OPEN":
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = "HALF_OPEN"  # allow one probe call through
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is open")

            try:
                result = func(*args, **kwargs)
                if self.state == "HALF_OPEN":
                    self.state = "CLOSED"  # probe succeeded; resume traffic
                self.failure_count = 0
                return result
            except Exception:
                self.failure_count += 1
                self.last_failure_time = time.time()
                if self.failure_count >= self.failure_threshold:
                    self.state = "OPEN"
                raise

    Monitoring and Alerting

    You can't protect what you don't monitor. Effective HA monitoring tracks replication lag, node health, and failover events.

    Key Metrics to Monitor

    Metric            | Why It Matters                                | Alert Threshold
    ------------------|-----------------------------------------------|-----------------
    Replication Lag   | Indicates data loss risk during failover      | > 30 seconds
    Node Health       | Detects failed nodes before they cause issues | Any failure
    Failover Count    | Tracks HA system activity                     | > 5 per day
    Connection Errors | Indicates client-side issues                  | > 10 per minute
    Write Latency     | Measures performance impact                   | > 100ms

    -- Example: Monitoring replication lag with pg_stat_replication
    SELECT
      client_addr,
      state,
      sync_state,
      replay_lag
    FROM pg_stat_replication;
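    Feeding those rows into the lag threshold from the table above takes only a few lines of glue code. A minimal sketch; the row shape (address, lag in seconds) is an assumption about how you fetch the query results:

```python
LAG_ALERT_SECONDS = 30  # matches the "> 30 seconds" threshold above

def check_replication_lag(rows):
    """Return the replicas whose replay lag exceeds the alert threshold.

    Each row is (client_addr, replay_lag_seconds); lag may be None for
    replicas that have not reported yet.
    """
    return [addr for addr, lag in rows
            if lag is not None and lag > LAG_ALERT_SECONDS]
```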

    Alerting Best Practices

    1. Set appropriate thresholds: Don't alert on every minor fluctuation
    2. Use multiple channels: Email, Slack, PagerDuty for critical alerts
    3. Group related alerts: Combine similar alerts to reduce noise
    4. Include context: Alerts should explain what's happening and what to do
    5. Test your alerts: Verify alerts fire correctly before relying on them

    Testing Your HA Setup

    You can't rely on HA patterns without testing them. Regular failover drills ensure your system responds correctly when failures occur.

    Automated Failover Testing

    #!/bin/bash
    # test-failover.sh: automated failover test script

    PRIMARY_HOST="$1"
    STANDBY_HOST="$2"

    if [ -z "$PRIMARY_HOST" ] || [ -z "$STANDBY_HOST" ]; then
        echo "Usage: $0 <primary-host> <standby-host>"
        exit 2
    fi

    echo "Starting failover test..."

    # Stop the primary database
    ssh "$PRIMARY_HOST" "systemctl stop postgresql"

    # Wait for failover
    sleep 30

    # Verify the standby is now accepting connections
    if ssh "$STANDBY_HOST" "pg_isready -h localhost" | grep -q "accepting connections"; then
        echo "✓ Failover successful"
        exit 0
    else
        echo "✗ Failover failed"
        exit 1
    fi

    Run this test weekly to ensure your HA setup works correctly.

    Disaster Recovery Testing

    Don't just test failover—test full disaster recovery. Simulate a complete database loss and verify you can restore from backups.

    # Example: Full DR test
    1. Stop all database nodes
    2. Delete all database data
    3. Restore from latest backup
    4. Promote standby to master
    5. Verify application connectivity
    6. Run smoke tests

    Common Pitfalls

    1. Assuming Automatic Failover Is Enough

    Automatic failover is only as good as your configuration. Misconfigured health checks, network partitions, or resource exhaustion can prevent proper failover.

    Solution: Regularly test your failover process and monitor alerting.

    2. Ignoring Replication Lag

    High replication lag means data loss risk during failover. If your standby is 5 minutes behind, you could lose 5 minutes of transactions.

    Solution: Monitor replication lag and set alerts for thresholds that match your RPO (Recovery Point Objective).

    3. Forgetting to Update Application Configuration

    After failover, your application might still point to the old master. This causes connection errors until you update your configuration.

    Solution: Use DNS-based load balancing or a service discovery system that automatically updates on failover.

    4. Over-Engineering for High Availability

    Not every application needs 99.999% uptime. A blog site can tolerate more downtime than a payment processing system.

    Solution: Determine your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) and choose patterns that meet those requirements.

    Choosing the Right Pattern

    Selecting the right HA pattern depends on your application's requirements, budget, and team expertise.

    Start with these questions:

    1. What is your required uptime (RTO/RPO)?
    2. How much data loss can you tolerate?
    3. What is your budget for database infrastructure?
    4. How complex is your application's data model?
    5. Do you have expertise in managing HA systems?

    Decision guide:

    • 99.9% uptime, moderate complexity: Active-passive replication
    • 99.99% uptime, strong consistency: Cluster-based (synchronous)
    • Global users, high write volume: Active-active with conflict resolution
    • Read-heavy application: Read replicas + active-passive

    Conclusion

    Database high availability is a critical component of reliable systems. By understanding the core HA patterns—active-passive, active-active, cluster-based, and read replicas—you can choose the right approach for your application.

    Remember that HA is not a one-time implementation but an ongoing process. Regular testing, monitoring, and maintenance ensure your HA system continues to protect your data and keep your application running.

    Platforms like ServerlessBase simplify database HA by providing managed services with built-in replication, automatic failover, and monitoring. This lets you focus on your application logic while the platform handles the complex HA infrastructure.

    Next steps:

    1. Assess your current database architecture for single points of failure
    2. Choose an HA pattern that meets your RTO/RPO requirements
    3. Implement monitoring and alerting for replication lag and node health
    4. Schedule regular failover drills to test your HA setup
    5. Document your HA procedures for your operations team
