ServerlessBase Blog

    A comprehensive guide to database high availability patterns, redundancy strategies, and failover mechanisms for mission-critical applications

    Understanding Database High Availability Patterns

    You've deployed your application to production, and everything looks great. Then the inevitable happens: your database goes down. Maybe it's a hardware failure, a network partition, or a misconfigured update. Suddenly, your entire application grinds to a halt. Users can't log in, transactions fail, and your team is scrambling to restore service.

    This is where database high availability (HA) patterns become critical. HA isn't just about preventing downtime—it's about designing systems that can survive failures without human intervention. In this article, you'll learn the core HA patterns used in production systems, when to apply each one, and the trade-offs involved.

    What Is High Availability in Databases?

    High availability means your database remains accessible and operational as much as possible. In practice, this translates to uptime targets like 99.9% (3 nines), 99.99% (4 nines), or even 99.999% (5 nines). Achieving these targets requires eliminating single points of failure, implementing redundancy, and designing graceful failure modes.
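    Those percentages translate into hard downtime budgets, and the arithmetic is worth internalizing. A short script makes the budgets concrete:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(availability: float) -> float:
    """Allowed downtime per year for a given availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability)

# 3 nines is roughly 8.8 hours/year; 4 nines ~53 minutes; 5 nines ~5.3 minutes
for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} -> {downtime_budget_minutes(target):.1f} min/year")
```

    Five nines leaves about five minutes of downtime per year, which is why it is rarely achievable without fully automated failover.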

    The core principle is simple: if one component fails, another takes over seamlessly. This requires careful planning, but the effort pays off in reliability and user trust.

    Core HA Patterns

    1. Active-Passive (Primary-Standby) Replication

    Active-passive replication is the most common HA pattern. You maintain two database instances: one active (the primary) that handles all writes, and one passive (the standby) that replicates changes from the primary. Older documentation calls this master-slave replication.

    When the primary fails, the standby takes over. Failover can be automatic or manual, depending on your configuration.

    # Example: PostgreSQL primary-standby configuration
    primary:
      host: db-primary.example.com
      port: 5432
    standby:
      host: db-standby.example.com
      port: 5432
      trigger_file: /var/lib/postgresql/trigger

    How it works:

    1. The primary accepts all write operations
    2. Changes are asynchronously replicated to the standby
    3. The standby applies the same changes
    4. On primary failure, the standby is promoted to primary
    5. New connections are redirected to the new primary
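    The last step is often the client's job: the driver walks an ordered host list and uses the first node that accepts a connection. A minimal sketch; the host names and the `connect` callable are stand-ins, not a specific driver's API:

```python
def connect_with_failover(hosts, connect):
    """Try each host in order; return the first successful connection.

    `connect` is any callable that opens a connection or raises on failure.
    """
    last_error = None
    for host in hosts:
        try:
            return connect(host)
        except ConnectionError as exc:
            last_error = exc  # node is down or not yet promoted; try the next
    raise last_error or ConnectionError("no hosts available")
```

    After failover the old primary refuses connections, so the client transparently lands on the promoted standby without a configuration change.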

    Pros:

    • Simple to implement
    • Good read scaling (standby can serve reads)
    • Low operational overhead

    Cons:

    • Data loss possible during failover (asynchronous replication lag)
    • Brief downtime during failover (longer if promotion is manual)
    • Write scalability limited to a single primary

    Best for: Most applications where 99.9% uptime is sufficient and read scaling is needed.

    2. Active-Active Replication

    Active-active replication goes a step further by having multiple masters that can accept writes simultaneously. This provides both HA and write scalability.

    # Example: Multi-master PostgreSQL with BDR
    database:
      nodes:
        - host: db-us-east-1.example.com
          region: us-east-1
        - host: db-us-west-2.example.com
          region: us-west-2
        - host: db-eu-central-1.example.com
          region: eu-central-1
      conflict_resolution: last_write_wins

    How it works:

    1. Multiple masters accept writes independently
    2. Changes are asynchronously replicated to other masters
    3. Applications connect to the nearest master
    4. Conflict resolution handles write collisions
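    Connecting to the nearest master can be as simple as a region lookup. A minimal sketch using the node list from the config above; the `nearest_master` helper and the fallback choice are illustrative:

```python
# Region -> master mapping, mirroring the multi-master config above
NODES = {
    "us-east-1": "db-us-east-1.example.com",
    "us-west-2": "db-us-west-2.example.com",
    "eu-central-1": "db-eu-central-1.example.com",
}

def nearest_master(client_region: str, fallback: str = "us-east-1") -> str:
    """Route a client to its regional master, falling back to a default."""
    return NODES.get(client_region, NODES[fallback])
```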

    Pros:

    • Maximum availability (no single point of failure)
    • Write scalability across multiple regions
    • Geographic distribution for compliance

    Cons:

    • Complex conflict resolution
    • Higher operational overhead
    • Increased data loss risk
    • More expensive (multiple database instances)

    Best for: Global applications with distributed users, where write throughput and geographic compliance are critical.

    3. Multi-Master with Conflict Resolution

    Active-active replication introduces write conflicts when the same data is modified simultaneously on different masters. You need a conflict resolution strategy.

    Resolution Strategy | Description                          | Use Case
    --------------------|--------------------------------------|--------------------------------------------
    Last Write Wins     | The most recent write wins           | Simple applications, low conflict frequency
    Timestamp-based     | Uses timestamps to determine order   | Applications with consistent clocks
    Application-level   | Custom logic resolves conflicts      | Complex business logic
    Vector Clocks       | Tracks version vectors for conflicts | Distributed systems, eventual consistency

    # Example: Last-write-wins conflict resolution
    def resolve_conflict(record1, record2):
        if record1.updated_at > record2.updated_at:
            return record1
        return record2

    Best for: Applications with low write conflict frequency or custom conflict resolution logic.
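    The vector-clock row in the table is the least obvious strategy, so here is a minimal sketch: each node keeps a per-node counter, one version supersedes another only if every counter is at least as large, and a genuine conflict is when neither side dominates. Node names are illustrative:

```python
def compare_clocks(a: dict, b: dict) -> str:
    """Compare two vector clocks: 'a_newer', 'b_newer', 'equal', or 'conflict'."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_newer"
    if b_ge:
        return "b_newer"
    return "conflict"  # concurrent writes: the application must merge them
```

    The "conflict" branch is exactly where the application-level strategy from the table takes over.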

    4. Cluster-Based HA (Galera Cluster, etc.)

    Cluster-based HA uses synchronous replication across multiple nodes. All nodes participate in writes, ensuring data consistency.

    # Example: Galera Cluster configuration
    [galera]
    wsrep_provider=/usr/lib/galera/libgalera_smm.so
    wsrep_cluster_name=db_cluster
    wsrep_cluster_address=gcomm://node1,node2,node3
    wsrep_sst_method=rsync

    How it works:

    1. All nodes are equal participants
    2. Writes are certified and applied on all nodes before commit
    3. A quorum (majority) of nodes must be reachable for the cluster to accept writes
    4. Any node can handle writes
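    The quorum rule is plain majority arithmetic, which is also why production clusters use an odd number of nodes: a two-node cluster loses quorum the moment either node fails. A minimal sketch:

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """A partition may accept writes only if it holds a strict majority of nodes."""
    return reachable > cluster_size // 2
```

    In a three-node cluster, any two nodes keep quorum; in a four-node cluster you still need three, so the fourth node adds cost without adding failure tolerance.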

    Pros:

    • Strong consistency guarantees
    • No single point of failure
    • Automatic node reconfiguration

    Cons:

    • Synchronous replication reduces write performance
    • Higher operational complexity
    • More expensive (multiple full database instances)

    Best for: Applications requiring strong consistency and high availability, like financial systems.

    5. Read Replicas for Read Scaling

    Read replicas are passive copies that serve read queries while the master handles writes. This pattern doesn't provide HA but improves read performance.

    -- Example: Querying a read replica
    SELECT * FROM users WHERE id = 123;
    -- This query goes to a read replica, offloading the master

    How it works:

    1. Master accepts all writes
    2. Writes are asynchronously replicated to read replicas
    3. Read queries are routed to replicas
    4. Write queries always go to the master
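    The routing in steps 3 and 4 is typically a thin layer that inspects each statement. A minimal sketch with round-robin replica selection; the class and host names are illustrative, not a specific proxy's API:

```python
import itertools

class QueryRouter:
    """Route read statements to replicas round-robin; everything else to the primary."""
    READ_PREFIXES = ("SELECT", "SHOW", "EXPLAIN")

    def __init__(self, primary: str, replicas: list):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql: str) -> str:
        if sql.lstrip().upper().startswith(self.READ_PREFIXES):
            return next(self._replicas)
        return self.primary
```

    A real router also needs to pin reads that follow a write to the primary, otherwise replication lag can make a user's own write invisible to them.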

    Pros:

    • Simple to implement
    • Significant read scaling
    • Low operational overhead

    Cons:

    • No HA (master is still a single point of failure)
    • Data staleness (replication lag)

    Best for: Applications with high read-to-write ratios where HA isn't the primary concern.

    Comparison of HA Patterns

    Pattern                               | Availability | Consistency | Complexity | Cost   | Use Case
    --------------------------------------|--------------|-------------|------------|--------|-------------------------
    Active-Passive                        | High         | Strong      | Low        | Medium | Most applications
    Active-Active                         | Very High    | Eventual    | High       | High   | Global applications
    Multi-Master with Conflict Resolution | Very High    | Eventual    | Very High  | High   | Distributed systems
    Cluster-Based (Synchronous)           | Very High    | Strong      | High       | High   | Financial systems
    Read Replicas                         | Low          | Strong      | Low        | Low    | Read-heavy applications

    Implementing Failover

    Failover is the process of transitioning from a failed master to a standby. A well-designed failover is automatic, fast, and transparent to applications.

    Automatic Failover with Patroni

    Patroni is a popular tool for managing PostgreSQL high availability with automatic failover.

    # patroni.yml
    scope: postgres-cluster
    name: db1
    restapi:
      listen: 0.0.0.0:8008
      connect_address: db1.example.com:8008
    etcd3:
      hosts:
        - etcd1:2379
        - etcd2:2379
        - etcd3:2379
    postgresql:
      listen: 0.0.0.0:5432
      connect_address: db1.example.com:5432
      data_dir: /var/lib/postgresql/data
      bin_dir: /usr/lib/postgresql/14/bin
      authentication:
        replication:
          username: replicator
          password: secret
      parameters:
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: "on"

    Patroni provides:

    • Automatic leader election
    • Automatic failover
    • Automatic reconfiguration
    • Health checks
    • Automatic recovery

    Health Checks and Circuit Breakers

    Health checks determine when a master has failed. Circuit breakers prevent cascading failures by temporarily blocking traffic to unhealthy nodes.

    # Example: Circuit breaker pattern
    import time

    class CircuitBreakerOpenError(Exception):
        """Raised when the breaker rejects a call because it is open."""

    class DatabaseCircuitBreaker:
        def __init__(self, failure_threshold=5, timeout=60):
            self.failure_count = 0
            self.failure_threshold = failure_threshold
            self.timeout = timeout
            self.last_failure_time = None
            self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

        def call(self, func, *args, **kwargs):
            if self.state == "OPEN":
                if time.time() - self.last_failure_time > self.timeout:
                    self.state = "HALF_OPEN"  # allow one probe call through
                else:
                    raise CircuitBreakerOpenError("Circuit breaker is open")

            try:
                result = func(*args, **kwargs)
                if self.state == "HALF_OPEN":
                    self.state = "CLOSED"  # probe succeeded; resume traffic
                self.failure_count = 0
                return result
            except Exception:
                self.failure_count += 1
                self.last_failure_time = time.time()
                if self.failure_count >= self.failure_threshold:
                    self.state = "OPEN"
                raise

    Monitoring and Alerting

    You can't protect what you don't monitor. Effective HA monitoring tracks replication lag, node health, and failover events.

    Key Metrics to Monitor

    Metric            | Why It Matters                                | Alert Threshold
    ------------------|-----------------------------------------------|-----------------
    Replication Lag   | Indicates data loss risk during failover      | > 30 seconds
    Node Health       | Detects failed nodes before they cause issues | Any failure
    Failover Count    | Tracks HA system activity                     | > 5 per day
    Connection Errors | Indicates client-side issues                  | > 10 per minute
    Write Latency     | Measures performance impact                   | > 100ms

    -- Example: Monitoring replication lag with pg_stat_replication
    SELECT
      client_addr,
      state,
      sync_state,
      replay_lag
    FROM pg_stat_replication;
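    Feeding those rows into the lag threshold from the table above takes only a few lines of glue code. A minimal sketch; the row shape (address, lag in seconds) is an assumption about how you fetch the query results:

```python
LAG_ALERT_SECONDS = 30  # matches the "> 30 seconds" threshold above

def check_replication_lag(rows):
    """Return the replicas whose replay lag exceeds the alert threshold.

    Each row is (client_addr, replay_lag_seconds); lag may be None for
    replicas that have not reported yet.
    """
    return [addr for addr, lag in rows
            if lag is not None and lag > LAG_ALERT_SECONDS]
```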

    Alerting Best Practices

    1. Set appropriate thresholds: Don't alert on every minor fluctuation
    2. Use multiple channels: Email, Slack, PagerDuty for critical alerts
    3. Group related alerts: Combine similar alerts to reduce noise
    4. Include context: Alerts should explain what's happening and what to do
    5. Test your alerts: Verify alerts fire correctly before relying on them

    Testing Your HA Setup

    You can't rely on HA patterns without testing them. Regular failover drills ensure your system responds correctly when failures occur.

    Automated Failover Testing

    #!/bin/bash
    # test-failover.sh: automated failover test script

    PRIMARY_HOST="$1"
    STANDBY_HOST="$2"

    if [ -z "$PRIMARY_HOST" ] || [ -z "$STANDBY_HOST" ]; then
        echo "Usage: $0 <primary-host> <standby-host>"
        exit 2
    fi

    echo "Starting failover test..."

    # Stop the primary database
    ssh "$PRIMARY_HOST" "systemctl stop postgresql"

    # Wait for failover
    sleep 30

    # Verify the standby is now accepting connections
    if ssh "$STANDBY_HOST" "pg_isready -h localhost" | grep -q "accepting connections"; then
        echo "✓ Failover successful"
        exit 0
    else
        echo "✗ Failover failed"
        exit 1
    fi

    Run this test weekly to ensure your HA setup works correctly.

    Disaster Recovery Testing

    Don't just test failover—test full disaster recovery. Simulate a complete database loss and verify you can restore from backups.

    # Example: Full DR test
    1. Stop all database nodes
    2. Delete all database data
    3. Restore from latest backup
    4. Promote standby to master
    5. Verify application connectivity
    6. Run smoke tests

    Common Pitfalls

    1. Assuming Automatic Failover Is Enough

    Automatic failover is only as good as your configuration. Misconfigured health checks, network partitions, or resource exhaustion can prevent proper failover.

    Solution: Regularly test your failover process and monitor alerting.

    2. Ignoring Replication Lag

    High replication lag means data loss risk during failover. If your standby is 5 minutes behind, you could lose 5 minutes of transactions.

    Solution: Monitor replication lag and set alerts for thresholds that match your RPO (Recovery Point Objective).

    3. Forgetting to Update Application Configuration

    After failover, your application might still point to the old master. This causes connection errors until you update your configuration.

    Solution: Use DNS-based load balancing or a service discovery system that automatically updates on failover.

    4. Over-Engineering for High Availability

    Not every application needs 99.999% uptime. A blog site can tolerate more downtime than a payment processing system.

    Solution: Determine your RTO (Recovery Time Objective) and RPO (Recovery Point Objective) and choose patterns that meet those requirements.

    Choosing the Right Pattern

    Selecting the right HA pattern depends on your application's requirements, budget, and team expertise.

    Start with these questions:

    1. What is your required uptime (RTO/RPO)?
    2. How much data loss can you tolerate?
    3. What is your budget for database infrastructure?
    4. How complex is your application's data model?
    5. Do you have expertise in managing HA systems?

    Decision guide:

    • 99.9% uptime, moderate complexity: Active-passive replication
    • 99.99% uptime, strong consistency: Cluster-based (synchronous)
    • Global users, high write volume: Active-active with conflict resolution
    • Read-heavy application: Read replicas + active-passive

    Conclusion

    Database high availability is a critical component of reliable systems. By understanding the core HA patterns—active-passive, active-active, cluster-based, and read replicas—you can choose the right approach for your application.

    Remember that HA is not a one-time implementation but an ongoing process. Regular testing, monitoring, and maintenance ensure your HA system continues to protect your data and keep your application running.

    Platforms like ServerlessBase simplify database HA by providing managed services with built-in replication, automatic failover, and monitoring. This lets you focus on your application logic while the platform handles the complex HA infrastructure.

    Next steps:

    1. Assess your current database architecture for single points of failure
    2. Choose an HA pattern that meets your RTO/RPO requirements
    3. Implement monitoring and alerting for replication lag and node health
    4. Schedule regular failover drills to test your HA setup
    5. Document your HA procedures for your operations team
