ServerlessBase Blog

    A comprehensive guide to understanding etcd, the distributed key-value store that powers Kubernetes' configuration and state management.

    Kubernetes etcd: The Distributed Key-Value Store

    You've deployed a Kubernetes cluster, configured your workloads, and everything seems to be working. Then you try to access the API server, and it's unresponsive. Or worse, you try to scale your deployment, and nothing happens. The root cause is almost always etcd—the distributed key-value store that Kubernetes uses to store all its state.

    etcd is the backbone of Kubernetes. Without it, you don't have a cluster. Understanding how etcd works, how to secure it, and how to keep it healthy is essential for anyone running production Kubernetes clusters.

    What is etcd?

    etcd is a distributed, consistent key-value store designed for reliable, distributed systems. It was created by CoreOS in 2013 to coordinate its Container Linux clusters and was later adopted by Kubernetes as the backend for storing cluster state.

    Think of etcd as a database, but with different priorities. Traditional databases optimize for transactional throughput and complex queries. etcd optimizes for:

    • Consistency: Every read reflects the most recent committed write
    • Partition tolerance: The cluster keeps functioning despite network failures between nodes
    • Durability: Committed writes are replicated so they survive node failures

    In terms of the CAP theorem, etcd is a CP system: it chooses consistency and partition tolerance at the expense of availability, so a minority side of a network partition refuses writes rather than serve inconsistent data.

    How etcd Stores Kubernetes State

    Kubernetes stores everything in etcd as key-value pairs. Here are some examples:

    # A deployment configuration
    /registry/deployments/default/my-app
    {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "metadata": {
        "name": "my-app",
        "namespace": "default"
      },
      "spec": {
        "replicas": 3,
        "selector": {
          "matchLabels": {
            "app": "my-app"
          }
        }
      }
    }
     
    # A pod status
    /registry/pods/default/my-app-abc123
    {
      "apiVersion": "v1",
      "kind": "Pod",
      "status": {
        "phase": "Running",
        "podIP": "10.244.1.5",
        "containerStatuses": [
          {
            "name": "my-app",
            "ready": true,
            "state": {
              "running": {}
            }
          }
        ]
      }
    }

    Every resource in Kubernetes—deployments, services, secrets, configmaps, nodes, and more—exists as a key in etcd. When you create a deployment, Kubernetes writes the deployment object to etcd. When you scale it, Kubernetes updates the replicas count in etcd. When a pod fails, Kubernetes updates the pod's status in etcd.

    etcd Architecture

    etcd is built on several core concepts that make it reliable and consistent.

    Raft Consensus Algorithm

    etcd uses the Raft consensus algorithm to maintain consistency across multiple nodes. Raft ensures that all nodes agree on the same state, even in the presence of failures.

    Raft nodes operate in one of three roles:

    1. Leader: Accepts client writes and replicates them to followers
    2. Follower: Passively replicates entries from the leader
    3. Candidate: A follower that has timed out waiting for the leader and is requesting votes to become the new leader

    When a client writes to etcd, the leader:

    1. Appends the entry to its log
    2. Sends the entry to all followers
    3. Waits for a majority (quorum) to acknowledge
    4. Once acknowledged, commits the entry and responds to the client

    This process guarantees that no data is lost and all nodes eventually converge to the same state.
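The commit rule in step 3 can be sketched in a few lines. This is a toy model, not real Raft (no terms, no log indices, no retries): the leader counts its own copy plus follower acknowledgements and commits once a majority of the cluster has the entry.

```go
package main

import "fmt"

// ackAndCommit is a toy model of the Raft commit rule: an entry is
// committed once a majority of the cluster (leader included) has
// appended it. Real Raft also tracks terms and log-matching checks.
func ackAndCommit(clusterSize int, followerAcks []bool) bool {
	acks := 1 // the leader always has its own copy of the entry
	for _, ok := range followerAcks {
		if ok {
			acks++
		}
	}
	quorum := clusterSize/2 + 1
	return acks >= quorum
}

func main() {
	// 3-node cluster: one follower ack plus the leader reaches quorum (2/3).
	fmt.Println(ackAndCommit(3, []bool{true, false})) // true
	// 5-node cluster: one follower ack (2 copies total) is below quorum (3/5).
	fmt.Println(ackAndCommit(5, []bool{true, false, false, false})) // false
}
```

Note that the leader never needs every follower to respond, which is why a minority of slow or dead nodes does not block writes.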

    Quorum and Majority

    Quorum is the minimum number of nodes that must agree on a write for it to be considered committed. In a 3-node etcd cluster, you need 2 out of 3 nodes to agree (2/3 majority).

    If you have 5 nodes, you need 3 out of 5 (3/5 majority). This provides fault tolerance: the cluster can tolerate up to 2 node failures (2 out of 5) and still maintain quorum, so writes keep succeeding and no committed data is lost.
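The quorum arithmetic is simple enough to capture directly (an illustration; the function names are made up, not etcd APIs):

```go
package main

import "fmt"

// For an n-member cluster, a write needs floor(n/2)+1 acknowledgements,
// so the cluster tolerates n - quorum(n) member failures and stays writable.
func quorum(n int) int { return n/2 + 1 }

func tolerableFailures(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d nodes: quorum %d, tolerates %d failures\n",
			n, quorum(n), tolerableFailures(n))
	}
}
```

This is also why even cluster sizes buy nothing: a 4-node cluster still needs 3 acknowledgements and tolerates only 1 failure, the same as a 3-node cluster.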

    Leader Election and Failover

    When the leader fails, followers time out, become candidates, and vote to elect a new leader. With default timeouts this typically completes within a few seconds. During the election the cluster is unavailable for writes (and for linearizable reads, which are the default); followers can still serve serializable reads, which may be slightly stale.

    Watch and Lease Mechanisms

    etcd supports two important features:

    • Watch: Clients can subscribe to changes in specific keys. When a key is modified, deleted, or created, the watcher receives an event. This is how Kubernetes controllers detect changes and react to them.

    • Lease: Keys can be attached to a lease that expires after a TTL unless the client keeps it alive with heartbeats. This is the building block for ephemeral keys, such as those used for leader election.
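Here is a minimal in-memory sketch of the lease idea, with hypothetical names (`Store`, `Grant`, `Tick`) and an explicit clock instead of real time. Actual etcd leases are granted by the server and kept alive with client KeepAlive heartbeats.

```go
package main

import "fmt"

// Lease is a toy model: keys attached to a lease expire together when
// the lease's TTL runs out without being renewed.
type Lease struct {
	ExpiresAt int64 // unix seconds
	Keys      []string
}

type Store struct {
	kv     map[string]string
	leases map[int64]*Lease
}

func NewStore() *Store {
	return &Store{kv: map[string]string{}, leases: map[int64]*Lease{}}
}

// Grant creates a lease with the given TTL, starting at time `now`.
func (s *Store) Grant(id, ttl, now int64) {
	s.leases[id] = &Lease{ExpiresAt: now + ttl}
}

// PutWithLease stores a key and attaches it to an existing lease.
func (s *Store) PutWithLease(key, val string, leaseID int64) {
	s.kv[key] = val
	l := s.leases[leaseID]
	l.Keys = append(l.Keys, key)
}

// Tick removes every lease that has expired, deleting its attached keys.
func (s *Store) Tick(now int64) {
	for id, l := range s.leases {
		if now >= l.ExpiresAt {
			for _, k := range l.Keys {
				delete(s.kv, k)
			}
			delete(s.leases, id)
		}
	}
}

func main() {
	s := NewStore()
	s.Grant(1, 10, 100) // lease 1, 10s TTL, granted at t=100
	s.PutWithLease("/election/leader", "node-1", 1)
	s.Tick(105)
	fmt.Println(s.kv["/election/leader"]) // node-1 (lease still live)
	s.Tick(110)
	_, ok := s.kv["/election/leader"]
	fmt.Println(ok) // false (lease expired, key removed)
}
```

A leader-election lock built this way releases itself automatically: if the holder crashes and stops renewing, the key disappears when the lease expires.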

    etcd Data Model

    etcd stores data as key-value pairs with a hierarchical structure. Keys are strings, and values are arbitrary byte arrays.

    Key Structure

    etcd uses a hierarchical key structure with namespaces:

    /registry/<resource-type>/<namespace>/<name>

    For example:

    • /registry/deployments/default/my-app
    • /registry/pods/default/my-app-abc123
    • /registry/nodes/k8s-node-1
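The key pattern can be captured in a small helper (a sketch for illustration; Kubernetes builds these keys inside the API server's storage layer, and cluster-scoped resources like nodes simply omit the namespace segment):

```go
package main

import (
	"fmt"
	"strings"
)

// registryKey builds the etcd key for a Kubernetes resource.
// Cluster-scoped resources (e.g. nodes) pass an empty namespace.
func registryKey(resource, namespace, name string) string {
	parts := []string{"", "registry", resource}
	if namespace != "" {
		parts = append(parts, namespace)
	}
	parts = append(parts, name)
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(registryKey("deployments", "default", "my-app"))
	// /registry/deployments/default/my-app
	fmt.Println(registryKey("nodes", "", "k8s-node-1"))
	// /registry/nodes/k8s-node-1
}
```

Because the structure is purely prefix-based, a single etcd range read over `/registry/pods/default/` returns every pod in that namespace.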

    Versioning and Revision

    Every key in etcd has a version number that increments on each update and resets when the key is deleted and recreated. Because etcd's storage is multi-version (MVCC), clients can read a key's historical values at older revisions until those revisions are compacted.

    etcd also maintains a global revision counter that increments with every change. Watches resume from a revision, and it is also useful for snapshotting and incremental backups.
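The two counters can be modeled in a toy sketch (not etcd's actual storage engine): every write bumps the global revision, each key tracks its own version, and the version restarts at 1 when a deleted key is recreated, matching etcd's documented behavior.

```go
package main

import "fmt"

// MVCC is a toy model of etcd's two counters: a store-wide revision
// that increments on every change, and a per-key version.
type MVCC struct {
	revision int64
	versions map[string]int64
}

// Put records a write, returning the new global revision and key version.
func (m *MVCC) Put(key string) (rev, version int64) {
	m.revision++
	m.versions[key]++
	return m.revision, m.versions[key]
}

// Delete removes the key; a later re-create starts again at version 1.
func (m *MVCC) Delete(key string) int64 {
	m.revision++
	delete(m.versions, key)
	return m.revision
}

func main() {
	m := &MVCC{versions: map[string]int64{}}
	key := "/registry/pods/default/my-app-abc123"
	rev, ver := m.Put(key)
	fmt.Println(rev, ver) // 1 1
	rev, ver = m.Put(key)
	fmt.Println(rev, ver) // 2 2
	m.Delete(key)         // revision advances to 3
	rev, ver = m.Put(key)
	fmt.Println(rev, ver) // 4 1
}
```

The asymmetry is the point: revision is a cluster-wide logical clock, while version only tells you how many times this incarnation of the key has been written.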

    etcd in Kubernetes

    Kubernetes uses etcd in several critical ways.

    API Server as the Frontend

    The Kubernetes API server is the only component that directly interacts with etcd. All other components read from or write to the API server, which then updates etcd.

    This separation provides several benefits:

    • The API server can cache frequently accessed data
    • The API server can implement authorization and admission control
    • The API server can provide a consistent interface regardless of the backend

    Controller Loop

    Kubernetes controllers run in a loop that watches for changes (through the API server, whose watch API is backed by etcd watches) and reconciles the desired state with the actual state.

    // Pseudocode for a controller
    for {
        // Watch for changes to deployments (served by the API server,
        // which in turn watches etcd)
        events := watch.Deployments()

        for _, event := range events {
            deployment := event.Object

            // Get the pods currently backing this deployment
            pods := getPods(deployment.Spec.Selector)

            // Reconcile: ensure desired state
            if len(pods) != deployment.Spec.Replicas {
                scaleDeployment(deployment)
            }
        }
    }

    Scheduler and Kubelet

    The scheduler reads pending pods and node information through the API server to make placement decisions. The kubelet watches the API server for pods assigned to its node and ensures they are running. Neither component talks to etcd directly.

    Securing etcd

    etcd is a critical component, and securing it is essential.

    TLS Encryption

    etcd supports TLS for all communication, and you should enable it everywhere. You need three kinds of certificates:

    1. Server certificates: For etcd to authenticate itself to clients
    2. Client certificates: For API servers (and etcdctl) to authenticate to etcd
    3. Peer certificates: For etcd nodes to authenticate to each other

    # Generate a self-signed CA for etcd
    openssl genrsa -out etcd-ca.key 4096
    openssl req -new -x509 -key etcd-ca.key -out etcd-ca.crt -days 365 -subj "/CN=etcd-ca"
     
    # Generate a server key and a certificate signed by that CA
    openssl genrsa -out etcd-server.key 4096
    openssl req -new -key etcd-server.key -out etcd-server.csr -subj "/CN=etcd-server"
    openssl x509 -req -in etcd-server.csr -CA etcd-ca.crt -CAkey etcd-ca.key -CAcreateserial -out etcd-server.crt -days 365

    In practice the server certificate also needs subjectAltName entries for each node's hostnames and IPs, and you repeat the signing step for the client and peer certificates.

    Authentication

    etcd supports multiple authentication methods:

    1. Basic authentication: Username/password pairs
    2. TLS client authentication: Certificates for client authentication
    3. JWT tokens: JSON Web Tokens for API authentication

    For production, use TLS client authentication with strong certificates.

    Authorization

    etcd v3 authorization is role-based (RBAC): users are assigned roles, and roles grant read or write permissions on individual keys or key ranges. When TLS client-certificate authentication is enabled, the certificate's CN (common name) identifies the etcd user.

    Enable authentication and RBAC for production clusters.

    Strong Passwords and Backups

    If using basic authentication, use strong, unique passwords for each user. Regularly backup etcd data and test your restore procedures.

    Monitoring etcd Health

    Monitoring etcd is critical for cluster health.

    Health Endpoint

    etcd provides a health endpoint that checks the health of the cluster:

    curl -k https://localhost:2379/health

    A healthy response looks like:

    {"health":"true"}

    Metrics

    etcd exposes Prometheus-compatible metrics at the /metrics path on the client port (2379 by default):

    curl -k https://localhost:2379/metrics

    Key metrics to monitor:

    • etcd_server_has_leader: Whether a leader exists
    • etcd_server_leader_changes_seen_total: Number of leader changes
    • etcd_server_proposals_committed_total: Number of committed proposals
    • etcd_server_proposals_failed_total: Number of failed proposals
    • etcd_mvcc_db_total_size_in_bytes: Size of the etcd database

    Common Issues

    1. Leader election loop: Frequent leader changes indicate instability, often caused by slow disks or an unreliable network
    2. High proposal latency: Slow writes indicate network or disk issues
    3. High disk usage: etcd database growth can cause performance degradation

    Backup and Recovery

    Regular backups are essential for disaster recovery.

    Snapshotting

    etcd provides a snapshot command to create a backup:

    ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
      --cacert=/etc/etcd/ssl/ca.crt \
      --cert=/etc/etcd/ssl/server.crt \
      --key=/etc/etcd/ssl/server.key \
      --endpoints=https://127.0.0.1:2379

    Backup Strategy

    A good backup strategy includes:

    1. Regular snapshots: Take snapshots daily or hourly
    2. Off-site storage: Store snapshots in a separate location
    3. Versioning: Keep multiple snapshots with different retention periods
    4. Testing: Regularly test restore procedures

    Restoring from Backup

    To restore from a snapshot:

    # Stop etcd
    systemctl stop etcd
     
    # Remove existing data
    rm -rf /var/lib/etcd/*
     
    # Restore from snapshot (an offline operation, so no TLS flags are needed;
    # for a multi-node cluster, also pass --name, --initial-cluster,
    # --initial-cluster-token, and --initial-advertise-peer-urls)
    ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
      --data-dir=/var/lib/etcd
     
    # Start etcd
    systemctl start etcd

    Scaling etcd

    As your cluster grows, etcd may need to scale.

    Adding Nodes

    To add a new node to an etcd cluster:

    1. Register the new member with etcdctl member add <name> --peer-urls=https://<new-node>:2380
    2. Install and configure etcd on the new node with --initial-cluster-state existing and the updated member list
    3. Start the etcd service

    The new node will join the cluster and begin replicating data.

    Quorum Considerations

    When adding nodes, be aware of quorum requirements:

    • 3-node cluster: 2/3 majority
    • 5-node cluster: 3/5 majority
    • 7-node cluster: 4/7 majority

    Never reduce the number of nodes below the quorum threshold, or you'll lose the ability to write to etcd.

    Performance Tuning

    For large clusters, consider:

    1. Disk I/O: Use SSDs for etcd data
    2. Memory: Allocate sufficient memory for etcd
    3. Network: Ensure low-latency network between etcd nodes
    4. Compaction: Regularly compact the etcd database to reclaim space

    etcd Alternatives

    While etcd is the default for Kubernetes, other distributed key-value stores exist:

    • etcd: Kubernetes-native, Raft consensus, and mature, but complex to manage
    • Consul: Offers service discovery, health checking, and a KV store, but is not optimized for Kubernetes
    • ZooKeeper: Mature with strong consistency, but has a complex configuration

    For Kubernetes, etcd remains the best choice due to its tight integration and proven track record.

    Practical Example: Monitoring etcd with Prometheus

    Here's how to set up etcd monitoring with Prometheus:

    # prometheus.yml
    scrape_configs:
      - job_name: 'etcd'
        scheme: https
        tls_config:
          ca_file: /etc/prometheus/ssl/etcd-ca.crt
          cert_file: /etc/prometheus/ssl/etcd-client.crt
          key_file: /etc/prometheus/ssl/etcd-client.key
        static_configs:
          - targets: ['etcd-1:2379', 'etcd-2:2379', 'etcd-3:2379']

    Create an alert for leader changes:

    # alert_rules.yml
    groups:
      - name: etcd
        rules:
          - alert: EtcdLeaderChange
            expr: rate(etcd_server_leader_changes_seen_total[5m]) > 0
            for: 5m
            annotations:
              summary: "etcd leader has changed"
              description: "etcd leader has changed {{ $value }} times in the last 5 minutes"
     
          - alert: EtcdNoLeader
            expr: etcd_server_has_leader == 0
            for: 1m
            annotations:
              summary: "etcd has no leader"
              description: "etcd cluster has no leader for more than 1 minute"

    Conclusion

    etcd is the foundation of Kubernetes. It stores all cluster state, provides consistency across nodes, and enables the controller loop to maintain the desired state. Understanding etcd's architecture, securing it properly, monitoring its health, and maintaining regular backups are essential skills for any Kubernetes operator.

    The next time you experience issues with your Kubernetes cluster, check etcd first. It's almost always the root cause.

    If you're managing a Kubernetes cluster, consider using a platform like ServerlessBase that handles etcd management, backup, and monitoring automatically, so you can focus on your applications rather than infrastructure maintenance.
