Kubernetes etcd: The Distributed Key-Value Store
You've deployed a Kubernetes cluster, configured your workloads, and everything seems to be working. Then you try to access the API server, and it's unresponsive. Or worse, you try to scale your deployment, and nothing happens. The root cause is almost always etcd—the distributed key-value store that Kubernetes uses to store all its state.
etcd is the backbone of Kubernetes. Without it, you don't have a cluster. Understanding how etcd works, how to secure it, and how to keep it healthy is essential for anyone running production Kubernetes clusters.
What is etcd?
etcd is a distributed, consistent key-value store designed for reliably storing the critical data of distributed systems. It was created at CoreOS in 2013 to coordinate configuration across Container Linux machines, and Kubernetes later adopted it as the backend for all cluster state.
Think of etcd as a database with different priorities. Traditional databases optimize for transactional throughput and complex queries. etcd is designed around the three properties of the CAP theorem:
- Consistency: All nodes see the same data at the same time
- Availability: The system keeps responding to requests
- Partition tolerance: The system continues to function despite network failures
The CAP theorem says a distributed store cannot fully guarantee all three at once, and etcd chooses consistency and partition tolerance: if a network partition costs the cluster its quorum, etcd stops accepting writes rather than risk serving divergent data.
How etcd Stores Kubernetes State
Kubernetes stores everything in etcd as key-value pairs, all under the /registry prefix.
Every resource in Kubernetes—deployments, services, secrets, configmaps, nodes, and more—exists as a key in etcd. When you create a deployment, Kubernetes writes the deployment object to etcd. When you scale it, Kubernetes updates the replicas count in etcd. When a pod fails, Kubernetes updates the pod's status in etcd.
etcd Architecture
etcd is built on several core concepts that make it reliable and consistent.
Raft Consensus Algorithm
etcd uses the Raft consensus algorithm to maintain consistency across multiple nodes. Raft ensures that all nodes agree on the same state, even in the presence of failures.
Raft assigns each node one of three roles:
- Leader: Accepts client writes and replicates them to followers
- Follower: Passively replicates entries from the leader
- Candidate: A follower that has stopped hearing from the leader and is requesting votes to become the new one
When a client writes to etcd, the leader:
- Appends the entry to its log
- Sends the entry to all followers
- Waits for a majority (quorum) to acknowledge
- Once acknowledged, commits the entry and responds to the client
This process guarantees that no data is lost and all nodes eventually converge to the same state.
Quorum and Majority
Quorum is the minimum number of nodes that must acknowledge a write for it to be considered committed: floor(n/2) + 1 for an n-node cluster. In a 3-node etcd cluster, 2 of the 3 nodes must agree.
A 5-node cluster needs 3 of 5 to agree, so it can tolerate 2 node failures without losing the ability to commit writes. This is also why etcd clusters use an odd number of members: a 4-node cluster has a quorum of 3 and tolerates only 1 failure, no better than a 3-node cluster.
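The arithmetic is simple enough to sketch in a few lines of shell (a toy illustration, not an etcd tool):

```shell
# Quorum for an n-member cluster is floor(n/2) + 1;
# the cluster survives n - quorum member failures.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 3 5 7; do
  echo "$n members: quorum $(quorum "$n"), tolerates $(tolerated "$n") failure(s)"
done
```

Note how even member counts buy nothing: `tolerated 4` is 1, the same as `tolerated 3`, while adding replication overhead.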
Leader Election and Failover
When the leader fails, followers wait out an election timeout and then vote for a new leader. An election typically completes within a second or two. During this window the cluster cannot accept writes; linearizable reads (the default) are also blocked, though serializable reads can still be served by followers.
Watch and Lease Mechanisms
etcd supports two important features:
- Watch: Clients can subscribe to changes on specific keys or key prefixes. When a key is created, modified, or deleted, the watcher receives an event. This is how Kubernetes controllers detect changes and react to them.
- Lease: A key can be attached to a lease that expires after a specified time-to-live unless the client keeps renewing it. Kubernetes uses leases for ephemeral state such as leader-election locks and component heartbeats.
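Both mechanisms can be exercised directly with etcdctl (assuming a reachable cluster and the v3 API, the default since etcd 3.4; `<lease-id>` is a placeholder for the ID printed by the grant command):

```shell
# Watch every key under the Kubernetes registry prefix; each
# create, update, or delete is streamed back as an event.
etcdctl watch /registry/ --prefix

# Grant a 60-second lease, then attach a key to it; the key is
# deleted automatically when the lease expires without renewal.
etcdctl lease grant 60
etcdctl put --lease=<lease-id> /demo/ephemeral "value"
etcdctl lease timetolive <lease-id>
```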
etcd Data Model
etcd stores data as key-value pairs with a hierarchical structure. Keys are strings, and values are arbitrary byte arrays.
Key Structure
etcd uses a hierarchical key structure with namespaces:
For example:
```
/registry/deployments/default/my-app
/registry/pods/default/my-app-abc123
/registry/nodes/k8s-node-1
```
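You can browse these keys yourself with etcdctl (assuming network access to the client port and valid certificates; note that Kubernetes stores values as protobuf, so the keys are far more readable than the values):

```shell
# List keys under the registry prefix without printing values.
etcdctl get /registry --prefix --keys-only | head -20

# Fetch a single object; the value will be protobuf-encoded.
etcdctl get /registry/deployments/default/my-app
```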
Versioning and Revision
Every key in etcd carries version metadata: each update increments the key's version, and etcd's MVCC storage retains earlier revisions (until they are compacted), so clients can read a key as it existed at a past point in time.
etcd also maintains a global revision counter that increments with every change to the store. Watches and incremental backups are anchored to this revision.
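Both counters are visible on any read (again assuming etcdctl access; the key name and revision number are illustrative):

```shell
# JSON output includes the cluster-wide revision in the header,
# plus the key's create_revision, mod_revision, and version.
etcdctl get /registry/nodes/k8s-node-1 -w json

# Read the key as it existed at an earlier revision,
# provided that revision has not been compacted away.
etcdctl get /registry/nodes/k8s-node-1 --rev=100
```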
etcd in Kubernetes
Kubernetes uses etcd in several critical ways.
API Server as the Frontend
The Kubernetes API server is the only component that directly interacts with etcd. All other components read from or write to the API server, which then updates etcd.
This separation provides several benefits:
- The API server can cache frequently accessed data
- The API server can implement authorization and admission control
- The API server can provide a consistent interface regardless of the backend
Controller Loop
Kubernetes controllers run in a loop that watches for changes (via the API server's watch interface, which is backed by etcd watches) and reconciles the desired state with the actual state.
Scheduler and Kubelet
The scheduler watches the API server for pods that have no node assigned and writes its placement decisions back through the API server; the kubelet likewise watches the API server for pods bound to its node and ensures they are running. Neither component reads etcd directly.
Securing etcd
etcd is a critical component, and securing it is essential.
TLS Encryption
etcd should use TLS for all communication. You need to configure three kinds of certificates:
- Server certificates: Presented by each etcd node so clients can verify the server
- Client certificates: Presented by the API server (and etcdctl) to authenticate to etcd
- Peer certificates: Used by etcd nodes to authenticate to each other on the peer port
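Wired together, a member's startup flags look roughly like this (all paths are illustrative):

```shell
etcd --name node-1 \
  --cert-file=/etc/etcd/pki/server.crt \
  --key-file=/etc/etcd/pki/server.key \
  --trusted-ca-file=/etc/etcd/pki/ca.crt \
  --client-cert-auth \
  --peer-cert-file=/etc/etcd/pki/peer.crt \
  --peer-key-file=/etc/etcd/pki/peer.key \
  --peer-trusted-ca-file=/etc/etcd/pki/ca.crt \
  --peer-client-cert-auth
```

The --client-cert-auth and --peer-client-cert-auth flags reject any client or peer that cannot present a certificate signed by the trusted CA.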
Authentication
etcd supports multiple authentication methods:
- Password authentication: Usernames and passwords managed in etcd's built-in user store
- TLS client certificates: The certificate's common name (CN) identifies the etcd user
- Auth tokens: Once authenticated, clients receive a token generated by either the simple or JWT token provider
For production, use TLS client authentication with strong certificates.
Authorization
etcd's authorization is role-based (RBAC): roles grant read and/or write permission on individual keys or key ranges, and users are granted one or more roles. With TLS client certificate authentication, the certificate's common name (CN) selects the user whose roles apply.
Enable authentication and define least-privilege roles for production clusters.
Strong Passwords and Backups
If using password authentication, use strong, unique passwords for each user. Regularly back up etcd data and test your restore procedures.
Monitoring etcd Health
Monitoring etcd is critical for cluster health.
Health Endpoint
etcd exposes an HTTP /health endpoint on its client port. A healthy member responds with a JSON body reporting "health": "true".
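A quick check against a single member (certificate paths are illustrative):

```shell
# The HTTP endpoint; a healthy node reports "health": "true".
curl --cacert /etc/etcd/pki/ca.crt \
     --cert /etc/etcd/pki/client.crt \
     --key /etc/etcd/pki/client.key \
     https://127.0.0.1:2379/health

# Or ask etcdctl to probe every member at once.
etcdctl endpoint health --cluster
```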
Metrics
etcd exposes Prometheus-compatible metrics at the /metrics path, served on the client port (2379 by default) or on a dedicated port when --listen-metrics-urls is configured.
Key metrics to monitor:
- etcd_server_has_leader: Whether this member currently sees a leader
- etcd_server_leader_changes_seen_total: Number of leader changes
- etcd_server_proposals_committed_total: Number of committed proposals
- etcd_server_proposals_failed_total: Number of failed proposals
- etcd_mvcc_db_total_size_in_bytes: Size of the etcd database
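A quick spot-check from a node (this assumes a plaintext metrics port configured via --listen-metrics-urls; otherwise scrape the TLS client port with certificates):

```shell
curl -s http://127.0.0.1:2381/metrics \
  | grep -E 'etcd_server_has_leader|etcd_server_leader_changes_seen_total'
```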
Common Issues
- Leader election loop: Frequent leader changes indicate instability
- High proposal latency: Slow writes indicate network or disk issues
- High disk usage: etcd database growth can cause performance degradation
Backup and Recovery
Regular backups are essential for disaster recovery.
Snapshotting
etcdctl provides a snapshot save command that captures a point-in-time backup of the entire keyspace.
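A minimal snapshot run, with the usual TLS flags (all paths are illustrative):

```shell
SNAP=/backup/etcd-$(date +%Y%m%d-%H%M%S).db

# Save a point-in-time snapshot of the full keyspace.
etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/client.crt \
  --key=/etc/etcd/pki/client.key

# Verify the snapshot's hash, revision, and total key count.
etcdctl snapshot status "$SNAP" -w table
```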
Backup Strategy
A good backup strategy includes:
- Regular snapshots: Take snapshots daily or hourly
- Off-site storage: Store snapshots in a separate location
- Versioning: Keep multiple snapshots with different retention periods
- Testing: Regularly test restore procedures
Restoring from Backup
Restoring does not overwrite a live data directory; instead, the snapshot is materialized into a fresh data directory that etcd is then started from.
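A single-member restore looks roughly like this (member name, peer URL, and paths are illustrative; in a multi-member cluster you repeat this per member with the full --initial-cluster list):

```shell
# Materialize a fresh data directory from the snapshot.
etcdctl snapshot restore /backup/etcd-20240101.db \
  --name node-1 \
  --initial-cluster node-1=https://10.0.0.1:2380 \
  --initial-advertise-peer-urls https://10.0.0.1:2380 \
  --data-dir /var/lib/etcd-restored

# Then start etcd with --data-dir=/var/lib/etcd-restored
# (plus its usual TLS and clustering flags).
```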
Scaling etcd
As your cluster grows, etcd may need to scale.
Adding Nodes
To add a new node to an etcd cluster:
- Install etcd on the new node
- Configure it to connect to the existing cluster
- Start the etcd service
The new node will join the cluster and begin replicating data.
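The join is a two-step process (member names and URLs are illustrative):

```shell
# 1. Announce the new member to the existing cluster.
etcdctl member add node-4 --peer-urls=https://10.0.0.4:2380

# 2. Start etcd on the new node with the settings the command
#    prints back; the key flag is --initial-cluster-state=existing.
etcd --name node-4 \
  --initial-cluster-state existing \
  --initial-cluster node-1=https://10.0.0.1:2380,node-2=https://10.0.0.2:2380,node-3=https://10.0.0.3:2380,node-4=https://10.0.0.4:2380 \
  --initial-advertise-peer-urls https://10.0.0.4:2380
```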
Quorum Considerations
When adding nodes, be aware of quorum requirements:
- 3-node cluster: quorum of 2, tolerates 1 failure
- 5-node cluster: quorum of 3, tolerates 2 failures
- 7-node cluster: quorum of 4, tolerates 3 failures
If the number of healthy members falls below quorum, the cluster loses the ability to accept writes until quorum is restored.
Performance Tuning
For large clusters, consider:
- Disk I/O: Use SSDs for etcd data
- Memory: Allocate sufficient memory for etcd
- Network: Ensure low-latency network between etcd nodes
- Compaction: Regularly compact the etcd database to reclaim space
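Compaction and defragmentation are separate operations (note that the Kubernetes API server normally requests compaction itself every five minutes, so manual runs mainly matter when that is disabled):

```shell
# Compact history up to the current revision...
REV=$(etcdctl endpoint status -w json \
      | sed -E 's/.*"revision":([0-9]+).*/\1/')
etcdctl compact "$REV"

# ...then defragment every member to hand freed pages back to the OS.
etcdctl defrag --cluster
```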
etcd Alternatives
While etcd is the default for Kubernetes, other distributed key-value stores exist:
| Store | Pros | Cons |
|---|---|---|
| etcd | Kubernetes-native, Raft consensus, mature | Complex to manage |
| Consul | Service discovery, health checking, KV store | Not optimized for Kubernetes |
| Zookeeper | Mature, strong consistency | Complex configuration |
For Kubernetes, etcd remains the best choice due to its tight integration and proven track record.
Practical Example: Monitoring etcd with Prometheus
Prometheus can scrape etcd's metrics endpoint directly, and an alert on leader churn is one of the highest-signal checks you can add.
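A sketch of both pieces (job names, file paths, targets, and thresholds are all illustrative):

```yaml
# prometheus.yml — scrape each etcd member over TLS
scrape_configs:
  - job_name: etcd
    scheme: https
    tls_config:
      ca_file: /etc/prometheus/etcd-ca.crt
      cert_file: /etc/prometheus/etcd-client.crt
      key_file: /etc/prometheus/etcd-client.key
    static_configs:
      - targets: ['10.0.0.1:2379', '10.0.0.2:2379', '10.0.0.3:2379']

# alerts.yml — fire when leadership is churning
groups:
  - name: etcd
    rules:
      - alert: EtcdFrequentLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd changed leader more than 3 times in the past hour"
```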
Conclusion
etcd is the foundation of Kubernetes. It stores all cluster state, provides consistency across nodes, and enables the controller loop to maintain the desired state. Understanding etcd's architecture, securing it properly, monitoring its health, and maintaining regular backups are essential skills for any Kubernetes operator.
The next time you experience issues with your Kubernetes cluster, check etcd first. It's almost always the root cause.
If you're managing a Kubernetes cluster, consider using a platform like ServerlessBase that handles etcd management, backup, and monitoring automatically, so you can focus on your applications rather than infrastructure maintenance.