ServerlessBase Blog

    A comprehensive guide to understanding etcd, the distributed key-value store that powers Kubernetes' configuration and state management.

    Kubernetes etcd: The Distributed Key-Value Store

    You've deployed a Kubernetes cluster, configured your workloads, and everything seems to be working. Then you try to access the API server, and it's unresponsive. Or worse, you try to scale your deployment, and nothing happens. The root cause is almost always etcd—the distributed key-value store that Kubernetes uses to store all its state.

    etcd is the backbone of Kubernetes. Without it, you don't have a cluster. Understanding how etcd works, how to secure it, and how to keep it healthy is essential for anyone running production Kubernetes clusters.

    What is etcd?

    etcd is a distributed, consistent key-value store designed for reliable, distributed systems. It was created by CoreOS in 2013 to coordinate its Container Linux clusters and was later adopted by Kubernetes as the backend for storing cluster state.

    Think of etcd as a database, but with different priorities. Traditional databases optimize for transactional throughput and complex queries. etcd optimizes for:

    • Consistency: Every read reflects the most recent committed write
    • Partition tolerance: The cluster keeps functioning despite network failures between nodes
    • Durability: Committed writes are replicated so they survive node failures

    In terms of the CAP theorem, etcd is a CP system: it chooses consistency and partition tolerance at the expense of availability, so a minority side of a network partition refuses writes rather than serve inconsistent data.

    How etcd Stores Kubernetes State

    Kubernetes stores everything in etcd as key-value pairs. Here are some examples:

    # A deployment configuration
    /registry/deployments/default/my-app
    {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "metadata": {
        "name": "my-app",
        "namespace": "default"
      },
      "spec": {
        "replicas": 3,
        "selector": {
          "matchLabels": {
            "app": "my-app"
          }
        }
      }
    }
     
    # A pod status
    /registry/pods/default/my-app-abc123
    {
      "apiVersion": "v1",
      "kind": "Pod",
      "status": {
        "phase": "Running",
        "podIP": "10.244.1.5",
        "containerStatuses": [
          {
            "name": "my-app",
            "ready": true,
            "state": {
              "running": {}
            }
          }
        ]
      }
    }

    Every resource in Kubernetes—deployments, services, secrets, configmaps, nodes, and more—exists as a key in etcd. When you create a deployment, Kubernetes writes the deployment object to etcd. When you scale it, Kubernetes updates the replicas count in etcd. When a pod fails, Kubernetes updates the pod's status in etcd.

    etcd Architecture

    etcd is built on several core concepts that make it reliable and consistent.

    Raft Consensus Algorithm

    etcd uses the Raft consensus algorithm to maintain consistency across multiple nodes. Raft ensures that all nodes agree on the same state, even in the presence of failures.

    Raft nodes operate in one of three roles:

    1. Leader: Accepts client writes and replicates them to followers
    2. Follower: Passively replicates entries from the leader
    3. Candidate: A follower that has timed out waiting for the leader and is requesting votes to become the new leader

    When a client writes to etcd, the leader:

    1. Appends the entry to its log
    2. Sends the entry to all followers
    3. Waits for a majority (quorum) to acknowledge
    4. Once acknowledged, commits the entry and responds to the client

    This process guarantees that no data is lost and all nodes eventually converge to the same state.
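The commit rule in step 3 can be sketched in a few lines. This is a toy model, not real Raft (no terms, no log indices, no retries): the leader counts its own copy plus follower acknowledgements and commits once a majority of the cluster has the entry.

```go
package main

import "fmt"

// ackAndCommit is a toy model of the Raft commit rule: an entry is
// committed once a majority of the cluster (leader included) has
// appended it. Real Raft also tracks terms and log-matching checks.
func ackAndCommit(clusterSize int, followerAcks []bool) bool {
	acks := 1 // the leader always has its own copy of the entry
	for _, ok := range followerAcks {
		if ok {
			acks++
		}
	}
	quorum := clusterSize/2 + 1
	return acks >= quorum
}

func main() {
	// 3-node cluster: one follower ack plus the leader reaches quorum (2/3).
	fmt.Println(ackAndCommit(3, []bool{true, false})) // true
	// 5-node cluster: one follower ack (2 copies total) is below quorum (3/5).
	fmt.Println(ackAndCommit(5, []bool{true, false, false, false})) // false
}
```

Note that the leader never needs every follower to respond, which is why a minority of slow or dead nodes does not block writes.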

    Quorum and Majority

    Quorum is the minimum number of nodes that must agree on a write for it to be considered committed. In a 3-node etcd cluster, you need 2 out of 3 nodes to agree (2/3 majority).

    If you have 5 nodes, you need 3 out of 5 (3/5 majority). This provides fault tolerance: the cluster can tolerate up to 2 node failures (2 out of 5) and still maintain quorum, so writes keep succeeding and no committed data is lost.
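The quorum arithmetic is simple enough to capture directly (an illustration; the function names are made up, not etcd APIs):

```go
package main

import "fmt"

// For an n-member cluster, a write needs floor(n/2)+1 acknowledgements,
// so the cluster tolerates n - quorum(n) member failures and stays writable.
func quorum(n int) int { return n/2 + 1 }

func tolerableFailures(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d nodes: quorum %d, tolerates %d failures\n",
			n, quorum(n), tolerableFailures(n))
	}
}
```

This is also why even cluster sizes buy nothing: a 4-node cluster still needs 3 acknowledgements and tolerates only 1 failure, the same as a 3-node cluster.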

    Leader Election and Failover

    When the leader fails, followers time out, become candidates, and vote to elect a new leader. With default timeouts this typically completes within a few seconds. During the election the cluster is unavailable for writes (and for linearizable reads, which are the default); followers can still serve serializable reads, which may be slightly stale.

    Watch and Lease Mechanisms

    etcd supports two important features:

    • Watch: Clients can subscribe to changes in specific keys. When a key is modified, deleted, or created, the watcher receives an event. This is how Kubernetes controllers detect changes and react to them.

    • Lease: Keys can be attached to a lease that expires after a TTL unless the client keeps it alive with heartbeats. This is the building block for ephemeral keys, such as those used for leader election.
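Here is a minimal in-memory sketch of the lease idea, with hypothetical names (`Store`, `Grant`, `Tick`) and an explicit clock instead of real time. Actual etcd leases are granted by the server and kept alive with client KeepAlive heartbeats.

```go
package main

import "fmt"

// Lease is a toy model: keys attached to a lease expire together when
// the lease's TTL runs out without being renewed.
type Lease struct {
	ExpiresAt int64 // unix seconds
	Keys      []string
}

type Store struct {
	kv     map[string]string
	leases map[int64]*Lease
}

func NewStore() *Store {
	return &Store{kv: map[string]string{}, leases: map[int64]*Lease{}}
}

// Grant creates a lease with the given TTL, starting at time `now`.
func (s *Store) Grant(id, ttl, now int64) {
	s.leases[id] = &Lease{ExpiresAt: now + ttl}
}

// PutWithLease stores a key and attaches it to an existing lease.
func (s *Store) PutWithLease(key, val string, leaseID int64) {
	s.kv[key] = val
	l := s.leases[leaseID]
	l.Keys = append(l.Keys, key)
}

// Tick removes every lease that has expired, deleting its attached keys.
func (s *Store) Tick(now int64) {
	for id, l := range s.leases {
		if now >= l.ExpiresAt {
			for _, k := range l.Keys {
				delete(s.kv, k)
			}
			delete(s.leases, id)
		}
	}
}

func main() {
	s := NewStore()
	s.Grant(1, 10, 100) // lease 1, 10s TTL, granted at t=100
	s.PutWithLease("/election/leader", "node-1", 1)
	s.Tick(105)
	fmt.Println(s.kv["/election/leader"]) // node-1 (lease still live)
	s.Tick(110)
	_, ok := s.kv["/election/leader"]
	fmt.Println(ok) // false (lease expired, key removed)
}
```

A leader-election lock built this way releases itself automatically: if the holder crashes and stops renewing, the key disappears when the lease expires.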

    etcd Data Model

    etcd stores data as key-value pairs with a hierarchical structure. Keys are strings, and values are arbitrary byte arrays.

    Key Structure

    etcd uses a hierarchical key structure with namespaces:

    /registry/<resource-type>/<namespace>/<name>

    For example:

    • /registry/deployments/default/my-app
    • /registry/pods/default/my-app-abc123
    • /registry/nodes/k8s-node-1
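The key pattern can be captured in a small helper (a sketch for illustration; Kubernetes builds these keys inside the API server's storage layer, and cluster-scoped resources like nodes simply omit the namespace segment):

```go
package main

import (
	"fmt"
	"strings"
)

// registryKey builds the etcd key for a Kubernetes resource.
// Cluster-scoped resources (e.g. nodes) pass an empty namespace.
func registryKey(resource, namespace, name string) string {
	parts := []string{"", "registry", resource}
	if namespace != "" {
		parts = append(parts, namespace)
	}
	parts = append(parts, name)
	return strings.Join(parts, "/")
}

func main() {
	fmt.Println(registryKey("deployments", "default", "my-app"))
	// /registry/deployments/default/my-app
	fmt.Println(registryKey("nodes", "", "k8s-node-1"))
	// /registry/nodes/k8s-node-1
}
```

Because the structure is purely prefix-based, a single etcd range read over `/registry/pods/default/` returns every pod in that namespace.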

    Versioning and Revision

    Every key in etcd has a version number that increments on each update and resets when the key is deleted and recreated. Because etcd's storage is multi-version (MVCC), clients can read a key's historical values at older revisions until those revisions are compacted.

    etcd also maintains a global revision counter that increments with every change. Watches resume from a revision, and it is also useful for snapshotting and incremental backups.
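The two counters can be modeled in a toy sketch (not etcd's actual storage engine): every write bumps the global revision, each key tracks its own version, and the version restarts at 1 when a deleted key is recreated, matching etcd's documented behavior.

```go
package main

import "fmt"

// MVCC is a toy model of etcd's two counters: a store-wide revision
// that increments on every change, and a per-key version.
type MVCC struct {
	revision int64
	versions map[string]int64
}

// Put records a write, returning the new global revision and key version.
func (m *MVCC) Put(key string) (rev, version int64) {
	m.revision++
	m.versions[key]++
	return m.revision, m.versions[key]
}

// Delete removes the key; a later re-create starts again at version 1.
func (m *MVCC) Delete(key string) int64 {
	m.revision++
	delete(m.versions, key)
	return m.revision
}

func main() {
	m := &MVCC{versions: map[string]int64{}}
	key := "/registry/pods/default/my-app-abc123"
	rev, ver := m.Put(key)
	fmt.Println(rev, ver) // 1 1
	rev, ver = m.Put(key)
	fmt.Println(rev, ver) // 2 2
	m.Delete(key)         // revision advances to 3
	rev, ver = m.Put(key)
	fmt.Println(rev, ver) // 4 1
}
```

The asymmetry is the point: revision is a cluster-wide logical clock, while version only tells you how many times this incarnation of the key has been written.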

    etcd in Kubernetes

    Kubernetes uses etcd in several critical ways.

    API Server as the Frontend

    The Kubernetes API server is the only component that directly interacts with etcd. All other components read from or write to the API server, which then updates etcd.

    This separation provides several benefits:

    • The API server can cache frequently accessed data
    • The API server can implement authorization and admission control
    • The API server can provide a consistent interface regardless of the backend

    Controller Loop

    Kubernetes controllers run in a loop that watches for changes (through the API server, whose watch API is backed by etcd watches) and reconciles the desired state with the actual state.

    // Pseudocode for a controller
    for {
        // Watch for changes to deployments (served by the API server,
        // which in turn watches etcd)
        events := watch.Deployments()

        for _, event := range events {
            deployment := event.Object

            // Get the pods currently backing this deployment
            pods := getPods(deployment.Spec.Selector)

            // Reconcile: ensure desired state
            if len(pods) != deployment.Spec.Replicas {
                scaleDeployment(deployment)
            }
        }
    }

    Scheduler and Kubelet

    The scheduler reads pending pods and node information through the API server to make placement decisions. The kubelet watches the API server for pods assigned to its node and ensures they are running. Neither component talks to etcd directly.

    Securing etcd

    etcd is a critical component, and securing it is essential.

    TLS Encryption

    etcd supports TLS for all communication, and you should enable it everywhere. You need three kinds of certificates:

    1. Server certificates: For etcd to authenticate itself to clients
    2. Client certificates: For API servers (and etcdctl) to authenticate to etcd
    3. Peer certificates: For etcd nodes to authenticate to each other

    # Generate a self-signed CA for etcd
    openssl genrsa -out etcd-ca.key 4096
    openssl req -new -x509 -key etcd-ca.key -out etcd-ca.crt -days 365 -subj "/CN=etcd-ca"
     
    # Generate a server key and a certificate signed by that CA
    openssl genrsa -out etcd-server.key 4096
    openssl req -new -key etcd-server.key -out etcd-server.csr -subj "/CN=etcd-server"
    openssl x509 -req -in etcd-server.csr -CA etcd-ca.crt -CAkey etcd-ca.key -CAcreateserial -out etcd-server.crt -days 365

    In practice the server certificate also needs subjectAltName entries for each node's hostnames and IPs, and you repeat the signing step for the client and peer certificates.

    Authentication

    etcd supports multiple authentication methods:

    1. Basic authentication: Username/password pairs
    2. TLS client authentication: Certificates for client authentication
    3. JWT tokens: JSON Web Tokens for API authentication

    For production, use TLS client authentication with strong certificates.

    Authorization

    etcd v3 authorization is role-based (RBAC): users are assigned roles, and roles grant read or write permissions on individual keys or key ranges. When TLS client-certificate authentication is enabled, the certificate's CN (common name) identifies the etcd user.

    Enable authentication and RBAC for production clusters.

    Strong Passwords and Backups

    If using basic authentication, use strong, unique passwords for each user. Regularly backup etcd data and test your restore procedures.

    Monitoring etcd Health

    Monitoring etcd is critical for cluster health.

    Health Endpoint

    etcd provides a health endpoint that checks the health of the cluster:

    curl -k https://localhost:2379/health

    A healthy response looks like:

    {"health":"true"}

    Metrics

    etcd exposes Prometheus-compatible metrics at the /metrics path on the client port (2379 by default):

    curl -k https://localhost:2379/metrics

    Key metrics to monitor:

    • etcd_server_has_leader: Whether a leader exists
    • etcd_server_leader_changes_seen_total: Number of leader changes
    • etcd_server_proposals_committed_total: Number of committed proposals
    • etcd_server_proposals_failed_total: Number of failed proposals
    • etcd_mvcc_db_total_size_in_bytes: Size of the etcd database

    Common Issues

    1. Leader election loop: Frequent leader changes indicate instability, often caused by slow disks or an unreliable network
    2. High proposal latency: Slow writes indicate network or disk issues
    3. High disk usage: etcd database growth can cause performance degradation

    Backup and Recovery

    Regular backups are essential for disaster recovery.

    Snapshotting

    etcd provides a snapshot command to create a backup:

    ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
      --cacert=/etc/etcd/ssl/ca.crt \
      --cert=/etc/etcd/ssl/server.crt \
      --key=/etc/etcd/ssl/server.key \
      --endpoints=https://127.0.0.1:2379

    Backup Strategy

    A good backup strategy includes:

    1. Regular snapshots: Take snapshots daily or hourly
    2. Off-site storage: Store snapshots in a separate location
    3. Versioning: Keep multiple snapshots with different retention periods
    4. Testing: Regularly test restore procedures

    Restoring from Backup

    To restore from a snapshot:

    # Stop etcd
    systemctl stop etcd
     
    # Remove existing data
    rm -rf /var/lib/etcd/*
     
    # Restore from snapshot (an offline operation, so no TLS flags are needed;
    # for a multi-node cluster, also pass --name, --initial-cluster,
    # --initial-cluster-token, and --initial-advertise-peer-urls)
    ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
      --data-dir=/var/lib/etcd
     
    # Start etcd
    systemctl start etcd

    Scaling etcd

    As your cluster grows, etcd may need to scale.

    Adding Nodes

    To add a new node to an etcd cluster:

    1. Register the new member with etcdctl member add <name> --peer-urls=https://<new-node>:2380
    2. Install and configure etcd on the new node with --initial-cluster-state existing and the updated member list
    3. Start the etcd service

    The new node will join the cluster and begin replicating data.

    Quorum Considerations

    When adding nodes, be aware of quorum requirements:

    • 3-node cluster: 2/3 majority
    • 5-node cluster: 3/5 majority
    • 7-node cluster: 4/7 majority

    Never reduce the number of nodes below the quorum threshold, or you'll lose the ability to write to etcd.

    Performance Tuning

    For large clusters, consider:

    1. Disk I/O: Use SSDs for etcd data
    2. Memory: Allocate sufficient memory for etcd
    3. Network: Ensure low-latency network between etcd nodes
    4. Compaction: Regularly compact the etcd database to reclaim space

    etcd Alternatives

    While etcd is the default for Kubernetes, other distributed key-value stores exist:

    • etcd: Kubernetes-native, Raft consensus, and mature, but complex to manage
    • Consul: Offers service discovery, health checking, and a KV store, but is not optimized for Kubernetes
    • ZooKeeper: Mature with strong consistency, but has a complex configuration

    For Kubernetes, etcd remains the best choice due to its tight integration and proven track record.

    Practical Example: Monitoring etcd with Prometheus

    Here's how to set up etcd monitoring with Prometheus:

    # prometheus.yml
    scrape_configs:
      - job_name: 'etcd'
        scheme: https
        tls_config:
          ca_file: /etc/prometheus/ssl/etcd-ca.crt
          cert_file: /etc/prometheus/ssl/etcd-client.crt
          key_file: /etc/prometheus/ssl/etcd-client.key
        static_configs:
          - targets: ['etcd-1:2379', 'etcd-2:2379', 'etcd-3:2379']

    Create an alert for leader changes:

    # alert_rules.yml
    groups:
      - name: etcd
        rules:
          - alert: EtcdLeaderChange
            expr: rate(etcd_server_leader_changes_seen_total[5m]) > 0
            for: 5m
            annotations:
              summary: "etcd leader has changed"
              description: "etcd leader has changed {{ $value }} times in the last 5 minutes"
     
          - alert: EtcdNoLeader
            expr: etcd_server_has_leader == 0
            for: 1m
            annotations:
              summary: "etcd has no leader"
              description: "etcd cluster has no leader for more than 1 minute"

    Conclusion

    etcd is the foundation of Kubernetes. It stores all cluster state, provides consistency across nodes, and enables the controller loop to maintain the desired state. Understanding etcd's architecture, securing it properly, monitoring its health, and maintaining regular backups are essential skills for any Kubernetes operator.

    The next time you experience issues with your Kubernetes cluster, check etcd first. It's almost always the root cause.

    If you're managing a Kubernetes cluster, consider using a platform like ServerlessBase that handles etcd management, backup, and monitoring automatically, so you can focus on your applications rather than infrastructure maintenance.
