ServerlessBase Blog

    A comprehensive guide to safely upgrading Kubernetes clusters with minimal downtime and risk

    Introduction to Kubernetes Cluster Upgrades: Best Practices

    You've been running a Kubernetes cluster for months. Your applications are deployed, your monitoring is set up, and everything seems stable. Then you notice the release notes: Kubernetes 1.28 is out with security patches and new features. You know you should upgrade, but the thought of bringing down your production workloads keeps you up at night.

    Kubernetes cluster upgrades are one of the most critical operations you'll perform. A failed upgrade can leave your cluster in a broken state, require manual intervention, or worse—cause data loss. This guide walks you through the upgrade process, common pitfalls, and best practices that keep your clusters running smoothly.

    Understanding Upgrade Versions and Compatibility

    Kubernetes follows a strict version compatibility policy. You can only upgrade to the next minor version in the sequence. For example, if you're running 1.27, you can upgrade to 1.28, but not directly to 1.29. This is because Kubernetes maintains backward compatibility within a minor version but introduces breaking changes between minor versions.
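    This one-hop rule is easy to enforce in your upgrade tooling before any command touches the cluster. Below is a minimal bash sketch (the function name and calling convention are ours, purely illustrative) that accepts a target version only when it is exactly one minor version ahead of the current one:

```shell
#!/usr/bin/env bash
# Hypothetical helper: accept a target version only when it is exactly one
# minor version ahead of the current version (e.g. 1.27.x -> 1.28.x).
is_valid_upgrade() {
  local current="$1" target="$2"
  local cur_major cur_minor tgt_major tgt_minor
  IFS=. read -r cur_major cur_minor _ <<< "$current"
  IFS=. read -r tgt_major tgt_minor _ <<< "$target"
  [[ "$tgt_major" == "$cur_major" && "$tgt_minor" -eq $((cur_minor + 1)) ]]
}

is_valid_upgrade 1.27.4 1.28.0 && echo "1.27 -> 1.28: allowed"
is_valid_upgrade 1.27.4 1.29.0 || echo "1.27 -> 1.29: blocked (skips 1.28)"
```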

    The upgrade process involves three main pieces: the control plane, the worker nodes, and the cluster add-ons (CNI plugin, DNS, ingress controllers). The control plane manages cluster state and requires a coordinated upgrade. Worker nodes run your workloads and must be upgraded one at a time to avoid disrupting running pods.

    # Check your current cluster version
    # (the --short flag was removed in recent kubectl releases; the output is concise by default now)
    kubectl version
     
    # Verify you're on a supported upgrade path
    # Example: 1.27.x can upgrade to 1.28.x

    Pre-Upgrade Checklist

    Before you begin any upgrade, you need to verify your cluster is in a good state. This checklist prevents surprises during the upgrade process.

    Verify cluster health:

    # Check node status
    kubectl get nodes
     
    # Verify all pods are running
    kubectl get pods --all-namespaces
     
    # Check for any critical errors in the system
    kubectl get events --sort-by='.lastTimestamp'
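    These checks can be turned into a scripted go/no-go gate. The helper below is our own sketch, not a standard tool; it takes the text output of `kubectl get nodes --no-headers` as an argument, so the logic can be exercised without a live cluster:

```shell
#!/usr/bin/env bash
# Hypothetical pre-upgrade gate: succeed only if every node reports Ready.
# Pass in the output of `kubectl get nodes --no-headers`.
all_nodes_ready() {
  local nodes="$1"
  # grep -w treats "NotReady" as a different word than "Ready"
  ! grep -vqw "Ready" <<< "$nodes"
}

# Usage against a real cluster:
#   all_nodes_ready "$(kubectl get nodes --no-headers)" || { echo "abort upgrade"; exit 1; }
```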

    Review your application manifests:

    • Ensure all deployments have proper resource requests and limits
    • Verify that your applications are compatible with the new Kubernetes version
    • Check that your ingress controllers and CNI plugins support the target version
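    One concrete compatibility check is to inventory the API versions your manifests actually use and compare them against the target release's deprecation notes. A rough sketch (our own grep; purpose-built tools such as pluto or kubent do this far more thoroughly):

```shell
#!/usr/bin/env bash
# Hypothetical scan: list the distinct apiVersion values used across a
# directory of manifest files, for comparison against the release notes.
list_api_versions() {
  grep -rhoE '^apiVersion: *[^ ]+' "$1" | sort -u
}

# Usage: list_api_versions ./manifests
```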

    Backup your cluster:

    # Backup etcd (the database that stores all cluster state)
    # This command writes a point-in-time snapshot of the etcd keyspace to snapshot.db
    ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --endpoints=https://127.0.0.1:2379
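    It's worth sanity-checking the snapshot before moving on, since a corrupt backup tends to be discovered at the worst possible moment. A sketch, assuming the snapshot.db file from the command above (note that newer etcd releases move this subcommand to etcdutl):

```shell
# Verify the snapshot is readable and inspect its size, revision, and key count
# (on etcd 3.5+ prefer `etcdutl snapshot status`; etcdctl prints a deprecation warning)
ETCDCTL_API=3 etcdctl snapshot status snapshot.db --write-out=table
```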

    Test in a staging environment: Never upgrade production clusters directly. Create a staging cluster with similar workloads and test the upgrade process there first. This identifies compatibility issues before they affect your production systems.

    Control Plane Upgrade Process

    The control plane upgrade is the most critical step. It involves upgrading the API server, controller manager, scheduler, and etcd. Most Kubernetes distributions provide automated tools to handle this, but understanding the process helps you troubleshoot issues.

    For managed Kubernetes (EKS, GKE, AKS):

    # AWS EKS example - upgrade the control plane
    aws eks update-cluster-version \
      --name my-cluster \
      --kubernetes-version 1.28
     
    # Monitor the upgrade progress
    aws eks describe-cluster \
      --name my-cluster \
      --query 'cluster.status'

    For self-hosted clusters (kubeadm):

    # Upgrade the control plane components
    kubeadm upgrade apply v1.28.0
     
    # This command upgrades the API server, controller manager, and scheduler
    # It also updates the etcd database if needed

    The control plane upgrade typically takes 5-15 minutes depending on your cluster size. In HA and managed setups the components are upgraded in a rolling fashion, so API interruptions are brief; on a single control plane node the API server is unavailable while its components restart. Either way, your workloads keep serving traffic, because running pods do not depend on the API server.
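    Rather than re-running the describe command by hand, you can poll until the cluster reports the status you expect. This generic helper is our own sketch, not part of any CLI; the status command is passed as a string so the loop can be tested with a stub:

```shell
#!/usr/bin/env bash
# Hypothetical poller: run a status command repeatedly until it prints the
# wanted value or the attempts run out. delay is in seconds.
wait_for_status() {
  local cmd="$1" want="$2" tries="${3:-30}" delay="${4:-30}"
  local i
  for ((i = 0; i < tries; i++)); do
    [[ "$(eval "$cmd")" == "$want" ]] && return 0
    sleep "$delay"
  done
  return 1
}

# Usage against EKS:
#   wait_for_status "aws eks describe-cluster --name my-cluster --query cluster.status --output text" ACTIVE
```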

    Worker Node Upgrade Strategy

    Worker nodes must be upgraded one at a time to avoid disrupting running pods. The kubelet must never be newer than the API server (under the official version skew policy it may lag behind by up to a few minor versions), which is why the control plane is always upgraded first.

    Upgrade nodes sequentially:

    # Get the list of worker nodes
    kubectl get nodes
     
    # Drain a node to safely move its pods elsewhere
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
     
    # Upgrade the node's kubelet and kubeadm packages
    # (the version suffix depends on your package repository: the legacy
    # apt.kubernetes.io repo used -00, pkgs.k8s.io uses suffixes like -1.1)
    ssh <node-name>
    sudo apt-get update
    sudo apt-get install -y kubelet=1.28.0-00 kubeadm=1.28.0-00
    sudo systemctl daemon-reload
    sudo systemctl restart kubelet
     
    # Make the node schedulable again
    kubectl uncordon <node-name>

    Applying the node configuration upgrade with kubeadm:

    # Run this on each worker node, not once for the whole cluster
    kubeadm upgrade node
     
    # This command upgrades the node's local kubelet configuration and kube-proxy
    # It does not upgrade the kubelet package or the container runtime; do that separately and restart kubelet

    Upgrading one node at a time means your workloads can keep running throughout the process, provided they have multiple replicas spread across nodes. As you upgrade each node, its pods are rescheduled onto the others. Once all nodes are upgraded, your entire cluster is running the new version.
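    The drain, upgrade, uncordon sequence can be wrapped in a loop so every node is handled the same way. In this sketch (our own wrapper, with the per-node package upgrade left as a placeholder since it depends on your setup), the kubectl binary is injectable so the flow can be dry-run without a cluster:

```shell
#!/usr/bin/env bash
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical wrapper: upgrade worker nodes strictly one at a time,
# stopping at the first failure so the cluster is never half-drained.
rolling_node_upgrade() {
  local node
  for node in "$@"; do
    "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    # Placeholder: ssh in and upgrade the kubelet package here
    "$KUBECTL" uncordon "$node" || return 1
  done
}

# Dry run without a cluster:
#   KUBECTL=echo rolling_node_upgrade node-1 node-2
```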

    Handling Version Skips

    Sometimes you need to skip a minor version. For example, you might be on 1.25 and want to go directly to 1.28. This is possible but requires careful planning.

    For managed Kubernetes: Most managed services don't allow version skips. You must upgrade through each intermediate version. For example, you can only go from 1.25 to 1.26, then 1.26 to 1.27, and finally 1.27 to 1.28.

    For self-hosted clusters: kubeadm enforces the same policy and refuses to jump more than one minor version, so there is no shortcut here either:

    # Upgrade one minor version at a time; kubeadm rejects a direct jump from 1.25
    kubeadm upgrade apply v1.26.0
     
    # After each control plane hop, upgrade the kubelets and verify the cluster,
    # then repeat the process for v1.27.0 and finally v1.28.0

    Even when you must traverse several versions, rushing through them is not recommended for production clusters. Each intermediate version introduces its own changes and potential issues, and pausing to verify at each hop gives you more opportunities to catch problems early.
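    If you do script the multi-hop path, keep the loop explicit about what must happen between hops. A dry-runnable sketch (our own function; the version numbers are examples, and the kubeadm binary is injectable so the flow can be exercised offline):

```shell
#!/usr/bin/env bash
KUBEADM="${KUBEADM:-kubeadm}"

# Hypothetical helper: walk the control plane through each intermediate
# minor version in order, stopping at the first failure.
upgrade_through() {
  local version
  for version in "$@"; do
    "$KUBEADM" upgrade apply -y "$version" || return 1
    # Between hops: upgrade the kubelets on every node and verify
    # workloads before moving on (see the checklists in this guide)
  done
}

# Usage:   upgrade_through v1.26.0 v1.27.0 v1.28.0
# Dry run: KUBEADM=echo upgrade_through v1.26.0 v1.27.0 v1.28.0
```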

    Post-Upgrade Verification

    After completing the upgrade, you must verify that everything is working correctly. This step catches issues that might not be immediately apparent.

    Verify cluster components:

    # Check all control plane components are running
    kubectl get pods -n kube-system
     
    # Verify the API server is responding
    kubectl cluster-info
     
    # Check node status
    kubectl get nodes -o wide

    Verify application workloads:

    # Check that all pods are running
    kubectl get pods --all-namespaces
     
    # Verify your applications are healthy
    kubectl get deployments --all-namespaces
    kubectl get statefulsets --all-namespaces
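    A quick way to catch regressions here is to compare each deployment's ready replica count against its desired count. This sketch (our helper, not a kubectl feature) parses the text output of `kubectl get deployments -A --no-headers`, so it can be tested offline:

```shell
#!/usr/bin/env bash
# Hypothetical check: print "namespace/name" for deployments whose READY
# column (e.g. "2/3") shows fewer ready replicas than desired.
unhealthy_deployments() {
  awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1 "/" $2 }' <<< "$1"
}

# Usage: unhealthy_deployments "$(kubectl get deployments -A --no-headers)"
```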

    Check for any remaining issues:

    # Look for any warnings or errors
    kubectl get events --sort-by='.lastTimestamp' | grep -i error
     
    # Verify your ingress resources are working
    kubectl get ingress --all-namespaces

    Common Upgrade Pitfalls

    Forgetting to drain nodes: Never upgrade a node without first draining it. Draining cordons the node and gracefully evicts its pods, respecting PodDisruptionBudgets, so workloads are rescheduled elsewhere before any disruption. Skipping this step means pods are killed abruptly when the kubelet restarts or the node reboots, which can cause downtime.

    Upgrading incompatible applications: Some applications might not work with the new Kubernetes version. Always test your applications in staging before upgrading production. Check the release notes for any breaking changes.

    Ignoring resource requirements: Upgrades can temporarily increase resource usage. Ensure your nodes have enough capacity to handle the additional load during the upgrade process.

    Not backing up etcd: The etcd database contains all cluster state. Always take a backup before upgrading, especially for production clusters. If something goes wrong, you can restore from the backup.

    Rollback Strategy

    Despite your best efforts, upgrades can sometimes fail. Having a rollback plan is essential for minimizing downtime.

    For managed Kubernetes: Control plane downgrades are generally not supported. EKS, GKE, and AKS will not move a cluster back to an earlier Kubernetes version once an upgrade has completed. In practice your options are to fix forward, or to create a new cluster on the previous version and redeploy your workloads there.

    For self-hosted clusters: kubeadm has no built-in downgrade command either. Rolling back a kubeadm control plane means manually reinstalling the previous component versions and, in most cases, restoring the etcd snapshot you took before the upgrade. Treat this as a disaster recovery procedure, not a routine operation, and rehearse it in staging before you ever need it.

    Restoring from etcd backup: If the rollback doesn't resolve the issue, you might need to restore from your etcd backup. This is a last resort that requires significant downtime.
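    The restore is a separate command from the backup and writes into a fresh data directory rather than the live one. A sketch, assuming the snapshot.db file from the backup step and a kubeadm-style static pod setup (exact steps depend on how your control plane was bootstrapped):

```shell
# Restore the snapshot into a new data directory
# (on etcd 3.5+ prefer `etcdutl snapshot restore`)
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restored

# Then point the etcd static pod manifest (typically
# /etc/kubernetes/manifests/etcd.yaml) at the restored data directory
# and let the control plane components restart
```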

    Best Practices Summary

    Upgrading Kubernetes clusters doesn't have to be stressful. Follow these best practices to keep your upgrades smooth and reliable:

    1. Always test in staging first - Never upgrade production directly
    2. Upgrade during maintenance windows - Plan upgrades when traffic is lowest
    3. Upgrade sequentially - Never skip versions without testing
    4. Drain nodes properly - Always move pods before upgrading a node
    5. Verify after every step - Check cluster health after each upgrade phase
    6. Have a rollback plan - Know how to revert if something goes wrong
    7. Monitor during and after upgrades - Watch for errors and performance issues
    8. Keep your tooling updated - Ensure kubectl, kubeadm, and other tools are current

    Platforms like ServerlessBase simplify the upgrade process by providing automated tools for cluster management and monitoring. They handle the complex orchestration of control plane and node upgrades, reducing the risk of human error.

    Conclusion

    Kubernetes cluster upgrades are a routine part of cluster operations, but they require careful planning and execution. By following the steps outlined in this guide, you can upgrade your clusters with confidence and minimal disruption.

    The key is preparation. Verify your cluster health, test in staging, and have a rollback plan. When you follow these best practices, upgrades become routine maintenance rather than a source of anxiety.

    Ready to upgrade your cluster? Start by checking your current version and reviewing the upgrade path. Remember to test everything in staging first, and you'll be running the latest Kubernetes version in no time.
