ServerlessBase Blog

    A comprehensive guide to safely upgrading Kubernetes clusters with minimal downtime and risk

    Introduction to Kubernetes Cluster Upgrades: Best Practices

    You've been running a Kubernetes cluster for months. Your applications are deployed, your monitoring is set up, and everything seems stable. Then you notice the release notes: Kubernetes 1.28 is out with security patches and new features. You know you should upgrade, but the thought of bringing down your production workloads keeps you up at night.

    Kubernetes cluster upgrades are one of the most critical operations you'll perform. A failed upgrade can leave your cluster in a broken state, require manual intervention, or worse—cause data loss. This guide walks you through the upgrade process, common pitfalls, and best practices that keep your clusters running smoothly.

    Understanding Upgrade Versions and Compatibility

    Kubernetes follows a strict version compatibility policy. You can only upgrade to the next minor version in the sequence. For example, if you're running 1.27, you can upgrade to 1.28, but not directly to 1.29. This is because Kubernetes maintains backward compatibility within a minor version but introduces breaking changes between minor versions.
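    This one-hop rule is easy to enforce in your upgrade tooling before any command touches the cluster. Below is a minimal bash sketch (the function name and calling convention are ours, purely illustrative) that accepts a target version only when it is exactly one minor version ahead of the current one:

```shell
#!/usr/bin/env bash
# Hypothetical helper: accept a target version only when it is exactly one
# minor version ahead of the current version (e.g. 1.27.x -> 1.28.x).
is_valid_upgrade() {
  local current="$1" target="$2"
  local cur_major cur_minor tgt_major tgt_minor
  IFS=. read -r cur_major cur_minor _ <<< "$current"
  IFS=. read -r tgt_major tgt_minor _ <<< "$target"
  [[ "$tgt_major" == "$cur_major" && "$tgt_minor" -eq $((cur_minor + 1)) ]]
}

is_valid_upgrade 1.27.4 1.28.0 && echo "1.27 -> 1.28: allowed"
is_valid_upgrade 1.27.4 1.29.0 || echo "1.27 -> 1.29: blocked (skips 1.28)"
```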

    The upgrade process involves three main pieces: the control plane, the worker nodes, and the cluster add-ons (CNI plugin, DNS, ingress controllers). The control plane manages cluster state and requires a coordinated upgrade. Worker nodes run your workloads and must be upgraded one at a time to avoid disrupting running pods.

    # Check your current cluster version
    # (the --short flag was removed in recent kubectl releases; the output is concise by default now)
    kubectl version
     
    # Verify you're on a supported upgrade path
    # Example: 1.27.x can upgrade to 1.28.x

    Pre-Upgrade Checklist

    Before you begin any upgrade, you need to verify your cluster is in a good state. This checklist prevents surprises during the upgrade process.

    Verify cluster health:

    # Check node status
    kubectl get nodes
     
    # Verify all pods are running
    kubectl get pods --all-namespaces
     
    # Check for any critical errors in the system
    kubectl get events --sort-by='.lastTimestamp'
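    These checks can be turned into a scripted go/no-go gate. The helper below is our own sketch, not a standard tool; it takes the text output of `kubectl get nodes --no-headers` as an argument, so the logic can be exercised without a live cluster:

```shell
#!/usr/bin/env bash
# Hypothetical pre-upgrade gate: succeed only if every node reports Ready.
# Pass in the output of `kubectl get nodes --no-headers`.
all_nodes_ready() {
  local nodes="$1"
  # grep -w treats "NotReady" as a different word than "Ready"
  ! grep -vqw "Ready" <<< "$nodes"
}

# Usage against a real cluster:
#   all_nodes_ready "$(kubectl get nodes --no-headers)" || { echo "abort upgrade"; exit 1; }
```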

    Review your application manifests:

    • Ensure all deployments have proper resource requests and limits
    • Verify that your applications are compatible with the new Kubernetes version
    • Check that your ingress controllers and CNI plugins support the target version
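    One concrete compatibility check is to inventory the API versions your manifests actually use and compare them against the target release's deprecation notes. A rough sketch (our own grep; purpose-built tools such as pluto or kubent do this far more thoroughly):

```shell
#!/usr/bin/env bash
# Hypothetical scan: list the distinct apiVersion values used across a
# directory of manifest files, for comparison against the release notes.
list_api_versions() {
  grep -rhoE '^apiVersion: *[^ ]+' "$1" | sort -u
}

# Usage: list_api_versions ./manifests
```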

    Backup your cluster:

    # Backup etcd (the database that stores all cluster state)
    # This command writes a point-in-time snapshot of the etcd keyspace to snapshot.db
    ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --endpoints=https://127.0.0.1:2379
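    It's worth sanity-checking the snapshot before moving on, since a corrupt backup tends to be discovered at the worst possible moment. A sketch, assuming the snapshot.db file from the command above (note that newer etcd releases move this subcommand to etcdutl):

```shell
# Verify the snapshot is readable and inspect its size, revision, and key count
# (on etcd 3.5+ prefer `etcdutl snapshot status`; etcdctl prints a deprecation warning)
ETCDCTL_API=3 etcdctl snapshot status snapshot.db --write-out=table
```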

    Test in a staging environment: Never upgrade production clusters directly. Create a staging cluster with similar workloads and test the upgrade process there first. This identifies compatibility issues before they affect your production systems.

    Control Plane Upgrade Process

    The control plane upgrade is the most critical step. It involves upgrading the API server, controller manager, scheduler, and etcd. Most Kubernetes distributions provide automated tools to handle this, but understanding the process helps you troubleshoot issues.

    For managed Kubernetes (EKS, GKE, AKS):

    # AWS EKS example - upgrade the control plane
    aws eks update-cluster-version \
      --name my-cluster \
      --kubernetes-version 1.28
     
    # Monitor the upgrade progress
    aws eks describe-cluster \
      --name my-cluster \
      --query 'cluster.status'

    For self-hosted clusters (kubeadm):

    # Upgrade the control plane components
    kubeadm upgrade apply v1.28.0
     
    # This command upgrades the API server, controller manager, and scheduler
    # It also updates the etcd database if needed

    The control plane upgrade typically takes 5-15 minutes depending on your cluster size. In HA and managed setups the components are upgraded in a rolling fashion, so API interruptions are brief; on a single control plane node the API server is unavailable while its components restart. Either way, your workloads keep serving traffic, because running pods do not depend on the API server.
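    Rather than re-running the describe command by hand, you can poll until the cluster reports the status you expect. This generic helper is our own sketch, not part of any CLI; the status command is passed as a string so the loop can be tested with a stub:

```shell
#!/usr/bin/env bash
# Hypothetical poller: run a status command repeatedly until it prints the
# wanted value or the attempts run out. delay is in seconds.
wait_for_status() {
  local cmd="$1" want="$2" tries="${3:-30}" delay="${4:-30}"
  local i
  for ((i = 0; i < tries; i++)); do
    [[ "$(eval "$cmd")" == "$want" ]] && return 0
    sleep "$delay"
  done
  return 1
}

# Usage against EKS:
#   wait_for_status "aws eks describe-cluster --name my-cluster --query cluster.status --output text" ACTIVE
```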

    Worker Node Upgrade Strategy

    Worker nodes must be upgraded one at a time to avoid disrupting running pods. The kubelet must never be newer than the API server (under the official version skew policy it may lag behind by up to a few minor versions), which is why the control plane is always upgraded first.

    Upgrade nodes sequentially:

    # Get the list of worker nodes
    kubectl get nodes
     
    # Drain a node to safely move its pods elsewhere
    kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
     
    # Upgrade the node's kubelet and kubeadm packages
    # (the version suffix depends on your package repository: the legacy
    # apt.kubernetes.io repo used -00, pkgs.k8s.io uses suffixes like -1.1)
    ssh <node-name>
    sudo apt-get update
    sudo apt-get install -y kubelet=1.28.0-00 kubeadm=1.28.0-00
    sudo systemctl daemon-reload
    sudo systemctl restart kubelet
     
    # Make the node schedulable again
    kubectl uncordon <node-name>

    Applying the node configuration upgrade with kubeadm:

    # Run this on each worker node, not once for the whole cluster
    kubeadm upgrade node
     
    # This command upgrades the node's local kubelet configuration and kube-proxy
    # It does not upgrade the kubelet package or the container runtime; do that separately and restart kubelet

    Upgrading one node at a time means your workloads can keep running throughout the process, provided they have multiple replicas spread across nodes. As you upgrade each node, its pods are rescheduled onto the others. Once all nodes are upgraded, your entire cluster is running the new version.
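    The drain, upgrade, uncordon sequence can be wrapped in a loop so every node is handled the same way. In this sketch (our own wrapper, with the per-node package upgrade left as a placeholder since it depends on your setup), the kubectl binary is injectable so the flow can be dry-run without a cluster:

```shell
#!/usr/bin/env bash
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical wrapper: upgrade worker nodes strictly one at a time,
# stopping at the first failure so the cluster is never half-drained.
rolling_node_upgrade() {
  local node
  for node in "$@"; do
    "$KUBECTL" drain "$node" --ignore-daemonsets --delete-emptydir-data || return 1
    # Placeholder: ssh in and upgrade the kubelet package here
    "$KUBECTL" uncordon "$node" || return 1
  done
}

# Dry run without a cluster:
#   KUBECTL=echo rolling_node_upgrade node-1 node-2
```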

    Handling Version Skips

    Sometimes you need to skip a minor version. For example, you might be on 1.25 and want to go directly to 1.28. This is possible but requires careful planning.

    For managed Kubernetes: Most managed services don't allow version skips. You must upgrade through each intermediate version. For example, you can only go from 1.25 to 1.26, then 1.26 to 1.27, and finally 1.27 to 1.28.

    For self-hosted clusters: kubeadm enforces the same policy and refuses to jump more than one minor version, so there is no shortcut here either:

    # Upgrade one minor version at a time; kubeadm rejects a direct jump from 1.25
    kubeadm upgrade apply v1.26.0
     
    # After each control plane hop, upgrade the kubelets and verify the cluster,
    # then repeat the process for v1.27.0 and finally v1.28.0

    Even when you must traverse several versions, rushing through them is not recommended for production clusters. Each intermediate version introduces its own changes and potential issues, and pausing to verify at each hop gives you more opportunities to catch problems early.
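    If you do script the multi-hop path, keep the loop explicit about what must happen between hops. A dry-runnable sketch (our own function; the version numbers are examples, and the kubeadm binary is injectable so the flow can be exercised offline):

```shell
#!/usr/bin/env bash
KUBEADM="${KUBEADM:-kubeadm}"

# Hypothetical helper: walk the control plane through each intermediate
# minor version in order, stopping at the first failure.
upgrade_through() {
  local version
  for version in "$@"; do
    "$KUBEADM" upgrade apply -y "$version" || return 1
    # Between hops: upgrade the kubelets on every node and verify
    # workloads before moving on (see the checklists in this guide)
  done
}

# Usage:   upgrade_through v1.26.0 v1.27.0 v1.28.0
# Dry run: KUBEADM=echo upgrade_through v1.26.0 v1.27.0 v1.28.0
```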

    Post-Upgrade Verification

    After completing the upgrade, you must verify that everything is working correctly. This step catches issues that might not be immediately apparent.

    Verify cluster components:

    # Check all control plane components are running
    kubectl get pods -n kube-system
     
    # Verify the API server is responding
    kubectl cluster-info
     
    # Check node status
    kubectl get nodes -o wide

    Verify application workloads:

    # Check that all pods are running
    kubectl get pods --all-namespaces
     
    # Verify your applications are healthy
    kubectl get deployments --all-namespaces
    kubectl get statefulsets --all-namespaces
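    A quick way to catch regressions here is to compare each deployment's ready replica count against its desired count. This sketch (our helper, not a kubectl feature) parses the text output of `kubectl get deployments -A --no-headers`, so it can be tested offline:

```shell
#!/usr/bin/env bash
# Hypothetical check: print "namespace/name" for deployments whose READY
# column (e.g. "2/3") shows fewer ready replicas than desired.
unhealthy_deployments() {
  awk '{ split($3, r, "/"); if (r[1] != r[2]) print $1 "/" $2 }' <<< "$1"
}

# Usage: unhealthy_deployments "$(kubectl get deployments -A --no-headers)"
```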

    Check for any remaining issues:

    # Look for any warnings or errors
    kubectl get events --sort-by='.lastTimestamp' | grep -i error
     
    # Verify your ingress resources are working
    kubectl get ingress --all-namespaces

    Common Upgrade Pitfalls

    Forgetting to drain nodes: Never upgrade a node without first draining it. Draining cordons the node and gracefully evicts its pods, respecting PodDisruptionBudgets, so workloads are rescheduled elsewhere before any disruption. Skipping this step means pods are killed abruptly when the kubelet restarts or the node reboots, which can cause downtime.

    Upgrading incompatible applications: Some applications might not work with the new Kubernetes version. Always test your applications in staging before upgrading production. Check the release notes for any breaking changes.

    Ignoring resource requirements: Upgrades can temporarily increase resource usage. Ensure your nodes have enough capacity to handle the additional load during the upgrade process.

    Not backing up etcd: The etcd database contains all cluster state. Always take a backup before upgrading, especially for production clusters. If something goes wrong, you can restore from the backup.

    Rollback Strategy

    Despite your best efforts, upgrades can sometimes fail. Having a rollback plan is essential for minimizing downtime.

    For managed Kubernetes: Control plane downgrades are generally not supported. EKS, GKE, and AKS will not move a cluster back to an earlier Kubernetes version once an upgrade has completed. In practice your options are to fix forward, or to create a new cluster on the previous version and redeploy your workloads there.

    For self-hosted clusters: kubeadm has no built-in downgrade command either. Rolling back a kubeadm control plane means manually reinstalling the previous component versions and, in most cases, restoring the etcd snapshot you took before the upgrade. Treat this as a disaster recovery procedure, not a routine operation, and rehearse it in staging before you ever need it.

    Restoring from etcd backup: If the rollback doesn't resolve the issue, you might need to restore from your etcd backup. This is a last resort that requires significant downtime.
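    The restore is a separate command from the backup and writes into a fresh data directory rather than the live one. A sketch, assuming the snapshot.db file from the backup step and a kubeadm-style static pod setup (exact steps depend on how your control plane was bootstrapped):

```shell
# Restore the snapshot into a new data directory
# (on etcd 3.5+ prefer `etcdutl snapshot restore`)
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
  --data-dir=/var/lib/etcd-restored

# Then point the etcd static pod manifest (typically
# /etc/kubernetes/manifests/etcd.yaml) at the restored data directory
# and let the control plane components restart
```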

    Best Practices Summary

    Upgrading Kubernetes clusters doesn't have to be stressful. Follow these best practices to keep your upgrades smooth and reliable:

    1. Always test in staging first - Never upgrade production directly
    2. Upgrade during maintenance windows - Plan upgrades when traffic is lowest
    3. Upgrade sequentially - Never skip versions without testing
    4. Drain nodes properly - Always move pods before upgrading a node
    5. Verify after every step - Check cluster health after each upgrade phase
    6. Have a rollback plan - Know how to revert if something goes wrong
    7. Monitor during and after upgrades - Watch for errors and performance issues
    8. Keep your tooling updated - Ensure kubectl, kubeadm, and other tools are current

    Platforms like ServerlessBase simplify the upgrade process by providing automated tools for cluster management and monitoring. They handle the complex orchestration of control plane and node upgrades, reducing the risk of human error.

    Conclusion

    Kubernetes cluster upgrades are a routine part of cluster operations, but they require careful planning and execution. By following the steps outlined in this guide, you can upgrade your clusters with confidence and minimal disruption.

    The key is preparation. Verify your cluster health, test in staging, and have a rollback plan. When you follow these best practices, upgrades become routine maintenance rather than a source of anxiety.

    Ready to upgrade your cluster? Start by checking your current version and reviewing the upgrade path. Remember to test everything in staging first, and you'll be running the latest Kubernetes version in no time.
