Horizontal Pod Autoscaler (HPA) Explained
You've deployed your application to Kubernetes, and it works great under normal load. But what happens when traffic spikes unexpectedly? Your pods might crash under the pressure, or users may experience slow response times. This is where the Horizontal Pod Autoscaler (HPA) comes in.
The HPA is a Kubernetes component that automatically adjusts the number of replica pods based on observed CPU utilization, memory utilization, or custom metrics. Instead of manually scaling your application up and down, HPA monitors your workload and scales it to match demand, ensuring consistent performance without manual intervention.
How HPA Works
Think of HPA as a smart thermostat for your application. Just as a thermostat monitors room temperature and adjusts the heating or cooling system to maintain a set point, HPA monitors your application's resource usage and adjusts the number of running pods to keep performance within desired bounds.
The HPA controller runs continuously in the Kubernetes control plane. It periodically polls the metrics server (or custom metrics API) to collect current usage data, compares it against the target thresholds you've configured, and then adjusts the replica count accordingly.
The Scaling Loop
The scaling process follows a predictable cycle:
- Metrics Collection: HPA queries the metrics server every 15 seconds (by default) to get current CPU and memory usage
- Threshold Comparison: It compares the observed metrics against your target values
- Decision Making: If usage exceeds the target, HPA decides how many additional replicas are needed
- Replica Adjustment: It updates the Deployment or ReplicaSet to add or remove pods
- Verification: The new pod count is verified, and the cycle repeats
This continuous loop ensures your application stays within your desired performance envelope, automatically handling traffic spikes, seasonal variations, and unexpected surges.
CPU-Based Scaling
The most common use case for HPA is CPU-based scaling. You define a target CPU utilization percentage (typically 70-80%), and HPA adjusts the replica count to keep average utilization near that target.
Example: CPU-Based HPA Configuration
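A sketch of such a manifest using the autoscaling/v2 API (the HPA object's own name is assumed; the web-app deployment name, the 2-10 replica range, and the 70% target come from the description below):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa          # HPA object name assumed for illustration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app            # the deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # target average CPU utilization (% of requests)
```

Note that utilization targets are expressed as a percentage of the pods' CPU requests, so the deployment's containers must declare CPU requests for this HPA to function.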
This configuration tells Kubernetes to maintain between 2 and 10 replicas of the web-app deployment, with the goal of keeping average CPU utilization at 70%.
Understanding the Math
When HPA calculates the required replica count, it uses this formula:

desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue)

For example, if you have 3 replicas with 50% CPU utilization and your target is 70%, the result is ceil(3 × 50 / 70) = ceil(2.14) = 3. No scaling is needed because you're below the target. But if utilization rises to 90%, the result is ceil(3 × 90 / 70) = ceil(3.86) = 4.
HPA would add one more replica to bring utilization back toward the target.
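The scaling arithmetic can be sketched in a few lines of Python (a hypothetical helper for illustration, not part of Kubernetes):

```python
import math

def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float) -> int:
    """Core HPA formula:
    desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 3 replicas at 50% CPU with a 70% target: ceil(2.14) = 3, no scaling needed
print(desired_replicas(3, 50, 70))  # 3

# Utilization rises to 90%: ceil(3.86) = 4, one replica is added
print(desired_replicas(3, 90, 70))  # 4
```

The real controller additionally clamps the result to the configured min/max replica bounds and applies tolerance and stabilization logic before acting.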
Memory-Based Scaling
Memory-based scaling works similarly to CPU, but with some important differences. Memory is a less reliable scaling signal: many runtimes hold on to allocated memory rather than returning it to the operating system, so memory usage often stays high even after load drops, which can prevent the HPA from ever scaling back down.
Memory HPA Configuration
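A sketch of a memory-based HPA, assuming the same web-app deployment and an illustrative 80% target (utilization is measured against the pods' memory requests, which must therefore be set):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-memory-hpa   # name assumed for illustration
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80   # percent of each pod's memory request
```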
Memory-based scaling is useful for applications that consume significant memory resources, but it requires careful tuning. If you set the target too high, pods might be terminated due to OOM (Out of Memory) errors before HPA can scale them up.
Custom Metrics Scaling
Beyond CPU and memory, HPA can scale based on any metric exposed through the custom metrics or external metrics APIs, typically served by an adapter such as the Prometheus Adapter. This enables scaling based on application-specific metrics like request latency, database connection pool usage, or custom business metrics.
Custom Metrics Example
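A sketch of a requests-per-second HPA (the deployment name, metric name, replica bounds, and 100 req/s target are all assumptions; the metric must be published through a custom metrics adapter for this to work):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa            # name assumed
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway              # deployment name assumed
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # metric name assumed; served by a custom metrics adapter
      target:
        type: AverageValue
        averageValue: "100"              # keep each pod at roughly 100 req/s
```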
This configuration scales the API gateway based on HTTP requests per second, ensuring it can handle traffic spikes without performance degradation.
Scaling Behavior and Cooldown
HPA includes sophisticated behavior controls to prevent unnecessary scaling fluctuations. The behavior section allows you to configure scaling policies for both upscaling and downscaling.
Scaling Behavior Configuration
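A behavior stanza sketch (it sits under spec: in the HPA manifest; the specific values are illustrative, though 300 seconds is the documented scale-down default):

```yaml
# Fragment of an HPA spec
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to load spikes immediately
    policies:
    - type: Percent
      value: 100                       # at most double the replica count...
      periodSeconds: 60                # ...per minute
    selectPolicy: Max                  # use the most aggressive matching policy
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 minutes of low usage before removing pods
    policies:
    - type: Pods
      value: 1                         # remove at most one pod...
      periodSeconds: 60                # ...per minute
```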
Key Behavior Parameters
- stabilizationWindowSeconds: Prevents rapid scaling up and down during temporary metric fluctuations. The default scale-down window of 300 seconds means HPA waits 5 minutes after a spike before removing pods.
- periodSeconds: How often the scaling policy is evaluated. Lower values enable faster response but can cause more frequent adjustments.
- selectPolicy: Determines which policy to use when multiple policies match. Max uses the most aggressive scaling; Min uses the most conservative.
Practical Walkthrough: Setting Up HPA
Let's walk through deploying an application with HPA enabled. We'll use a simple Node.js application as an example.
Step 1: Create a Deployment
First, create a deployment for your application:
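A minimal deployment sketch (the image reference is a placeholder for your own Node.js image; note that CPU requests must be declared, because utilization-based HPA is computed as a percentage of requests):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: registry.example.com/web-app:1.0   # placeholder image
        ports:
        - containerPort: 3000                      # port assumed for the Node.js app
        resources:
          requests:
            cpu: 100m        # required for CPU-utilization-based HPA
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 256Mi
```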
Step 2: Expose the Application
Create a service to expose your deployment:
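A matching service sketch (assuming the app listens on port 3000 as in the deployment above):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
  ports:
  - port: 80
    targetPort: 3000   # the container port of the application
```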
Step 3: Create the HPA
Now create the HorizontalPodAutoscaler:
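The quickest way is the kubectl autoscale command, which generates the HPA object described below:

```shell
kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10
```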
This command creates an HPA with a 70% CPU target, minimum 2 replicas, and maximum 10 replicas.
Step 4: Verify HPA Status
Check the HPA status:
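```shell
kubectl get hpa web-app
```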
You should see output like:
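Illustrative output (the exact columns and target formatting vary with your kubectl version):

```
NAME      REFERENCE            TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
web-app   Deployment/web-app   cpu: 45%/70%   2         10        2          3m
```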
The 45% value shows current CPU utilization, which is below the 70% target, so no scaling is needed.
Step 5: Simulate Traffic
Use a load testing tool like hey to simulate traffic:
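For example, a two-minute run with 50 concurrent workers (replace the placeholder with your service's reachable address, e.g. via kubectl port-forward or a LoadBalancer IP):

```shell
hey -z 2m -c 50 http://<service-address>/
```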
Watch the HPA response:
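```shell
kubectl get hpa web-app --watch
```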
You'll see the replica count increase as CPU utilization rises above 70%.
Common HPA Patterns
Pattern 1: Multi-Metric Scaling
Scale based on both CPU and memory to handle different types of resource pressure:
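When multiple metrics are configured, HPA computes a desired replica count for each metric and uses the largest. A metrics fragment sketch with illustrative targets:

```yaml
# Metrics fragment of an HPA spec: the metric demanding the most replicas wins
metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80
```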
Pattern 2: Target Average Value
Scale based on absolute metric values rather than percentages:
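A fragment sketch using an AverageValue target (the 500Mi figure is illustrative):

```yaml
# Target an absolute per-pod value instead of a percentage of requests
metrics:
- type: Resource
  resource:
    name: memory
    target:
      type: AverageValue
      averageValue: 500Mi   # keep average memory use near 500Mi per pod
```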
Pattern 3: Cool-Down Periods
Prevent rapid scaling fluctuations with appropriate cooldown periods:
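A behavior fragment sketch with conservative, illustrative windows:

```yaml
# Fragment of an HPA spec: longer windows smooth out metric fluctuations
behavior:
  scaleDown:
    stabilizationWindowSeconds: 600   # require 10 minutes of low usage before removing pods
  scaleUp:
    stabilizationWindowSeconds: 60    # tolerate brief spikes before adding pods
```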
Limitations and Best Practices
Limitations
- No Vertical Scaling: HPA only scales horizontally (adds/removes pods). For vertical scaling (increasing pod resources), use Vertical Pod Autoscaler (VPA).
- Metric Collection Delay: The control loop evaluates metrics every 15 seconds by default, so scaling always lags sudden load changes, and performance can degrade briefly during rapid spikes.
- Stateful Applications: HPA works best with stateless applications. Stateful applications may require additional coordination.
- Startup Time: New pods take time to start and reach steady state, which can delay scaling responses.
Best Practices
- Set Appropriate Min/Max Values: Never set maxReplicas to an arbitrarily high value. Set a reasonable upper limit based on your infrastructure capacity.
- Use Stabilization Windows: Always configure stabilization windows to prevent scaling oscillations.
- Monitor HPA Performance: Regularly check HPA metrics and adjust targets based on actual application behavior.
- Combine with Vertical Scaling: Use VPA for resource allocation and HPA for replica count management.
- Test Scaling Behavior: Simulate traffic spikes in staging environments to verify HPA behavior before production deployment.
Troubleshooting HPA
Issue: HPA Not Scaling
Symptoms: CPU utilization exceeds target but replica count doesn't increase.
Common Causes:
- Metrics server not installed or not collecting metrics
- HPA not properly configured
- Min replicas equals max replicas
- Application not exposing metrics
Solutions:
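A few diagnostic commands that cover the causes above (the HPA name web-app is assumed):

```shell
# Is the metrics API registered and available?
kubectl get apiservices v1beta1.metrics.k8s.io

# Are per-pod metrics actually being collected?
kubectl top pods

# Inspect the HPA's configuration, current targets, and recent events
kubectl describe hpa web-app
```

The Events section of kubectl describe usually states exactly why scaling is blocked, e.g. "failed to get cpu utilization: missing request for cpu".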
Issue: Scaling Oscillations
Symptoms: Replica count rapidly increases and decreases.
Solutions:
- Increase stabilization window seconds
- Adjust scaling policies (reduce scaling rate)
- Check if metrics are fluctuating due to external factors
Issue: Pods Terminating Immediately
Symptoms: New pods are created but immediately terminated.
Solutions:
- Check application logs for startup errors
- Verify resource limits are appropriate
- Ensure application can handle the pod's resource allocation
Conclusion
The Horizontal Pod Autoscaler is a powerful tool for maintaining application performance under varying load conditions. By automatically scaling pods based on CPU, memory, or custom metrics, HPA ensures your application remains responsive during traffic spikes while avoiding unnecessary resource consumption during quiet periods.
Remember that HPA is most effective when combined with proper monitoring, appropriate scaling policies, and realistic target values. Start with conservative settings and adjust based on your application's behavior and your infrastructure's capacity.
Platforms like ServerlessBase simplify the deployment and management of Kubernetes applications, including HPA configuration, making it easier to implement autoscaling strategies without deep Kubernetes expertise.
Next Steps
- Install Metrics Server: Ensure the metrics server is installed in your cluster
- Configure HPA: Apply HPA configurations to your existing deployments
- Monitor Performance: Use monitoring tools to track HPA behavior and application performance
- Tune Parameters: Adjust scaling policies and targets based on observed behavior
- Test Under Load: Simulate traffic spikes to verify HPA responds appropriately