    Horizontal Pod Autoscaler (HPA) Explained

    Learn how the Kubernetes Horizontal Pod Autoscaler automatically scales your application pods based on CPU, memory, and custom metrics to maintain optimal performance.

    You've deployed your application to Kubernetes, and it works great under normal load. But what happens when traffic spikes unexpectedly? Your pods might crash under the pressure, or users might see slow response times. This is where the Horizontal Pod Autoscaler (HPA) comes in.

    The HPA is a Kubernetes component that automatically adjusts the number of replica pods based on observed CPU utilization, memory utilization, or custom metrics. Instead of manually scaling your application up and down, HPA monitors your workload and scales it to match demand, ensuring consistent performance without manual intervention.

    How HPA Works

    Think of HPA as a smart thermostat for your application. Just as a thermostat monitors room temperature and adjusts the heating or cooling system to maintain a set point, HPA monitors your application's resource usage and adjusts the number of running pods to keep performance within desired bounds.

    The HPA controller runs continuously in the Kubernetes control plane. It periodically polls the metrics server (or custom metrics API) to collect current usage data, compares it against the target thresholds you've configured, and then adjusts the replica count accordingly.

    The Scaling Loop

    The scaling process follows a predictable cycle:

    1. Metrics Collection: HPA queries the metrics server every 15 seconds (by default) to get current CPU and memory usage
    2. Threshold Comparison: It compares the observed metrics against your target values
    3. Decision Making: If usage exceeds the target, HPA decides how many additional replicas are needed
    4. Replica Adjustment: It updates the Deployment or ReplicaSet to add or remove pods
    5. Verification: The new pod count is verified, and the cycle repeats

    This continuous loop ensures your application stays within your desired performance envelope, automatically handling traffic spikes, seasonal variations, and unexpected surges.
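    The decision step of this loop can be condensed into a few lines. This is an illustrative Python sketch, not the controller's actual code; reconcile is a made-up name, and the real controller additionally applies a tolerance band and stabilization logic before acting:

```python
import math

def reconcile(current_replicas: int, current_utilization: float,
              target_utilization: float, min_replicas: int,
              max_replicas: int) -> int:
    """One pass of the HPA loop: compare observed usage to the target
    and return the new replica count, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 90% CPU against a 70% target -> scale up to 4
print(reconcile(3, 90, 70, min_replicas=2, max_replicas=10))  # 4
```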

    CPU-Based Scaling

    The most common use case for HPA is CPU-based scaling. You define a target CPU utilization percentage (typically 70-80%), and HPA ensures that your pods don't exceed this threshold.

    Example: CPU-Based HPA Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-app-hpa
      namespace: production
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70

    This configuration tells Kubernetes to maintain between 2 and 10 replicas of the web-app deployment, with the goal of keeping average CPU utilization at 70%.

    Understanding the Math

    When HPA calculates the required replica count, it uses this formula:

    targetReplicas = ceil(
      currentReplicas * (currentMetric / targetMetric)
    )

    For example, if you have 3 replicas with 50% CPU utilization and your target is 70%:

    targetReplicas = ceil(3 * (50 / 70)) = ceil(2.14) = 3

    No scaling is needed because you're below the target. But if utilization rises to 90%:

    targetReplicas = ceil(3 * (90 / 70)) = ceil(3.86) = 4

    HPA would add one more replica to bring utilization back toward the target.
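    The worked examples above are easy to verify in code. One detail the bare formula omits: the controller skips scaling when the current/target ratio is already close to 1.0, within a tolerance of 0.1 by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance). A sketch, with desired_replicas as an illustrative name:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1):
    # Skip scaling when usage is close enough to the target (default 10%).
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 50, 70))  # 3 -- below target, ceil(2.14) = 3
print(desired_replicas(3, 90, 70))  # 4 -- above target, ceil(3.86) = 4
print(desired_replicas(3, 72, 70))  # 3 -- within tolerance, unchanged
```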

    Memory-Based Scaling

    Memory-based scaling works similarly to CPU, but with some important differences. Many runtimes hold on to memory they have allocated even after load drops, so memory usage rarely falls the way CPU does, which makes scale-down on memory sluggish. As with CPU, a Utilization target only works if the pods declare memory requests.

    Memory HPA Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: memory-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: memory-sensitive-app
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80

    Memory-based scaling is useful for applications that consume significant memory resources, but it requires careful tuning. If you set the target too high, pods might be terminated due to OOM (Out of Memory) errors before HPA can scale them up.

    Custom Metrics Scaling

    Beyond CPU and memory, HPA can scale based on metrics served through the custom or external metrics APIs, typically backed by an adapter such as the Prometheus adapter. This enables scaling based on application-specific metrics like request latency, queue depth, database connection pool usage, or custom business metrics.

    Custom Metrics Example

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: custom-metrics-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-gateway
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "1000"

    This configuration scales the API gateway based on HTTP requests per second, ensuring it can handle traffic spikes without performance degradation.
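    With an AverageValue target on a Pods-type metric, the controller scales so that the per-pod average comes back down to the target. A sketch of that arithmetic (the function name and the traffic numbers are illustrative):

```python
import math

def replicas_for_rate(current_replicas, per_pod_rates, target_average):
    """Pods-type metric with an AverageValue target: scale so the
    per-pod average returns to the target."""
    average = sum(per_pod_rates) / len(per_pod_rates)
    return math.ceil(current_replicas * average / target_average)

# 3 pods handling 1400/1600/1500 req/s against a 1000 req/s target
print(replicas_for_rate(3, [1400, 1600, 1500], 1000))  # 5
```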

    Scaling Behavior and Cooldown

    HPA includes sophisticated behavior controls to prevent unnecessary scaling fluctuations. The behavior section allows you to configure scaling policies for both upscaling and downscaling.

    Scaling Behavior Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: behavior-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 10
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
          - type: Pods
            value: 4
            periodSeconds: 15
          selectPolicy: Max
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60
          selectPolicy: Min

    Key Behavior Parameters

    • stabilizationWindowSeconds: Smooths out temporary metric fluctuations by holding back changes until recommendations have been stable across the window. The 300-second scale-down window above means HPA waits 5 minutes of consistently lower recommendations before removing pods.
    • periodSeconds: The window a policy's limit applies to. For example, value: 4 with periodSeconds: 15 means at most 4 pods added per 15 seconds. Shorter periods enable faster response but can cause more frequent adjustments.
    • selectPolicy: Determines which policy wins when several apply. Max uses the policy permitting the largest change, Min the smallest, and Disabled turns off scaling in that direction entirely.
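    How these pieces interact can be sketched in a few lines. This is an illustrative model, not the controller's implementation, and allowed_scale_up is a made-up helper:

```python
import math

def allowed_scale_up(current, policies, select_policy="Max"):
    """Upper bound on replicas after one policy period.
    Each policy is ("Percent", value) or ("Pods", value)."""
    limits = []
    for kind, value in policies:
        if kind == "Percent":
            limits.append(current + math.ceil(current * value / 100))
        else:  # "Pods"
            limits.append(current + value)
    return max(limits) if select_policy == "Max" else min(limits)

# The scaleUp policies above: 100% or 4 pods per 15s, selectPolicy Max.
# From 6 replicas, doubling (to 12) beats adding 4 (to 10).
print(allowed_scale_up(6, [("Percent", 100), ("Pods", 4)]))  # 12
```

    With selectPolicy: Min, the same policies would cap the step at 10 replicas instead.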

    Practical Walkthrough: Setting Up HPA

    Let's walk through deploying an application with HPA enabled. We'll use a simple Node.js application as an example.

    Step 1: Create a Deployment

    First, create a deployment for your application (node:18-alpine is a placeholder here; substitute your own image). Note that a CPU Utilization target only works if the pod template declares CPU requests, for example via kubectl set resources deployment web-app --requests=cpu=100m; without requests, HPA reports the current metric as <unknown>:

    kubectl create deployment web-app --image=node:18-alpine --replicas=2 --port=3000

    Step 2: Expose the Application

    Create a service to expose your deployment:

    kubectl expose deployment web-app --type=LoadBalancer --port=80 --target-port=3000

    Step 3: Create the HPA

    Now create the HorizontalPodAutoscaler:

    kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

    This command creates an HPA with a 70% CPU target, minimum 2 replicas, and maximum 10 replicas.

    Step 4: Verify HPA Status

    Check the HPA status:

    kubectl get hpa

    You should see output like:

    NAME       REFERENCE               TARGETS         MINPODS   MAXPODS   REPLICAS
    web-app    Deployment/web-app      45%/70%         2         10        2

    The 45% value shows current CPU utilization, which is below the 70% target, so no scaling is needed.

    Step 5: Simulate Traffic

    Use a load testing tool like hey to simulate traffic:

    hey -z 30s -c 100 http://<your-service-ip>

    Watch the HPA response:

    kubectl get hpa -w

    You'll see the replica count increase as CPU utilization rises above 70%.

    Common HPA Patterns

    Pattern 1: Multi-Metric Scaling

    Scale based on both CPU and memory to handle different types of resource pressure:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: multi-metric-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 3
      maxReplicas: 15
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
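    When multiple metrics are configured, as above, HPA computes a replica proposal for each metric independently and acts on the largest of them. A minimal sketch (the function name and numbers are illustrative):

```python
import math

def multi_metric_replicas(current, observations):
    """observations: list of (current_value, target_value) pairs,
    one per configured metric. HPA uses the largest proposal."""
    return max(math.ceil(current * cur / tgt) for cur, tgt in observations)

# 5 replicas: CPU at 60% (target 70), memory at 90% (target 80).
# Memory drives the decision, so the result is 6.
print(multi_metric_replicas(5, [(60, 70), (90, 80)]))  # 6
```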

    Pattern 2: Target Average Value

    Scale based on absolute metric values rather than percentages:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: absolute-value-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "500"

    Pattern 3: Cool-Down Periods

    Prevent rapid scaling fluctuations with appropriate cooldown periods:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: cooldown-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 8
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 120
          policies:
          - type: Percent
            value: 200
            periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
          - type: Percent
            value: 25
            periodSeconds: 120

    Limitations and Best Practices

    Limitations

    1. No Vertical Scaling: HPA only scales horizontally (adds/removes pods). For vertical scaling (increasing pod resources), use Vertical Pod Autoscaler (VPA).
    2. Metric Collection Delay: Metrics are collected and evaluated on a roughly 15-second cadence by default, so HPA reacts to sudden spikes with some lag, and performance can degrade briefly before new pods come online.
    3. Stateful Applications: HPA works best with stateless applications. Stateful applications may require additional coordination.
    4. Startup Time: New pods take time to start and reach steady state, which can delay scaling responses.

    Best Practices

    1. Set Appropriate Min/Max Values: Don't set maxReplicas arbitrarily high. Choose an upper limit your infrastructure can actually schedule.
    2. Use Stabilization Windows: Always configure stabilization windows to prevent scaling oscillations.
    3. Monitor HPA Performance: Regularly check HPA metrics and adjust targets based on actual application behavior.
    4. Combine with Vertical Scaling: Use VPA for resource allocation and HPA for replica count management.
    5. Test Scaling Behavior: Simulate traffic spikes in staging environments to verify HPA behavior before production deployment.

    Troubleshooting HPA

    Issue: HPA Not Scaling

    Symptoms: CPU utilization exceeds target but replica count doesn't increase.

    Common Causes:

    • Metrics server not installed or not collecting metrics
    • Pods missing CPU/memory resource requests (Utilization targets show <unknown>)
    • HPA not properly configured
    • Min replicas equals max replicas
    • Application not exposing the custom metric

    Solutions:

    # Check if metrics server is installed
    kubectl get pods -n kube-system | grep metrics-server
     
    # Verify HPA configuration
    kubectl describe hpa <hpa-name>
     
    # Check metrics server metrics
    kubectl top pods

    Issue: Scaling Oscillations

    Symptoms: Replica count rapidly increases and decreases.

    Solutions:

    • Increase stabilization window seconds
    • Adjust scaling policies (reduce scaling rate)
    • Check if metrics are fluctuating due to external factors

    Issue: Pods Terminating Immediately

    Symptoms: New pods are created but immediately terminated.

    Solutions:

    • Check application logs for startup errors
    • Verify resource requests and limits are appropriate
    • Ensure the application can run within the pod's resource allocation

    Conclusion

    The Horizontal Pod Autoscaler is a powerful tool for maintaining application performance under varying load conditions. By automatically scaling pods based on CPU, memory, or custom metrics, HPA ensures your application remains responsive during traffic spikes while avoiding unnecessary resource consumption during quiet periods.

    Remember that HPA is most effective when combined with proper monitoring, appropriate scaling policies, and realistic target values. Start with conservative settings and adjust based on your application's behavior and your infrastructure's capacity.

    Platforms like ServerlessBase simplify the deployment and management of Kubernetes applications, including HPA configuration, making it easier to implement autoscaling strategies without deep Kubernetes expertise.

    Next Steps

    1. Install Metrics Server: Ensure the metrics server is installed in your cluster
    2. Configure HPA: Apply HPA configurations to your existing deployments
    3. Monitor Performance: Use monitoring tools to track HPA behavior and application performance
    4. Tune Parameters: Adjust scaling policies and targets based on observed behavior
    5. Test Under Load: Simulate traffic spikes to verify HPA responds appropriately
