    Horizontal Pod Autoscaler (HPA) Explained

    Learn how the Kubernetes Horizontal Pod Autoscaler automatically scales your application pods based on CPU, memory, and custom metrics to maintain optimal performance.

    You've deployed your application to Kubernetes, and it works great under normal load. But what happens when traffic spikes unexpectedly? Your pods might crash under the pressure, or users might see slow response times. This is where the Horizontal Pod Autoscaler (HPA) comes in.

    The HPA is a Kubernetes component that automatically adjusts the number of replica pods based on observed CPU utilization, memory utilization, or custom metrics. Instead of manually scaling your application up and down, HPA monitors your workload and scales it to match demand, ensuring consistent performance without manual intervention.

    How HPA Works

    Think of HPA as a smart thermostat for your application. Just as a thermostat monitors room temperature and adjusts the heating or cooling system to maintain a set point, HPA monitors your application's resource usage and adjusts the number of running pods to keep performance within desired bounds.

    The HPA controller runs continuously in the Kubernetes control plane. It periodically polls the metrics server (or custom metrics API) to collect current usage data, compares it against the target thresholds you've configured, and then adjusts the replica count accordingly.

    The Scaling Loop

    The scaling process follows a predictable cycle:

    1. Metrics Collection: HPA queries the metrics server every 15 seconds (by default) to get current CPU and memory usage
    2. Threshold Comparison: It compares the observed metrics against your target values
    3. Decision Making: If usage exceeds the target, HPA decides how many additional replicas are needed
    4. Replica Adjustment: It updates the Deployment or ReplicaSet to add or remove pods
    5. Verification: The new pod count is verified, and the cycle repeats

    This continuous loop ensures your application stays within your desired performance envelope, automatically handling traffic spikes, seasonal variations, and unexpected surges.
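    The decision step of this loop can be condensed into a few lines. This is an illustrative Python sketch, not the controller's actual code; reconcile is a made-up name, and the real controller additionally applies a tolerance band and stabilization logic before acting:

```python
import math

def reconcile(current_replicas: int, current_utilization: float,
              target_utilization: float, min_replicas: int,
              max_replicas: int) -> int:
    """One pass of the HPA loop: compare observed usage to the target
    and return the new replica count, clamped to the configured bounds."""
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))

# 3 replicas at 90% CPU against a 70% target -> scale up to 4
print(reconcile(3, 90, 70, min_replicas=2, max_replicas=10))  # 4
```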

    CPU-Based Scaling

    The most common use case for HPA is CPU-based scaling. You define a target CPU utilization percentage (typically 70-80%), and HPA ensures that your pods don't exceed this threshold.

    Example: CPU-Based HPA Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: web-app-hpa
      namespace: production
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: web-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70

    This configuration tells Kubernetes to maintain between 2 and 10 replicas of the web-app deployment, with the goal of keeping average CPU utilization at 70%.

    Understanding the Math

    When HPA calculates the required replica count, it uses this formula:

    targetReplicas = ceil(
      currentReplicas * (currentMetric / targetMetric)
    )

    For example, if you have 3 replicas with 50% CPU utilization and your target is 70%:

    targetReplicas = ceil(3 * (50 / 70)) = ceil(2.14) = 3

    No scaling is needed because you're below the target. But if utilization rises to 90%:

    targetReplicas = ceil(3 * (90 / 70)) = ceil(3.86) = 4

    HPA would add one more replica to bring utilization back toward the target.
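    The worked examples above are easy to verify in code. One detail the bare formula omits: the controller skips scaling when the current/target ratio is already close to 1.0, within a tolerance of 0.1 by default (configurable via the kube-controller-manager flag --horizontal-pod-autoscaler-tolerance). A sketch, with desired_replicas as an illustrative name:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     tolerance=0.1):
    # Skip scaling when usage is close enough to the target (default 10%).
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

print(desired_replicas(3, 50, 70))  # 3 -- below target, ceil(2.14) = 3
print(desired_replicas(3, 90, 70))  # 4 -- above target, ceil(3.86) = 4
print(desired_replicas(3, 72, 70))  # 3 -- within tolerance, unchanged
```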

    Memory-Based Scaling

    Memory-based scaling works similarly to CPU, but with some important differences. Many runtimes hold on to memory they have allocated even after load drops, so memory usage rarely falls the way CPU does, which makes scale-down on memory sluggish. As with CPU, a Utilization target only works if the pods declare memory requests.

    Memory HPA Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: memory-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: memory-sensitive-app
      minReplicas: 1
      maxReplicas: 5
      metrics:
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80

    Memory-based scaling is useful for applications that consume significant memory resources, but it requires careful tuning. If you set the target too high, pods might be terminated due to OOM (Out of Memory) errors before HPA can scale them up.

    Custom Metrics Scaling

    Beyond CPU and memory, HPA can scale based on metrics served through the custom or external metrics APIs, typically backed by an adapter such as the Prometheus adapter. This enables scaling based on application-specific metrics like request latency, queue depth, database connection pool usage, or custom business metrics.

    Custom Metrics Example

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: custom-metrics-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: api-gateway
      minReplicas: 3
      maxReplicas: 20
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "1000"

    This configuration scales the API gateway based on HTTP requests per second, ensuring it can handle traffic spikes without performance degradation.
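    With an AverageValue target on a Pods-type metric, the controller scales so that the per-pod average comes back down to the target. A sketch of that arithmetic (the function name and the traffic numbers are illustrative):

```python
import math

def replicas_for_rate(current_replicas, per_pod_rates, target_average):
    """Pods-type metric with an AverageValue target: scale so the
    per-pod average returns to the target."""
    average = sum(per_pod_rates) / len(per_pod_rates)
    return math.ceil(current_replicas * average / target_average)

# 3 pods handling 1400/1600/1500 req/s against a 1000 req/s target
print(replicas_for_rate(3, [1400, 1600, 1500], 1000))  # 5
```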

    Scaling Behavior and Cooldown

    HPA includes sophisticated behavior controls to prevent unnecessary scaling fluctuations. The behavior section allows you to configure scaling policies for both upscaling and downscaling.

    Scaling Behavior Configuration

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: behavior-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 10
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
          - type: Percent
            value: 100
            periodSeconds: 15
          - type: Pods
            value: 4
            periodSeconds: 15
          selectPolicy: Max
        scaleDown:
          stabilizationWindowSeconds: 300
          policies:
          - type: Percent
            value: 50
            periodSeconds: 60
          selectPolicy: Min

    Key Behavior Parameters

    • stabilizationWindowSeconds: Smooths out temporary metric fluctuations by holding back changes until recommendations have been stable across the window. The 300-second scale-down window above means HPA waits 5 minutes of consistently lower recommendations before removing pods.
    • periodSeconds: The window a policy's limit applies to. For example, value: 4 with periodSeconds: 15 means at most 4 pods added per 15 seconds. Shorter periods enable faster response but can cause more frequent adjustments.
    • selectPolicy: Determines which policy wins when several apply. Max uses the policy permitting the largest change, Min the smallest, and Disabled turns off scaling in that direction entirely.
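    How these pieces interact can be sketched in a few lines. This is an illustrative model, not the controller's implementation, and allowed_scale_up is a made-up helper:

```python
import math

def allowed_scale_up(current, policies, select_policy="Max"):
    """Upper bound on replicas after one policy period.
    Each policy is ("Percent", value) or ("Pods", value)."""
    limits = []
    for kind, value in policies:
        if kind == "Percent":
            limits.append(current + math.ceil(current * value / 100))
        else:  # "Pods"
            limits.append(current + value)
    return max(limits) if select_policy == "Max" else min(limits)

# The scaleUp policies above: 100% or 4 pods per 15s, selectPolicy Max.
# From 6 replicas, doubling (to 12) beats adding 4 (to 10).
print(allowed_scale_up(6, [("Percent", 100), ("Pods", 4)]))  # 12
```

    With selectPolicy: Min, the same policies would cap the step at 10 replicas instead.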

    Practical Walkthrough: Setting Up HPA

    Let's walk through deploying an application with HPA enabled. We'll use a simple Node.js application as an example.

    Step 1: Create a Deployment

    First, create a deployment for your application (node:18-alpine is a placeholder here; substitute your own image). Note that a CPU Utilization target only works if the pod template declares CPU requests, for example via kubectl set resources deployment web-app --requests=cpu=100m; without requests, HPA reports the current metric as <unknown>:

    kubectl create deployment web-app --image=node:18-alpine --replicas=2 --port=3000

    Step 2: Expose the Application

    Create a service to expose your deployment:

    kubectl expose deployment web-app --type=LoadBalancer --port=80 --target-port=3000

    Step 3: Create the HPA

    Now create the HorizontalPodAutoscaler:

    kubectl autoscale deployment web-app --cpu-percent=70 --min=2 --max=10

    This command creates an HPA with a 70% CPU target, minimum 2 replicas, and maximum 10 replicas.

    Step 4: Verify HPA Status

    Check the HPA status:

    kubectl get hpa

    You should see output like:

    NAME       REFERENCE               TARGETS         MINPODS   MAXPODS   REPLICAS
    web-app    Deployment/web-app      45%/70%         2         10        2

    The 45% value shows current CPU utilization, which is below the 70% target, so no scaling is needed.

    Step 5: Simulate Traffic

    Use a load testing tool like hey to simulate traffic:

    hey -z 30s -c 100 http://<your-service-ip>

    Watch the HPA response:

    kubectl get hpa -w

    You'll see the replica count increase as CPU utilization rises above 70%.

    Common HPA Patterns

    Pattern 1: Multi-Metric Scaling

    Scale based on both CPU and memory to handle different types of resource pressure:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: multi-metric-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 3
      maxReplicas: 15
      metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
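    When multiple metrics are configured, as above, HPA computes a replica proposal for each metric independently and acts on the largest of them. A minimal sketch (the function name and numbers are illustrative):

```python
import math

def multi_metric_replicas(current, observations):
    """observations: list of (current_value, target_value) pairs,
    one per configured metric. HPA uses the largest proposal."""
    return max(math.ceil(current * cur / tgt) for cur, tgt in observations)

# 5 replicas: CPU at 60% (target 70), memory at 90% (target 80).
# Memory drives the decision, so the result is 6.
print(multi_metric_replicas(5, [(60, 70), (90, 80)]))  # 6
```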

    Pattern 2: Target Average Value

    Scale based on absolute metric values rather than percentages:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: absolute-value-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 10
      metrics:
      - type: Pods
        pods:
          metric:
            name: http_requests_per_second
          target:
            type: AverageValue
            averageValue: "500"

    Pattern 3: Cool-Down Periods

    Prevent rapid scaling fluctuations with appropriate cooldown periods:

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: cooldown-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: app
      minReplicas: 2
      maxReplicas: 8
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 120
          policies:
          - type: Percent
            value: 200
            periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 600
          policies:
          - type: Percent
            value: 25
            periodSeconds: 120

    Limitations and Best Practices

    Limitations

    1. No Vertical Scaling: HPA only scales horizontally (adds/removes pods). For vertical scaling (increasing pod resources), use Vertical Pod Autoscaler (VPA).
    2. Metric Collection Delay: Metrics are collected and evaluated on a roughly 15-second cadence by default, so HPA reacts to sudden spikes with some lag, and performance can degrade briefly before new pods come online.
    3. Stateful Applications: HPA works best with stateless applications. Stateful applications may require additional coordination.
    4. Startup Time: New pods take time to start and reach steady state, which can delay scaling responses.

    Best Practices

    1. Set Appropriate Min/Max Values: Don't set maxReplicas arbitrarily high. Choose an upper limit your infrastructure can actually schedule.
    2. Use Stabilization Windows: Always configure stabilization windows to prevent scaling oscillations.
    3. Monitor HPA Performance: Regularly check HPA metrics and adjust targets based on actual application behavior.
    4. Combine with Vertical Scaling: Use VPA for resource allocation and HPA for replica count management.
    5. Test Scaling Behavior: Simulate traffic spikes in staging environments to verify HPA behavior before production deployment.

    Troubleshooting HPA

    Issue: HPA Not Scaling

    Symptoms: CPU utilization exceeds target but replica count doesn't increase.

    Common Causes:

    • Metrics server not installed or not collecting metrics
    • Pods missing CPU/memory resource requests (Utilization targets show <unknown>)
    • HPA not properly configured
    • Min replicas equals max replicas
    • Application not exposing the custom metric

    Solutions:

    # Check if metrics server is installed
    kubectl get pods -n kube-system | grep metrics-server
     
    # Verify HPA configuration
    kubectl describe hpa <hpa-name>
     
    # Check metrics server metrics
    kubectl top pods

    Issue: Scaling Oscillations

    Symptoms: Replica count rapidly increases and decreases.

    Solutions:

    • Increase stabilization window seconds
    • Adjust scaling policies (reduce scaling rate)
    • Check if metrics are fluctuating due to external factors

    Issue: Pods Terminating Immediately

    Symptoms: New pods are created but immediately terminated.

    Solutions:

    • Check application logs for startup errors
    • Verify resource requests and limits are appropriate
    • Ensure the application can run within the pod's resource allocation

    Conclusion

    The Horizontal Pod Autoscaler is a powerful tool for maintaining application performance under varying load conditions. By automatically scaling pods based on CPU, memory, or custom metrics, HPA ensures your application remains responsive during traffic spikes while avoiding unnecessary resource consumption during quiet periods.

    Remember that HPA is most effective when combined with proper monitoring, appropriate scaling policies, and realistic target values. Start with conservative settings and adjust based on your application's behavior and your infrastructure's capacity.

    Platforms like ServerlessBase simplify the deployment and management of Kubernetes applications, including HPA configuration, making it easier to implement autoscaling strategies without deep Kubernetes expertise.

    Next Steps

    1. Install Metrics Server: Ensure the metrics server is installed in your cluster
    2. Configure HPA: Apply HPA configurations to your existing deployments
    3. Monitor Performance: Use monitoring tools to track HPA behavior and application performance
    4. Tune Parameters: Adjust scaling policies and targets based on observed behavior
    5. Test Under Load: Simulate traffic spikes to verify HPA responds appropriately
