ServerlessBase Blog

    Introduction to Kubernetes Scheduling and Node Affinity

    Learn how Kubernetes schedules pods across nodes and uses node affinity rules to control pod placement for optimal resource utilization and performance.

    You've deployed your first Kubernetes cluster, and your application is running. But have you ever wondered how Kubernetes decides which node to place each pod on? The default scheduler does a decent job, but real-world applications often have specific requirements that the default behavior doesn't handle. Maybe you need to place a database pod on a node with high memory, or keep a GPU-intensive application on nodes with dedicated hardware. This is where scheduling and node affinity come into play.

    In this article, you'll learn how Kubernetes scheduling works under the hood, why it matters for your applications, and how to use node affinity to control pod placement with precision. Understanding these concepts will help you optimize resource utilization, improve performance, and avoid common scheduling pitfalls that can lead to application instability.

    How Kubernetes Scheduling Works

    Kubernetes uses a pluggable scheduler architecture. The default scheduler (kube-scheduler) is a single process that runs on the control plane and makes scheduling decisions for all pods that don't already have a node assigned. When a pod is created without a nodeName field, the scheduler evaluates all available nodes and selects the best one based on a set of predicates and priorities.
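    For example, a pod that sets spec.nodeName itself skips the scheduler entirely and is bound straight to the named node, with no predicate or priority checks. A minimal sketch (worker-1 is a hypothetical node name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1   # binds directly to this node; the scheduler is bypassed
  containers:
  - name: nginx
    image: nginx
```

    This is rarely what you want in practice, because you lose all of the resource checks described below, but it shows where the scheduler fits into the pipeline.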

    Scheduling Predicates

    Predicates are rules that determine whether a pod can run on a given node. (In recent Kubernetes versions these checks are implemented as filter plugins in the scheduling framework, but the classic predicate/priority terminology still describes the flow well.) The scheduler checks each predicate in turn, and if any predicate fails, the node is eliminated from consideration. Common predicates include:

    • PodFitsResources: Ensures the node has sufficient CPU and memory resources
    • PodFitsHostPorts: Checks if the node has available host ports
    • MatchNodeSelector: Verifies the node matches the pod's node selector
    • HostName: Confirms the node matches the pod's nodeName field

    Scheduling Priorities

    Once predicates filter the candidate nodes, priorities are applied to rank them. Higher-scoring nodes are preferred. Common priorities include:

    • LeastRequestedPriority: Prefers nodes with fewer allocated resources
    • ImageLocalityPriority: Prefers nodes that already have the pod's images
    • InterPodAffinityPriority: Considers pod affinity and anti-affinity rules

    The scheduler selects the node with the highest total priority score. If multiple nodes tie for the top score, it picks one of them at random.
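    The overall filter-then-score flow can be sketched in a few lines of Python. This is a simplified illustration, not the actual kube-scheduler implementation; the node and pod dictionaries and the two example functions are invented stand-ins for real predicates and priorities:

```python
# Simplified sketch of the filter-then-score scheduling flow.

def fits_resources(node, pod):
    # Predicate: node must have enough free CPU and memory.
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def least_requested(node, pod):
    # Priority: prefer nodes with more capacity left over after placement.
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(pod, nodes, predicates, priorities):
    # 1. Filter: drop every node that fails any predicate.
    feasible = [n for n in nodes if all(p(n, pod) for p in predicates)]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # 2. Score: sum priority scores and pick the highest-scoring node.
    return max(feasible, key=lambda n: sum(f(n, pod) for f in priorities))

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
]
pod = {"cpu": 4, "mem": 8}
chosen = schedule(pod, nodes, [fits_resources], [least_requested])
print(chosen["name"])  # node-b (node-a fails the resource predicate)
```

    The real scheduler runs many predicates and priorities as pluggable plugins, but the two-phase structure is the same.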

    Understanding Node Affinity

    Node affinity is a Kubernetes feature that allows you to influence which nodes a pod is scheduled on. It's similar to node selectors but more powerful, offering both required and preferred rules.

    Node Selector vs Node Affinity

    Node selectors are simple key-value pairs that require exact matches. If a pod has a node selector, the pod can only be scheduled on nodes that have all the specified labels.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx-pod
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx

    Node affinity, on the other hand, supports operators like In, NotIn, Exists, DoesNotExist, Gt, and Lt, giving you much more control over pod placement.
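    For instance, a single matchExpressions block can combine several of these operators. The label keys below (environment, cpu-count) are illustrative, not standard Kubernetes labels:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: Exists        # label must be present, any value
        - key: environment
          operator: NotIn         # label value must not be in the list
          values:
          - staging
        - key: cpu-count
          operator: Gt            # label value compared as an integer
          values:
          - "8"
```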

    Required Node Affinity

    Required node affinity rules are similar to node selectors but more flexible. If a pod specifies required affinity, the scheduler will only consider nodes that satisfy all the affinity rules.

    apiVersion: v1
    kind: Pod
    metadata:
      name: database-pod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
              - key: hardware
                operator: In
                values:
                - gpu
      containers:
      - name: postgres
        image: postgres:15

    In this example, the pod will only be scheduled on nodes that have both disktype=ssd and hardware=gpu labels. If no such node exists, the pod will remain in Pending state until one becomes available. (The IgnoredDuringExecution suffix means that if a node's labels change after the pod is scheduled, the pod keeps running there; the rule is only evaluated at scheduling time.)

    Node Selector Terms

    The entries in the nodeSelectorTerms list are ORed together: the pod can be scheduled on any node that satisfies at least one term. Within a single term, the matchExpressions are ANDed: the node must satisfy all of them.

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: zone
              operator: In
              values:
              - us-east-1a
              - us-east-1b
          - matchExpressions:
            - key: instance-type
              operator: In
              values:
              - large

    This pod can be scheduled on any node in zone us-east-1a or us-east-1b, or on any node with instance type large; because the two terms are ORed, a node only needs to satisfy one of them, not both.

    Preferred Node Affinity

    Preferred node affinity rules are soft constraints. The scheduler tries to place the pod on a node that satisfies them, but if no node does, the pod is still scheduled on a node that meets only the required rules (or on any feasible node, if no required rules are set).

    apiVersion: v1
    kind: Pod
    metadata:
      name: cache-pod
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cache
                operator: Exists
          - weight: 50
            preference:
              matchExpressions:
              - key: dedicated
                operator: In
                values:
                - redis
      containers:
      - name: redis
        image: redis:7-alpine

    This pod prefers nodes with the cache label (weight 100) and nodes with the dedicated=redis label (weight 50). The scheduler will assign higher priority to nodes matching the first preference.

    Weight Values

    Weights range from 1 to 100. The scheduler calculates a total score for each node by summing the weights of all matching preferences. A node matching the cache label gets 100 points, while a node matching both cache and dedicated=redis gets 150 points.
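    That additive scoring can be illustrated with a small Python helper. This is a simplification: the real scheduler also normalizes these scores and combines them with other priorities, and only the In and Exists operators are modeled here:

```python
def preferred_affinity_score(node_labels, preferences):
    # Sum the weights of every preference that matches the node's labels.
    # Each preference is (weight, label_key, allowed_values or None).
    score = 0
    for weight, key, values in preferences:
        if values is None:                     # "Exists": key just has to be present
            if key in node_labels:
                score += weight
        elif node_labels.get(key) in values:   # "In": value must be in the list
            score += weight
    return score

# Preferences from the cache-pod example: cache (weight 100), dedicated=redis (weight 50).
prefs = [(100, "cache", None), (50, "dedicated", ["redis"])]
print(preferred_affinity_score({"cache": "true"}, prefs))                        # 100
print(preferred_affinity_score({"cache": "true", "dedicated": "redis"}, prefs))  # 150
```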

    Practical Example: Database Pod Scheduling

    Let's walk through a real-world scenario where node affinity is essential. You're deploying a PostgreSQL database that requires:

    1. High memory (at least 8GB)
    2. SSD storage (not HDD)
    3. Placement on dedicated database nodes

    apiVersion: v1
    kind: Pod
    metadata:
      name: postgres-pod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: storage
                operator: In
                values:
                - ssd
              - key: role
                operator: In
                values:
                - database
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: memory
                operator: Gt
                values:
                - "16384"
      containers:
      - name: postgres
        image: postgres:15
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
          limits:
            memory: "16Gi"
            cpu: "4"
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password

    This configuration ensures the database runs only on nodes labeled with SSD storage and the database role, while preferring nodes whose memory label is greater than 16384. Note that the Gt and Lt operators compare the node label's value parsed as an integer, so this only works if your nodes carry a numeric label (here assumed to be memory in MiB); the operators do not inspect the node's actual capacity.

    Node Affinity vs Pod Affinity

    It's important to distinguish between node affinity and pod affinity. Node affinity controls which nodes a pod can run on, while pod affinity controls which pods should be co-located on the same node.

    Pod Affinity Example

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-pod
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: myapp:1.0

    This pod will only be scheduled on nodes that already have a pod with app=database label. This is useful for keeping related workloads close together.
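    The inverse, podAntiAffinity, pushes pods apart instead. A common pattern is keeping replicas of the same application off the same node; a sketch, assuming the replicas carry an app=web label:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      topologyKey: kubernetes.io/hostname   # at most one matching pod per node
```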

    When to Use Which

    Use node affinity when you need to control node characteristics (hardware, location, labels). Use pod affinity when you need to control pod placement relative to other pods (database and application co-location).

    Common Scheduling Patterns

    1. Dedicated Nodes for Critical Workloads

    Create dedicated nodes for high-priority applications:

    apiVersion: v1
    kind: Pod
    metadata:
      name: critical-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: priority
                operator: In
                values:
                - high
      containers:
      - name: app
        image: critical-app:1.0

    2. GPU-Accelerated Workloads

    Schedule GPU pods on nodes with dedicated hardware:

    apiVersion: v1
    kind: Pod
    metadata:
      name: ml-training
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-gpu
      containers:
      - name: training
        image: tensorflow:2.12
        resources:
          limits:
            nvidia.com/gpu: 4

    3. Geographic Distribution

    Schedule pods in specific availability zones:

    apiVersion: v1
    kind: Pod
    metadata:
      name: geo-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
                - us-east-1b
      containers:
      - name: app
        image: geo-app:1.0

    Troubleshooting Scheduling Issues

    Pod Stays in Pending State

    If a pod remains in Pending state, check the events:

    kubectl describe pod <pod-name>

    Look for messages like:

    • 0/3 nodes are available: 3 Insufficient cpu.
    • 0/3 nodes are available: 3 node(s) didn't match node affinity.
    • 0/3 nodes are available: 3 node(s) had taint {key: value}, that the pod didn't tolerate.

    Insufficient Resources

    If the pod can't find a node with enough resources, check node capacity:

    kubectl describe nodes

    Consider increasing node sizes or adding more nodes to your cluster.

    Node Affinity Not Working

    Verify that nodes have the required labels:

    kubectl get nodes --show-labels

    If labels are missing, add them with kubectl label:

    kubectl label nodes <node-name> disktype=ssd
    kubectl label nodes <node-name> role=database

    Best Practices

    1. Use Required Affinity for Critical Constraints

    For requirements that must be met (like hardware specifications), use requiredDuringSchedulingIgnoredDuringExecution. This prevents pods from being scheduled on incompatible nodes.

    2. Prefer Lightweight Constraints

    Avoid overly specific affinity rules that shrink the pool of eligible nodes. For example, if any node with at least 8GB of memory will do, use a Gt expression rather than an exact match so that larger nodes also qualify.

    3. Combine Affinity Rules

    Use both required and preferred affinity to create flexible scheduling policies. Required rules ensure compatibility, while preferred rules optimize placement.

    4. Document Your Scheduling Strategy

    Document the labels and affinity rules you use so other team members understand pod placement decisions.

    5. Monitor Scheduling Performance

    Use metrics like pod scheduling duration and node utilization to evaluate if your scheduling strategy is effective.

    Conclusion

    Kubernetes scheduling and node affinity give you fine-grained control over pod placement, enabling you to optimize resource utilization, improve performance, and ensure your applications run on appropriate hardware. By understanding predicates, priorities, and affinity rules, you can design scheduling strategies that meet your application's specific requirements.

    The key takeaways are:

    • Use node selectors for simple exact matches
    • Use node affinity for flexible, powerful pod placement rules
    • Combine required and preferred affinity for optimal scheduling
    • Monitor pod status and events to troubleshoot scheduling issues

    As you scale your Kubernetes deployments, these concepts become increasingly important. Platforms like ServerlessBase simplify deployment management and can help you implement these scheduling strategies more easily by providing intuitive interfaces for configuring node labels and pod placement rules.

    Next Steps

    Now that you understand Kubernetes scheduling and node affinity, consider exploring related concepts:

    • Pod Disruption Budgets: Ensure high availability during node maintenance
    • Taints and Tolerations: Control pod placement on tainted nodes
    • Cluster Autoscaling: Automatically add nodes based on resource requests
    • Horizontal Pod Autoscaler: Scale pods based on CPU, memory, or custom metrics

    Experiment with these features in your development cluster to build a deeper understanding of Kubernetes scheduling mechanics.
