ServerlessBase Blog

    Introduction to Kubernetes Scheduling and Node Affinity

    Learn how Kubernetes schedules pods across nodes and uses node affinity rules to control pod placement for optimal resource utilization and performance.

    You've deployed your first Kubernetes cluster, and your application is running. But have you ever wondered how Kubernetes decides which node to place each pod on? The default scheduler does a decent job, but real-world applications often have specific requirements that the default behavior doesn't handle. Maybe you need to place a database pod on a node with high memory, or keep a GPU-intensive application on nodes with dedicated hardware. This is where scheduling and node affinity come into play.

    In this article, you'll learn how Kubernetes scheduling works under the hood, why it matters for your applications, and how to use node affinity to control pod placement with precision. Understanding these concepts will help you optimize resource utilization, improve performance, and avoid common scheduling pitfalls that can lead to application instability.

    How Kubernetes Scheduling Works

    Kubernetes uses a pluggable scheduler architecture. The default scheduler (kube-scheduler) is a single process that runs on the control plane and makes scheduling decisions for all pods that don't already have a node assigned. When a pod is created without a nodeName field, the scheduler evaluates all available nodes and selects the best one based on a set of predicates and priorities.
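    For example, a pod that sets spec.nodeName itself skips the scheduler entirely and is bound straight to the named node, with no predicate or priority checks. A minimal sketch (worker-1 is a hypothetical node name):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: manual-pod
spec:
  nodeName: worker-1   # binds directly to this node; the scheduler is bypassed
  containers:
  - name: nginx
    image: nginx
```

    This is rarely what you want in practice, because you lose all of the resource checks described below, but it shows where the scheduler fits into the pipeline.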

    Scheduling Predicates

    Predicates are rules that determine whether a pod can run on a given node. (In recent Kubernetes versions these checks are implemented as filter plugins in the scheduling framework, but the classic predicate/priority terminology still describes the flow well.) The scheduler checks each predicate in turn, and if any predicate fails, the node is eliminated from consideration. Common predicates include:

    • PodFitsResources: Ensures the node has sufficient CPU and memory resources
    • PodFitsHostPorts: Checks if the node has available host ports
    • MatchNodeSelector: Verifies the node matches the pod's node selector
    • HostName: Confirms the node matches the pod's nodeName field

    Scheduling Priorities

    Once predicates filter the candidate nodes, priorities are applied to rank them. Higher-scoring nodes are preferred. Common priorities include:

    • LeastRequestedPriority: Prefers nodes with fewer allocated resources
    • ImageLocalityPriority: Prefers nodes that already have the pod's images
    • InterPodAffinityPriority: Considers pod affinity and anti-affinity rules

    The scheduler selects the node with the highest total priority score. If multiple nodes tie for the top score, it picks one of them at random.
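    The overall filter-then-score flow can be sketched in a few lines of Python. This is a simplified illustration, not the actual kube-scheduler implementation; the node and pod dictionaries and the two example functions are invented stand-ins for real predicates and priorities:

```python
# Simplified sketch of the filter-then-score scheduling flow.

def fits_resources(node, pod):
    # Predicate: node must have enough free CPU and memory.
    return node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]

def least_requested(node, pod):
    # Priority: prefer nodes with more capacity left over after placement.
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(pod, nodes, predicates, priorities):
    # 1. Filter: drop every node that fails any predicate.
    feasible = [n for n in nodes if all(p(n, pod) for p in predicates)]
    if not feasible:
        return None  # no feasible node: the pod stays Pending
    # 2. Score: sum priority scores and pick the highest-scoring node.
    return max(feasible, key=lambda n: sum(f(n, pod) for f in priorities))

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
]
pod = {"cpu": 4, "mem": 8}
chosen = schedule(pod, nodes, [fits_resources], [least_requested])
print(chosen["name"])  # node-b (node-a fails the resource predicate)
```

    The real scheduler runs many predicates and priorities as pluggable plugins, but the two-phase structure is the same.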

    Understanding Node Affinity

    Node affinity is a Kubernetes feature that allows you to influence which nodes a pod is scheduled on. It's similar to node selectors but more powerful, offering both required and preferred rules.

    Node Selector vs Node Affinity

    Node selectors are simple key-value pairs that require exact matches. If a pod has a node selector, the pod can only be scheduled on nodes that have all the specified labels.

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx-pod
    spec:
      nodeSelector:
        disktype: ssd
      containers:
      - name: nginx
        image: nginx

    Node affinity, on the other hand, supports operators like In, NotIn, Exists, DoesNotExist, Gt, and Lt, giving you much more control over pod placement.
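    For instance, a single matchExpressions block can combine several of these operators. The label keys below (environment, cpu-count) are illustrative, not standard Kubernetes labels:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: Exists        # label must be present, any value
        - key: environment
          operator: NotIn         # label value must not be in the list
          values:
          - staging
        - key: cpu-count
          operator: Gt            # label value compared as an integer
          values:
          - "8"
```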

    Required Node Affinity

    Required node affinity rules are similar to node selectors but more flexible. If a pod specifies required affinity, the scheduler will only consider nodes that satisfy all the affinity rules.

    apiVersion: v1
    kind: Pod
    metadata:
      name: database-pod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: disktype
                operator: In
                values:
                - ssd
              - key: hardware
                operator: In
                values:
                - gpu
      containers:
      - name: postgres
        image: postgres:15

    In this example, the pod will only be scheduled on nodes that have both disktype=ssd and hardware=gpu labels. If no such node exists, the pod will remain in Pending state until one becomes available. (The IgnoredDuringExecution suffix means that if a node's labels change after the pod is scheduled, the pod keeps running there; the rule is only evaluated at scheduling time.)

    Node Selector Terms

    The entries in the nodeSelectorTerms list are ORed together: the pod can be scheduled on any node that satisfies at least one term. Within a single term, the matchExpressions are ANDed: the node must satisfy all of them.

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: zone
              operator: In
              values:
              - us-east-1a
              - us-east-1b
          - matchExpressions:
            - key: instance-type
              operator: In
              values:
              - large

    This pod can be scheduled on any node in zone us-east-1a or us-east-1b, or on any node with instance type large; because the two terms are ORed, a node only needs to satisfy one of them, not both.

    Preferred Node Affinity

    Preferred node affinity rules are soft constraints. The scheduler tries to place the pod on a node that satisfies them, but if no node does, the pod is still scheduled on a node that meets only the required rules (or on any feasible node, if no required rules are set).

    apiVersion: v1
    kind: Pod
    metadata:
      name: cache-pod
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cache
                operator: Exists
          - weight: 50
            preference:
              matchExpressions:
              - key: dedicated
                operator: In
                values:
                - redis
      containers:
      - name: redis
        image: redis:7-alpine

    This pod prefers nodes with the cache label (weight 100) and nodes with the dedicated=redis label (weight 50). The scheduler will assign higher priority to nodes matching the first preference.

    Weight Values

    Weights range from 1 to 100. The scheduler calculates a total score for each node by summing the weights of all matching preferences. A node matching the cache label gets 100 points, while a node matching both cache and dedicated=redis gets 150 points.
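    That additive scoring can be illustrated with a small Python helper. This is a simplification: the real scheduler also normalizes these scores and combines them with other priorities, and only the In and Exists operators are modeled here:

```python
def preferred_affinity_score(node_labels, preferences):
    # Sum the weights of every preference that matches the node's labels.
    # Each preference is (weight, label_key, allowed_values or None).
    score = 0
    for weight, key, values in preferences:
        if values is None:                     # "Exists": key just has to be present
            if key in node_labels:
                score += weight
        elif node_labels.get(key) in values:   # "In": value must be in the list
            score += weight
    return score

# Preferences from the cache-pod example: cache (weight 100), dedicated=redis (weight 50).
prefs = [(100, "cache", None), (50, "dedicated", ["redis"])]
print(preferred_affinity_score({"cache": "true"}, prefs))                        # 100
print(preferred_affinity_score({"cache": "true", "dedicated": "redis"}, prefs))  # 150
```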

    Practical Example: Database Pod Scheduling

    Let's walk through a real-world scenario where node affinity is essential. You're deploying a PostgreSQL database that requires:

    1. High memory (at least 8GB)
    2. SSD storage (not HDD)
    3. Placement on dedicated database nodes

    apiVersion: v1
    kind: Pod
    metadata:
      name: postgres-pod
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: storage
                operator: In
                values:
                - ssd
              - key: role
                operator: In
                values:
                - database
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 80
            preference:
              matchExpressions:
              - key: memory
                operator: Gt
                values:
                - "16384"
      containers:
      - name: postgres
        image: postgres:15
        resources:
          requests:
            memory: "8Gi"
            cpu: "2"
          limits:
            memory: "16Gi"
            cpu: "4"
        env:
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgres-secret
              key: password

    This configuration ensures the database runs only on nodes labeled with SSD storage and the database role, while preferring nodes whose memory label is greater than 16384. Note that the Gt and Lt operators compare the node label's value parsed as an integer, so this only works if your nodes carry a numeric label (here assumed to be memory in MiB); the operators do not inspect the node's actual capacity.

    Node Affinity vs Pod Affinity

    It's important to distinguish between node affinity and pod affinity. Node affinity controls which nodes a pod can run on, while pod affinity controls which pods should be co-located on the same node.

    Pod Affinity Example

    apiVersion: v1
    kind: Pod
    metadata:
      name: app-pod
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - database
            topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: myapp:1.0

    This pod will only be scheduled on nodes that already have a pod with app=database label. This is useful for keeping related workloads close together.
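    The inverse, podAntiAffinity, pushes pods apart instead. A common pattern is keeping replicas of the same application off the same node; a sketch, assuming the replicas carry an app=web label:

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - web
      topologyKey: kubernetes.io/hostname   # at most one matching pod per node
```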

    When to Use Which

    Use node affinity when you need to control node characteristics (hardware, location, labels). Use pod affinity when you need to control pod placement relative to other pods (database and application co-location).

    Common Scheduling Patterns

    1. Dedicated Nodes for Critical Workloads

    Create dedicated nodes for high-priority applications:

    apiVersion: v1
    kind: Pod
    metadata:
      name: critical-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: priority
                operator: In
                values:
                - high
      containers:
      - name: app
        image: critical-app:1.0

    2. GPU-Accelerated Workloads

    Schedule GPU pods on nodes with dedicated hardware:

    apiVersion: v1
    kind: Pod
    metadata:
      name: ml-training
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: accelerator
                operator: In
                values:
                - nvidia-gpu
      containers:
      - name: training
        image: tensorflow:2.12
        resources:
          limits:
            nvidia.com/gpu: 4

    3. Geographic Distribution

    Schedule pods in specific availability zones:

    apiVersion: v1
    kind: Pod
    metadata:
      name: geo-app
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - us-east-1a
                - us-east-1b
      containers:
      - name: app
        image: geo-app:1.0

    Troubleshooting Scheduling Issues

    Pod Stays in Pending State

    If a pod remains in Pending state, check the events:

    kubectl describe pod <pod-name>

    Look for messages like:

    • 0/3 nodes are available: 3 Insufficient cpu.
    • 0/3 nodes are available: 3 node(s) didn't match node affinity.
    • 0/3 nodes are available: 3 node(s) had taint {key: value}, that the pod didn't tolerate.

    Insufficient Resources

    If the pod can't find a node with enough resources, check node capacity:

    kubectl describe nodes

    Consider increasing node sizes or adding more nodes to your cluster.

    Node Affinity Not Working

    Verify that nodes have the required labels:

    kubectl get nodes --show-labels

    If labels are missing, add them with kubectl label:

    kubectl label nodes <node-name> disktype=ssd
    kubectl label nodes <node-name> role=database

    Best Practices

    1. Use Required Affinity for Critical Constraints

    For requirements that must be met (like hardware specifications), use requiredDuringSchedulingIgnoredDuringExecution. This prevents pods from being scheduled on incompatible nodes.

    2. Prefer Lightweight Constraints

    Avoid overly specific affinity rules that shrink the pool of eligible nodes. For example, if any node with at least 8GB of memory will do, use a Gt expression rather than an exact match so that larger nodes also qualify.

    3. Combine Affinity Rules

    Use both required and preferred affinity to create flexible scheduling policies. Required rules ensure compatibility, while preferred rules optimize placement.

    4. Document Your Scheduling Strategy

    Document the labels and affinity rules you use so other team members understand pod placement decisions.

    5. Monitor Scheduling Performance

    Use metrics like pod scheduling duration and node utilization to evaluate if your scheduling strategy is effective.

    Conclusion

    Kubernetes scheduling and node affinity give you fine-grained control over pod placement, enabling you to optimize resource utilization, improve performance, and ensure your applications run on appropriate hardware. By understanding predicates, priorities, and affinity rules, you can design scheduling strategies that meet your application's specific requirements.

    The key takeaways are:

    • Use node selectors for simple exact matches
    • Use node affinity for flexible, powerful pod placement rules
    • Combine required and preferred affinity for optimal scheduling
    • Monitor pod status and events to troubleshoot scheduling issues

    As you scale your Kubernetes deployments, these concepts become increasingly important. Platforms like ServerlessBase simplify deployment management and can help you implement these scheduling strategies more easily by providing intuitive interfaces for configuring node labels and pod placement rules.

    Next Steps

    Now that you understand Kubernetes scheduling and node affinity, consider exploring related concepts:

    • Pod Disruption Budgets: Ensure high availability during node maintenance
    • Taints and Tolerations: Control pod placement on tainted nodes
    • Cluster Autoscaling: Automatically add nodes based on resource requests
    • Horizontal Pod Autoscaler: Scale pods based on CPU, memory, or custom metrics

    Experiment with these features in your development cluster to build a deeper understanding of Kubernetes scheduling mechanics.
