ServerlessBase Blog


    Understanding Container Namespaces and Cgroups

    You've probably heard that containers are lightweight because they share the host kernel. But have you actually thought about how that works? How does a single Linux kernel manage to run dozens of isolated processes without them stepping on each other's toes? The answer lies in two fundamental Linux kernel features: namespaces and control groups (cgroups).

    What Are Namespaces?

    Think of namespaces as a set of rules that define what a process can see. When you run a container, the runtime asks the kernel to create a set of new namespaces (one per resource type) for the container's process, and that process sees the system only through them.

    The Six Types of Namespaces

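    This article covers the six namespace types that containers have relied on the longest, each isolating a different aspect of the system. Newer kernels also provide cgroup and time namespaces, which are not covered here.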

    Namespace         | What It Isolates            | Why It Matters
    Mount namespace   | File system mounts          | Processes see only their own mounted filesystems
    Network namespace | Network interfaces          | Each container gets its own network stack
    PID namespace     | Process IDs                 | Processes in one container see different PIDs than the host
    UTS namespace     | Hostname and domain name    | Containers can have their own names
    IPC namespace     | Inter-process communication | Containers use separate System V IPC and POSIX message queues
    User namespace    | User and group IDs          | Processes can run with different user permissions
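To see these namespaces on a running system, you can read the symbolic links under /proc/<pid>/ns. The sketch below is illustrative only: ns_inode is a hypothetical helper name, and the loop assumes a Linux host with /proc mounted (it does nothing elsewhere).

```shell
# Extract the numeric namespace ID from a link target such as "pid:[4026531836]".
ns_inode() {
  printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Print "<type> <id>" for every namespace the current shell belongs to.
# The -L guard makes the loop a no-op on systems without /proc/<pid>/ns.
for link in /proc/$$/ns/*; do
  [ -L "$link" ] || continue
  printf '%s %s\n' "${link##*/}" "$(ns_inode "$(readlink "$link")")"
done
```

Two processes are in the same namespace of a given type when these IDs match, so running the loop on the host and again inside a container should show different IDs for each isolated type.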

    Mount Namespaces in Action

    Mount namespaces are probably the most intuitive. When you start a container, the runtime creates a new mount namespace and makes the container image's root filesystem the process's / (typically with pivot_root). The process inside the container sees / as its root directory, even though the host system has its own filesystem structure.

    # Inside the container
    ls /
    bin  dev  etc  home  lib  lib64  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

    The listing shows the container image's root filesystem, not the host's. Host directories such as /boot or /media are not visible unless they are explicitly mounted into the container.
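One way to check which mounts a process can actually see is to read its mountinfo file, whose fifth field is the mount point. This is a minimal sketch: mount_points is a hypothetical helper, and the default path assumes a Linux host with /proc mounted.

```shell
# Print the mount point (field 5) of every entry in a mountinfo file.
# Defaults to the calling process's own view of its mount namespace.
mount_points() {
  awk '{ print $5 }' "${1:-/proc/self/mountinfo}"
}

# Report how many mounts this process can see; skipped where /proc is absent.
if [ -r /proc/self/mountinfo ]; then
  printf 'visible mounts: %s\n' "$(mount_points | wc -l | tr -d ' ')"
fi
```

Running mount_points on the host and inside a container should give two different lists, because each process reads the mount table of its own mount namespace.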

    Network Namespaces and the Network Stack

    Network namespaces give each container its own network stack. This means a container can have its own IP addresses, network interfaces, routing tables, and firewall rules. When you run ip addr inside a container, you'll see only the interfaces configured for that container, not the host's network interfaces.

    # Inside the container
    ip addr
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
        inet 127.0.0.1/8 scope host lo
           valid_lft forever preferred_lft forever
    2: eth0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
        link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
        inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
           valid_lft forever preferred_lft forever

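    The container has its own loopback interface and a virtual Ethernet interface (eth0). The @if3 suffix shows that eth0 is one end of a veth pair whose other end, interface index 3, lives in another network namespace (typically the host's, where it is attached to a bridge).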
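The @ifN suffix in the ip addr output can be used to find the other end of the veth pair. The helper below is a hypothetical sketch that only parses the interface name; the host-side lookup is shown as a comment because it depends on your system.

```shell
# Extract the peer interface index from a name such as "eth0@if3".
# Returns non-zero for names without an @ifN suffix (for example "lo").
peer_ifindex() {
  case $1 in
    *@if[0-9]*) printf '%s\n' "${1##*@if}" ;;
    *) return 1 ;;
  esac
}

peer_ifindex "eth0@if3"   # prints 3

# On the host, the interface with that index could then be listed with:
#   ip -o link | awk -F': ' -v idx=3 '$1 == idx { print $2 }'
```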

    PID Namespaces and Process Isolation

    PID namespaces are crucial for process isolation. When you run ps aux inside a container, you'll see only the processes running inside that container, not the host's processes. The first process in a container always has PID 1, even if that process is actually PID 42 on the host.

    # Inside the container
    ps aux
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         1  0.0  0.0   4160   920 ?        Ss   10:00   0:00 /bin/bash
    root        10  0.0  0.0   4160   920 ?        S    10:01   0:00 ps aux

    The container's shell has PID 1, and ps shows only the processes in that namespace.
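From the host, the NSpid line in /proc/<pid>/status shows a process's PID in each nested PID namespace, outermost first. The sketch below assumes a kernel new enough to expose NSpid; nspid is a hypothetical helper name.

```shell
# Print the PID chain from the NSpid line of a /proc/<pid>/status file.
# The first number is the PID in the outermost visible namespace,
# the last is the PID inside the innermost one.
nspid() {
  awk '$1 == "NSpid:" { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# For a process that is not in a nested PID namespace this prints one PID.
if [ -r /proc/self/status ]; then
  nspid /proc/self/status
fi
```

For the example above, running nspid /proc/42/status on the host would be expected to print both 42 and 1.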

    User Namespaces and Permission Isolation

    User namespaces allow processes to have different user and group IDs inside the container than on the host. This is important for security because a process can be root (UID 0) inside the container while the kernel maps it to an unprivileged user (for example, UID 1000) on the host.

    # Inside the container (running as root)
    id
    uid=0(root) gid=0(root) groups=0(root)
     
    # On the host (the actual user running the container)
    id
    uid=1000(user) gid=1000(user) groups=1000(user)

    The container thinks it's running as root, but on the host it's actually running as user 1000. If the process escapes the container, it has only that unprivileged user's permissions on the host.
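The mapping itself is visible in /proc/<pid>/uid_map, where each line reads "inside-start outside-start length". The following is a minimal sketch of how a lookup against that file could work; map_uid is a hypothetical helper, not part of any tool.

```shell
# Translate a UID inside a user namespace to the UID outside it,
# using uid_map lines of the form: <inside-start> <outside-start> <length>.
# Exits non-zero if the UID is not covered by any mapping line.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) exit 1 }
  ' "${2:-/proc/self/uid_map}"
}

# Show where UID 0 in the current namespace lands; skipped without /proc.
if [ -r /proc/self/uid_map ]; then
  map_uid 0 || echo "UID 0 is not mapped"
fi
```

With a map containing the line "0 1000 1", map_uid 0 prints 1000, which matches the id output shown above.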

    What Are Cgroups?

    While namespaces provide isolation, they don't control resource usage. A process in its own namespace could still consume all available CPU, memory, or disk I/O. That's where cgroups come in.

    The Purpose of Cgroups

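    Cgroups (control groups) limit, account for, and isolate the resource usage (CPU, memory, disk I/O) of a collection of processes. Network bandwidth is not limited by cgroups directly; the cgroup v1 net_cls and net_prio controllers only tag or prioritize traffic so that tools such as tc can shape it. Think of cgroups as a resource manager that ensures fair distribution and prevents one process from monopolizing system resources.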

    Controlling CPU Usage

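    Cgroups can limit CPU usage through the cpu controller (called a subsystem in cgroup v1). You set a quota and a period: the group may run for at most the quota, in microseconds, during every period. The examples in this article use the cgroup v1 layout, with one directory tree per controller under /sys/fs/cgroup. On hosts that use the unified cgroup v2 hierarchy, the same limits are set through files such as cpu.max, memory.max, and io.max.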

    # Create a cgroup for limiting CPU
    sudo mkdir -p /sys/fs/cgroup/cpu/my-container
     
    # Set the CPU quota to 50% (500ms per 1s period)
    echo 500000 | sudo tee /sys/fs/cgroup/cpu/my-container/cpu.cfs_quota_us
    echo 1000000 | sudo tee /sys/fs/cgroup/cpu/my-container/cpu.cfs_period_us
     
    # Move the current shell into the cgroup (its child processes inherit it)
    echo $$ | sudo tee /sys/fs/cgroup/cpu/my-container/tasks

    The processes in the cgroup are now limited to 50% of one CPU core. If they try to use more, the kernel throttles them until the next period begins.
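The quota is simply the period scaled by the share of a CPU you want to allow. As a small worked example (cpu_quota_us is a hypothetical helper, and percentages above 100 mean more than one core):

```shell
# quota_us = period_us * percent / 100, where 100 means one full CPU.
cpu_quota_us() {
  percent=$1
  period_us=${2:-100000}   # 100 ms is the kernel's default CFS period
  echo $(( period_us * percent / 100 ))
}

cpu_quota_us 50 1000000   # prints 500000, the value used above
cpu_quota_us 200          # prints 200000: two full CPUs per 100 ms period
```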

    Controlling Memory Usage

    Memory cgroups prevent processes from consuming excessive memory. You set a memory limit; when the group reaches it, the kernel first tries to reclaim memory from the group, and if that is not enough, the OOM killer kills a process in the group.

    # Create a memory cgroup
    sudo mkdir -p /sys/fs/cgroup/memory/my-container
     
    # Set a memory limit of 512MB
    echo 536870912 | sudo tee /sys/fs/cgroup/memory/my-container/memory.limit_in_bytes
     
    # Move the current shell into the cgroup (its child processes inherit it)
    echo $$ | sudo tee /sys/fs/cgroup/memory/my-container/tasks

    If the processes in the cgroup try to use more than 512 MiB and the kernel cannot reclaim enough memory, the OOM (out-of-memory) killer is triggered and kills a process in the cgroup.
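To confirm that a limit was actually hit, you can look for the oom_kill counter that the memory controller exposes. This sketch assumes a reasonably recent kernel: on cgroup v1 the counter is expected in memory.oom_control, and on cgroup v2 in memory.events. oom_kills is a hypothetical helper name.

```shell
# Print the number of OOM kills recorded in a cgroup event file.
# Matches only the exact key "oom_kill", not "oom_kill_disable".
oom_kills() {
  awk '$1 == "oom_kill" { print $2; found = 1 }
       END { if (!found) exit 1 }' "$1"
}

# Example invocations (paths depend on your cgroup layout):
#   oom_kills /sys/fs/cgroup/memory/my-container/memory.oom_control   # v1
#   oom_kills /sys/fs/cgroup/my-container/memory.events               # v2
```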

    Controlling Disk I/O

    Cgroups can also limit disk I/O, which is useful for preventing a single process from saturating the disk and affecting other processes.

    # Create an I/O cgroup
    sudo mkdir -p /sys/fs/cgroup/blkio/my-container
     
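    # Set read and write throughput limits in bytes per second.
    # The format is "<major>:<minor> <bytes_per_second>". 8:0 is assumed here
    # to be the target disk (often /dev/sda; check with lsblk).
    echo "8:0 1024" | sudo tee /sys/fs/cgroup/blkio/my-container/blkio.throttle.read_bps_device
    echo "8:0 1024" | sudo tee /sys/fs/cgroup/blkio/my-container/blkio.throttle.write_bps_device

    # Move the current shell into the cgroup (its child processes inherit it)
    echo $$ | sudo tee /sys/fs/cgroup/blkio/my-container/tasks

    The process will now be limited to 1 KiB per second of reads and 1 KiB per second of writes on that device.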
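The blkio throttle files identify a disk by its major:minor device numbers in decimal. One way to obtain them is from stat, which prints them in hexadecimal. The sketch below assumes GNU coreutils stat; both helper names are hypothetical, and /dev/sda in the usage comment is only a placeholder.

```shell
# Convert the hexadecimal major/minor numbers printed by
# `stat -c '%t %T'` into the decimal "major:minor" form blkio expects.
majmin_from_hex() {
  printf '%d:%d\n' "0x$1" "0x$2"
}

# Look up the numbers for a device node (GNU coreutils stat assumed).
dev_numbers() {
  hex=$(stat -c '%t %T' "$1") || return 1
  majmin_from_hex "${hex% *}" "${hex#* }"
}

# Example: limit reads on a disk to 1 MiB/s (device path is a placeholder):
#   echo "$(dev_numbers /dev/sda) 1048576" | \
#     sudo tee /sys/fs/cgroup/blkio/my-container/blkio.throttle.read_bps_device
```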

    How Namespaces and Cgroups Work Together

    Namespaces and cgroups work together to provide container isolation. Namespaces control what a process can see, while cgroups control how much of the host's resources it can use.

    The Container Startup Process

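    When you start a container, roughly the following happens (the exact order varies by runtime, and several namespaces are usually created in a single clone() or unshare() call):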

    1. The container runtime creates a new PID namespace for the container process
    2. The container runtime creates a new mount namespace and mounts the container's root filesystem
    3. The container runtime creates a new network namespace and configures virtual network interfaces
    4. The container runtime creates a new IPC namespace
    5. The container runtime creates a new UTS namespace
    6. The container runtime creates a new user namespace (if configured)
    7. The container runtime creates cgroups to limit CPU, memory, and I/O usage
    8. The container process is started inside the new namespaces and placed in the cgroups
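The namespace steps above can be tried by hand with the unshare tool from util-linux. The sketch below only builds the flag list, so it is safe to run anywhere; ns_flags is a hypothetical helper, and the actual unshare call is left as a comment because it usually needs root.

```shell
# Map namespace names from the steps above to util-linux unshare flags.
ns_flags() {
  flags=""
  for ns in "$@"; do
    case $ns in
      pid)   new="--pid --fork --mount-proc" ;;  # new PID ns with a fresh /proc
      mount) new="--mount" ;;
      net)   new="--net" ;;
      ipc)   new="--ipc" ;;
      uts)   new="--uts" ;;
      user)  new="--user --map-root-user" ;;     # map the caller to root inside
      *) echo "unknown namespace: $ns" >&2; return 1 ;;
    esac
    flags="${flags:+$flags }$new"
  done
  printf '%s\n' "$flags"
}

ns_flags pid mount net ipc uts

# Example (needs root, or add "user" where unprivileged user namespaces are enabled):
#   sudo unshare $(ns_flags pid mount net ipc uts) sh -c 'hostname demo; hostname; ps aux'
```

This covers only the namespace half of the list. A real runtime also prepares the root filesystem, configures networking, and applies cgroup limits before starting the process.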

    The Container Runtime's Role

    Container runtimes like Docker and Podman are responsible for creating and managing namespaces and cgroups. When you run docker run, the runtime:

    1. Creates a new network namespace and configures a virtual bridge network
    2. Creates a new mount namespace and mounts the container image's filesystem
    3. Creates cgroups to limit resource usage
    4. Sets up the container's environment variables and command
    5. Starts the container process

    Common Use Cases

    Isolating Development Environments

    Developers often use containers to isolate their development environments. Each developer can have their own container with their own dependencies, without affecting other developers or the host system.

    Resource Management in Multi-Tenant Environments

    In multi-tenant environments, cgroups ensure that one tenant's applications don't consume all available resources. You can create different cgroups for different tenants and set resource limits accordingly.

    Testing Resource Limits

    Developers can use containers to test how their applications behave under resource constraints. By setting CPU and memory limits, you can simulate production conditions and identify potential issues.

    Security Isolation

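    Namespaces provide a layer of security by isolating processes. If a process in a container is compromised, it cannot directly see or signal processes, mounts, or network interfaces outside its namespaces, which makes it much harder to reach the host or other containers. This isolation is not absolute, because all containers still share the host kernel.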

    Limitations and Challenges

    Shared Kernel

    Containers share the host kernel, which means kernel vulnerabilities can affect all containers. This is why it's important to keep the host kernel updated and use security features like SELinux and AppArmor.

    Namespace Limitations

    Not all system resources are namespaced. For example, the system clock (CLOCK_REALTIME), loaded kernel modules, and many kernel tunables under /proc/sys are global to the host. This means a container can still observe, and with enough privileges change, some host-wide state.

    Cgroup Limitations

    CPU and memory limits alone don't prevent every kind of resource exhaustion. For example, a process that forks endlessly (a fork bomb) can exhaust the host's process table even when its CPU and memory are capped. To stop this you also need a process-count limit, such as the pids cgroup controller (pids.max) or a ulimit on the number of processes.
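As an illustration of a process-count limit, the pids controller exposes pids.max and pids.current in each cgroup directory. The helper below is a hypothetical sketch that reports how many more processes a group may create; the cgroup path in the comment follows the v1 layout used earlier and is an assumption about your system.

```shell
# Report how many more processes a cgroup may create before hitting pids.max.
# Prints "unlimited" when no limit is set (pids.max contains the word "max").
pids_headroom() {
  dir=$1
  max=$(cat "$dir/pids.max") || return 1
  cur=$(cat "$dir/pids.current") || return 1
  if [ "$max" = "max" ]; then
    echo "unlimited"
  else
    echo $(( max - cur ))
  fi
}

# Example: cap a group at 100 processes, then check the headroom:
#   echo 100 | sudo tee /sys/fs/cgroup/pids/my-container/pids.max
#   pids_headroom /sys/fs/cgroup/pids/my-container
```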

    Performance Overhead

    Namespaces and cgroups add some overhead to process creation and resource management. However, this overhead is minimal compared to the benefits of isolation and resource control.

    Conclusion

    Namespaces and cgroups are the foundation of container isolation. Namespaces control what a process can see by giving it its own view of system resources, while cgroups control how much CPU, memory, and I/O it can use.

    Together, they enable containers to be lightweight, secure, and efficient. Understanding how namespaces and cgroups work is essential for working with containers effectively.

    If you're managing deployments at scale, platforms like ServerlessBase can help you automate container management and ensure consistent resource allocation across your infrastructure.
