ServerlessBase Blog

    Introduction to Server Hardware Monitoring

    You've deployed your application, configured your databases, and set up your monitoring stack. Everything looks good in the logs, but you're still worried. Is the CPU actually being overutilized? Is the disk I/O causing latency? Is the memory leak slowly eating up resources? These are the questions that keep DevOps engineers up at night.

    Server hardware monitoring is the practice of tracking the physical and virtual resources of your servers in real-time. Unlike application-level monitoring, which tells you that a service is down, hardware monitoring tells you why it's struggling. A slow database might be a query problem, or it might be that the disk is full. High CPU usage might indicate a runaway process, or it could be a legitimate workload spike.

    In this article, we'll cover the essential hardware metrics you should monitor, the tools available to do it, and a practical walkthrough of setting up comprehensive monitoring for a Linux server. By the end, you'll have a clear picture of what to watch, how to interpret the data, and how to set up alerts that actually help you catch problems before they impact users.

    Core Hardware Metrics to Monitor

    Before diving into tools, let's establish what we're actually monitoring. Server hardware consists of several distinct components, each with its own characteristics and failure modes. Monitoring each component separately gives you a complete picture of your server's health.

    CPU Metrics

    The central processing unit is the brain of your server. When CPU metrics are healthy, your applications run smoothly. When they're unhealthy, everything slows down. The key metrics to track include:

    • CPU Utilization: The percentage of time the CPU is actively processing work. High utilization (above 80%) often indicates a bottleneck, but context matters. A web server might run at 90% utilization during peak traffic without issues, while a database server at 90% might be struggling to keep up.

    • CPU Load Average: A three-number average representing the number of processes waiting for CPU time over the last 1, 5, and 15 minutes (on Linux, this also counts processes blocked in uninterruptible disk I/O). Load averages consistently above the number of CPU cores suggest the system is overloaded. For example, a 4-core server with a load average of 8.0 is significantly stressed.

    • CPU Context Switches: The number of times the CPU switches from one task to another. High context switch rates can indicate inefficient scheduling, often caused by too many processes competing for CPU time or a poorly designed application.

    • User vs System vs Idle Time: User time is time spent executing user processes, system time is time spent executing kernel code, and idle time is time the CPU has nothing to do. A healthy profile is mostly user time for your applications, a modest amount of system time for the OS, and idle time when traffic is low.
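    To make the utilization arithmetic concrete, here's a minimal shell sketch that computes the busy percentage from a single /proc/stat-style sample. The jiffy values below are made up; a real monitor would diff two samples taken a few seconds apart.

```shell
# One cumulative "cpu" line in /proc/stat format (hypothetical sample values):
# label user nice system idle iowait irq softirq
line="cpu  8000 100 1500 90000 400 0 100"

# Busy percentage = (total - idle - iowait) / total
busy=$(echo "$line" | awk '{
  total = $2 + $3 + $4 + $5 + $6 + $7 + $8
  idle  = $5 + $6
  printf "%.1f", (total - idle) / total * 100
}')
echo "CPU busy: ${busy}%"
```

    Tools like Node Exporter perform this same calculation continuously, exposing the raw counters so the rate of change can be computed at query time.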

    Memory Metrics

    Memory is often misunderstood. Unlike disk space, memory is volatile—if it's not being used, it's not a problem. The key is to monitor memory usage patterns, not just the raw numbers.

    • Total Memory: The physical RAM installed in the server. This is a static value, but it's useful for context.

    • Used Memory: The amount of memory currently in use. This includes both active processes and cached files. Don't be alarmed by high "used" memory if a large portion is actually cached—this is a good thing.

    • Available Memory: The amount of memory that's actually free for new processes. This is the metric that matters most. If available memory drops below a threshold (e.g., 10%), your system will start swapping to disk, which is a performance killer.

    • Swap Usage: When physical memory is exhausted, the system uses disk space as virtual memory. Swap is much slower than RAM, so high swap usage is a red flag. Ideally, swap should be near zero.

    • Page Faults: The number of times a process accesses memory that isn't currently mapped for it. Minor faults are cheap, but major faults require reading data from disk; a high major fault rate indicates memory pressure and can cause significant slowdowns.
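    The available-memory check described above can be sketched in shell. The /proc/meminfo-style values below are hypothetical, and the 10% threshold is the one mentioned earlier:

```shell
# Two lines in /proc/meminfo format (hypothetical sample values, in kB):
meminfo="MemTotal:       8000000 kB
MemAvailable:    640000 kB"

# Compute available memory as a percentage of total
avail_pct=$(echo "$meminfo" | awk '/^MemTotal/ {t=$2} /^MemAvailable/ {a=$2} END {printf "%.0f", a / t * 100}')
echo "Available memory: ${avail_pct}%"

# Warn when available memory drops below the 10% threshold
if [ "$avail_pct" -lt 10 ]; then
  echo "WARNING: available memory below 10%"
fi
```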

    Disk Metrics

    Disk performance is often the hidden bottleneck in server infrastructure. Applications that run fine on one server might struggle on another simply because the disk I/O is slower.

    • Disk Utilization: The percentage of time the disk is busy. High utilization (above 70-80%) means the disk is constantly working, leaving no room for spikes in I/O demand.

    • Disk I/O Operations per Second (IOPS): The number of read and write operations the disk can handle per second. Different disk types have different IOPS capabilities. SSDs can handle thousands of IOPS, while traditional HDDs might only handle a few hundred.

    • Disk Throughput: The amount of data transferred per second, typically measured in MB/s. This is important for large file operations, backups, and streaming workloads.

    • Disk Space Usage: The percentage of disk capacity used. Running out of disk space is a common cause of application failures. It's also important to monitor disk health indicators like SMART errors, which can predict drive failures before they happen.

    • Queue Length: The number of I/O requests waiting to be processed. A queue length of zero means the disk can handle all incoming requests immediately. A queue length of 10 or more indicates the disk is overwhelmed.
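    As a sketch of a disk-space check, the following parses a df -P-style data line. The sample values and the 85% threshold are illustrative:

```shell
# One data line in `df -P /` format (hypothetical sample):
# filesystem 1K-blocks used available use% mountpoint
dfline="/dev/sda1 103081248 92773123 10308125 90% /"

# Extract the use% column and strip the percent sign
used_pct=$(echo "$dfline" | awk '{gsub(/%/, "", $5); print $5}')

# Alert when usage crosses the threshold
if [ "$used_pct" -ge 85 ]; then
  echo "ALERT: root filesystem is ${used_pct}% full"
fi
```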

    Network Metrics

    Network performance is critical for distributed systems, APIs, and any application that serves external users. Network issues can be subtle and hard to diagnose, making monitoring essential.

    • Network Throughput: The amount of data transferred over the network per second. This is typically measured in bits per second (bps) or megabits per second (Mbps). High throughput is good, but it's important to distinguish between inbound and outbound traffic.

    • Network Latency: The time it takes for a packet to travel from source to destination. High latency can make applications feel sluggish, especially for interactive features or real-time systems.

    • Packet Loss: The percentage of packets that don't arrive at their destination. Packet loss is a serious issue that can cause data corruption, connection drops, and application errors.

    • Network Errors: The number of transmission errors, such as checksum errors or collisions. High error rates indicate physical layer problems, such as bad cabling or faulty network hardware.

    • Connection Count: The number of active network connections. An unusually high number of connections can indicate a DDoS attack, a misbehaving application, or a resource leak.
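    Counting established connections can be as simple as filtering connection-state output (such as from ss -tan, where the first field is the TCP state). The sample lines below are made up:

```shell
# Lines in `ss -tan` format (hypothetical sample; first field is the TCP state):
conns="ESTAB 0 0 10.0.0.5:443 203.0.113.7:51514
ESTAB 0 0 10.0.0.5:443 203.0.113.9:40222
TIME-WAIT 0 0 10.0.0.5:443 203.0.113.9:40210"

# Count only connections in the ESTABLISHED state
established=$(echo "$conns" | grep -c '^ESTAB')
echo "Established connections: ${established}"
```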

    Comparison of Monitoring Approaches

    Different monitoring approaches offer different trade-offs in terms of complexity, cost, and granularity. Understanding these trade-offs helps you choose the right approach for your needs.

    | Factor | Agent-Based Monitoring | Agentless Monitoring | Cloud Provider Monitoring |
    |---|---|---|---|
    | Setup Complexity | High - requires installing software on each server | Low - uses existing protocols | Low - integrated with cloud console |
    | Data Collection | Direct, frequent, customizable | Indirect, protocol-dependent | Limited to provider-specific metrics |
    | Cost | Free (open-source agents) or paid (commercial) | Free | Often included, but can be expensive at scale |
    | Scalability | Good, but agent management adds overhead | Excellent, no agent overhead | Excellent, but vendor lock-in |
    | Custom Metrics | Full control over what to monitor | Limited to what protocols expose | Very limited, provider-specific only |
    | Best For | On-premise, hybrid, multi-cloud environments | Simple setups, quick monitoring | Cloud-native workloads, AWS/Azure/GCP users |

    Agent-based monitoring involves installing a monitoring agent on each server. The agent collects metrics locally and sends them to a central collector. This approach offers the most flexibility and control but requires ongoing maintenance. Popular agent-based tools include Prometheus with Node Exporter, Datadog Agent, and New Relic Agent.

    Agentless monitoring relies on existing protocols like SNMP, SSH, or cloud provider APIs to collect metrics. This approach is simpler to set up but offers less granularity and control. It's well-suited for quick monitoring needs or environments where you can't install agents.

    Cloud provider monitoring is built into services like AWS CloudWatch, Azure Monitor, and Google Cloud Monitoring. These services provide excellent visibility into cloud resources but are limited to what the provider exposes. They're ideal for cloud-native workloads but don't help with on-premise or hybrid environments.

    Setting Up Hardware Monitoring with Prometheus and Node Exporter

    Now let's walk through a practical setup of hardware monitoring using Prometheus and Node Exporter. This combination is popular in the DevOps community because it's open-source, flexible, and widely supported.

    Prerequisites

    You'll need a Linux server with root or sudo access. For this example, we'll use Ubuntu 22.04 LTS, but the process is similar on other distributions. You'll also need a Prometheus server, which we'll set up in a later step.

    Step 1: Install Node Exporter

    Node Exporter is a lightweight agent that collects hardware and kernel-related metrics and exposes them over an HTTP endpoint. It's the standard way to collect system metrics for Prometheus.

    # Download the latest Node Exporter release
    wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
     
    # Extract the archive
    tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
     
    # Move the binary to a permanent location
    sudo mv node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
     
    # Create a system user for Node Exporter
    sudo useradd --no-create-home --shell /bin/false node_exporter
     
    # Set ownership and permissions
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
     
    # Create a systemd service file
    sudo nano /etc/systemd/system/node_exporter.service

    Add the following content to the service file:

    [Unit]
    Description=Node Exporter
    After=network.target
     
    [Service]
    User=node_exporter
    Group=node_exporter
    Type=simple
    ExecStart=/usr/local/bin/node_exporter \
      --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
     
    [Install]
    WantedBy=multi-user.target

    Save and exit the file. Then enable and start the service:

    # Reload systemd to recognize the new service
    sudo systemctl daemon-reload
     
    # Enable Node Exporter to start on boot
    sudo systemctl enable node_exporter
     
    # Start the service
    sudo systemctl start node_exporter

    Verify that Node Exporter is running:

    sudo systemctl status node_exporter

    You should see "active (running)" in the output. You can also verify the metrics endpoint:

    curl http://localhost:9100/metrics

    This will return a long list of metrics in the Prometheus text format. Each line represents a metric with its value and labels.
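    You'll usually want to filter that output rather than read it raw. Here's a sketch using a few sample lines in the same format (the metric names are standard Node Exporter metrics; the values are made up):

```shell
# A few lines in the Prometheus text format, as served at /metrics
# (standard Node Exporter metric names; hypothetical values):
metrics='node_cpu_seconds_total{cpu="0",mode="idle"} 12345.6
node_memory_MemAvailable_bytes 8.2e+09
node_filesystem_avail_bytes{mountpoint="/"} 4.1e+10'

# Filter for memory-related metrics only
echo "$metrics" | grep '^node_memory'
```

    Against a live endpoint, the equivalent is piping curl into grep, e.g. curl -s http://localhost:9100/metrics | grep '^node_memory'.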

    Step 2: Configure Prometheus to Scrape Node Exporter

    Now that Node Exporter is running, we need to configure Prometheus to scrape its metrics. Prometheus is a time-series database that collects metrics from various sources and stores them for querying and alerting.

    Edit your Prometheus configuration file (typically /etc/prometheus/prometheus.yml):

    global:
      scrape_interval: 15s
      evaluation_interval: 15s
     
    scrape_configs:
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['localhost:9100']

    This configuration tells Prometheus to scrape metrics from localhost:9100 every 15 seconds. If you have multiple servers, you can add them to the targets list:

    scrape_configs:
      - job_name: 'node_exporter'
        static_configs:
          - targets: ['server1:9100', 'server2:9100', 'server3:9100']

    After updating the configuration, restart Prometheus:

    sudo systemctl restart prometheus

    Step 3: Create Grafana Dashboards

    Prometheus is great for storing metrics, but Grafana provides the visualization and alerting capabilities that make monitoring actionable. Grafana is a popular open-source analytics and interactive visualization platform.

    First, install Grafana. On Ubuntu:

    # Import Grafana's GPG key first, so apt can verify the repository
    wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
     
    # Add Grafana's repository
    sudo apt-get install -y software-properties-common
    sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
     
    # Install Grafana
    sudo apt-get update
    sudo apt-get install -y grafana
     
    # Start and enable Grafana
    sudo systemctl start grafana-server
    sudo systemctl enable grafana-server

    Access Grafana at http://localhost:3000 (default credentials: admin/admin). Log in and navigate to Configuration > Data Sources > Add data source. Select "Prometheus" and configure it with your Prometheus URL (typically http://localhost:9090).

    After adding the data source, create a dashboard. You can either build it from scratch or import a pre-built dashboard. For Node Exporter, there are several community dashboards available. Import one of these by navigating to Dashboards > Import and pasting the dashboard ID.

    A good dashboard includes panels for CPU utilization, memory usage, disk I/O, and network traffic. Each panel should have a clear title, a relevant query, and appropriate alerting rules. For example, a CPU utilization panel might use the query 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) to show the percentage of CPU time spent in non-idle modes.

    Step 4: Set Up Alerts

    Monitoring is useless if you don't act on the data. Alerts notify you when something goes wrong, allowing you to respond before users are affected.

    In Grafana, navigate to Alerting > Alert rules and create a new rule. For example, you might create an alert for high CPU utilization:

    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU utilization on {{ $labels.instance }}"
      description: "CPU utilization is {{ $value }}% for more than 5 minutes."

    This rule triggers a warning alert when CPU utilization exceeds 80% for 5 consecutive minutes. You can also create critical alerts for more severe issues, such as disk space running out or swap usage exceeding a threshold.
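    As a sketch, a critical rule for low disk space might look like the following (node_filesystem_avail_bytes and node_filesystem_size_bytes are standard Node Exporter metrics; the mountpoint filter and 10% threshold here are illustrative):

```yaml
expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 10m
labels:
  severity: critical
annotations:
  summary: "Low disk space on {{ $labels.instance }}"
  description: "Root filesystem has only {{ $value }}% space available."
```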

    Configure notification channels in Grafana, such as email, Slack, or PagerDuty. This ensures that alerts are delivered to the right people at the right time.

    Interpreting Monitoring Data

    Collecting metrics is only half the battle. Interpreting the data correctly is what enables proactive problem-solving. Let's look at some common scenarios and how to diagnose them.

    Scenario 1: High CPU Utilization

    High CPU utilization can have many causes, from legitimate workload spikes to runaway processes. Start by identifying which CPU modes are consuming the most resources:

    top -o %CPU

    This command shows processes sorted by CPU usage. Look for processes with unusually high usage. If you see a single process consuming most of the CPU, it might be a bug or a misconfiguration. If multiple processes are consuming CPU, it could be a legitimate workload increase.

    Check the CPU load average:

    uptime

    Compare the load average to the number of CPU cores. If the load average is significantly higher than the number of cores, the system is overloaded. Consider scaling up your server or optimizing your application.
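    That comparison is easy to script. Here's a sketch that parses a sample load-average string in the format uptime prints (the load values and core count are hypothetical):

```shell
# The tail of `uptime` output (hypothetical sample values):
sample="load average: 8.00, 6.50, 4.25"
cores=4   # e.g. from `nproc`

# Extract the 1-minute load average
load1=$(echo "$sample" | awk -F'load average: ' '{print $2}' | cut -d, -f1)

# Compare fractional load against core count using awk arithmetic
verdict=$(awk -v l="$load1" -v c="$cores" 'BEGIN { v = (l > c) ? "overloaded" : "ok"; print v }')
echo "1-minute load ${load1} on ${cores} cores: ${verdict}"
```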

    Scenario 2: High Memory Usage

    High memory usage isn't always a problem. The key is to distinguish between used memory and available memory:

    free -h

    The free command shows memory usage in human-readable format. Look at the available column—if it's low (e.g., less than 10% of total memory), your system might start swapping. The same output includes a Swap row; to see which swap devices are in use, run:

    swapon --show

    If swap usage is high, your system is under memory pressure. This can cause significant performance degradation. Consider increasing memory, optimizing your application to use less memory, or adding more servers.

    Scenario 3: High Disk I/O

    High disk I/O can cause application slowdowns, even if the disk isn't full. Check disk utilization:

    iostat -x 1

    The iostat command shows disk utilization and I/O statistics. Look at the %util column—if it's consistently high (above 70%), the disk is constantly busy. Check the await column, which shows the average I/O wait time. High await values indicate slow disk performance.

    If you're using an HDD, consider upgrading to an SSD. If you're already using an SSD, check for inefficient I/O patterns, such as frequent small writes or random I/O. These patterns can degrade SSD performance over time.

    Scenario 4: Network Issues

    Network issues can be subtle and hard to diagnose. Start by checking network throughput:

    iftop -i eth0

    The iftop command shows real-time network traffic. Look for unusual spikes or sustained high traffic. Check for packet loss:

    ping -c 100 google.com

    If you see a significant number of packet losses, there might be a network connectivity issue. Check network errors:

    netstat -i

    The netstat command shows network interface statistics; on modern systems, ip -s link reports the same counters. Look for high error rates, which can indicate physical layer problems.

    Best Practices for Hardware Monitoring

    Effective monitoring requires more than just installing tools. It requires a thoughtful approach to what you monitor, how you interpret the data, and how you respond to alerts.

    Define Clear Baselines

    Every environment is different. What's normal for one server might be abnormal for another. Establish baselines for your metrics by monitoring for an extended period (at least a week) before setting up alerts. This helps you distinguish between normal fluctuations and actual problems.

    Set Appropriate Alert Thresholds

    Alert thresholds are a balancing act. Too sensitive, and you'll be overwhelmed with false positives. Too insensitive, and you'll miss real problems. Start with conservative thresholds and adjust based on your experience and environment.

    Consider using multiple alert levels (warning, critical) for the same metric. For example, CPU utilization above 80% for 5 minutes might be a warning, while CPU utilization above 95% for 1 minute might be critical.

    Monitor Trends, Not Just Snapshots

    A single data point doesn't tell you much. Trends over time provide context. Use time-based queries in Prometheus and Grafana to analyze trends. For example, instead of alerting on a single high value, alert when a metric exceeds a threshold for a sustained period.
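    For instance, rather than alerting on the instantaneous node_load1 value, you can smooth it with avg_over_time (a standard PromQL function); the 15-minute window and threshold of 4 here are illustrative:

```promql
avg_over_time(node_load1[15m]) > 4
```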

    Regularly Review and Optimize

    Monitoring infrastructure isn't a set-it-and-forget-it exercise. Regularly review your dashboards, alert rules, and data retention policies. Remove obsolete metrics, adjust thresholds, and optimize queries for performance. As your environment changes, your monitoring should evolve accordingly.

    Integrate with Incident Management

    Monitoring data should feed into your incident management process. When an alert fires, it should trigger a predefined workflow that notifies the right people, provides context, and guides the response. This reduces response time and improves incident resolution.

    Conclusion

    Server hardware monitoring is an essential practice for any production environment. By tracking CPU, memory, disk, and network metrics, you gain visibility into the health of your infrastructure and can proactively address issues before they impact users.

    The combination of Prometheus, Node Exporter, and Grafana provides a powerful, open-source monitoring stack that's widely used in the industry. This setup offers flexibility, scalability, and deep integration with alerting and visualization tools.

    Remember that monitoring is only useful if you act on the data. Establish baselines, set appropriate thresholds, and integrate monitoring with your incident management process. Regularly review and optimize your monitoring setup to ensure it remains effective as your environment evolves.

    Platforms like ServerlessBase can simplify the deployment and management of monitoring infrastructure, handling the complexity of Prometheus, Grafana, and alerting so you can focus on your applications. With the right monitoring in place, you'll have the confidence that your servers are healthy and performing as expected.

    Next Steps

    Now that you understand server hardware monitoring, consider exploring related topics:

    • Application Performance Monitoring (APM): Learn how to monitor your applications at a deeper level, including database queries, API calls, and user sessions.

    • Infrastructure as Code for Monitoring: Use tools like Terraform to provision and manage your monitoring infrastructure as code.

    • Advanced Alerting Strategies: Implement multi-level alerts, notification routing, and on-call rotation to improve incident response.

    • Log Management: Complement metrics monitoring with log aggregation and analysis tools like ELK Stack or Loki.

    By combining hardware monitoring with application monitoring and log analysis, you'll have a comprehensive observability strategy that helps you understand not just what's happening, but why it's happening.
