ServerlessBase Blog
  • Database Indexing: How It Works and Why It Matters

    A comprehensive guide to database indexing, covering B-tree structures, query optimization, and best practices for improving database performance.

    Database Indexing: How It Works and Why It Matters

    You've probably experienced the frustration of a slow database query. You run a simple SELECT * FROM users WHERE email = 'user@example.com' and wait several seconds for the result. Meanwhile, the same query on a properly indexed table returns instantly. The difference isn't your code or your data—it's indexing.

    Database indexing is one of the most powerful performance optimization techniques available, yet it's often misunderstood or misused. Let's break down what indexing actually does, how it works under the hood, and when to use it effectively.

    What Is a Database Index?

    Think of a database index like the index at the back of a book. When you want to find a specific topic, you don't read every page sequentially. Instead, you jump directly to the relevant section using the index. A database index works the same way: it creates a separate data structure that allows the database engine to quickly locate rows without scanning every single row in the table.

    Without indexes, a database must perform a full table scan—it examines every row to find the matching data. This is fine for small tables, but becomes prohibitively slow as tables grow to millions or billions of rows.

    How Indexes Work: The B-Tree Structure

    Most relational databases use B-tree (balanced tree) structures for indexes. A B-tree is a self-balancing tree data structure that maintains sorted data and allows for efficient search, insertion, and deletion operations.

    Visualizing a B-Tree

    Imagine a B-tree as a hierarchy of nodes:

    Root Node
    ├── Left Child (values < 50)
    │   ├── Left Grandchild (values < 25)
    │   │   ├── Leaf Node: [10, 15, 20]
    │   │   └── Leaf Node: [22, 24]
    │   └── Right Grandchild (values 25-49)
    │       ├── Leaf Node: [30, 35, 40]
    │       └── Leaf Node: [45, 48]
    └── Right Child (values > 50)
        ├── Left Grandchild (values 51-75)
        │   ├── Leaf Node: [55, 60, 65]
        │   └── Leaf Node: [70, 72, 74]
        └── Right Grandchild (values > 75)
            └── Leaf Node: [80, 85, 90]

    When you query for a value like 35, the database starts at the root, compares 35 with the root's values, and follows the appropriate child pointer. It repeats this process until it reaches a leaf node, where it finds the exact value or determines that the value doesn't exist.

    B-Tree Advantages

    B-trees offer several key advantages:

    • Balanced structure: All leaf nodes are at the same depth, ensuring consistent query performance regardless of where the data is located.
    • Efficient lookups: Each query requires at most O(log n) comparisons, where n is the number of indexed values.
    • Range queries: B-trees excel at range queries (e.g., WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31') because they maintain sorted data.
    • Disk-friendly: B-trees are designed to work well with disk storage, minimizing I/O operations.

    Index Types and When to Use Them

    Single-Column Indexes

    The simplest form of indexing, a single-column index indexes only one column:

    CREATE INDEX idx_users_email ON users(email);

    This index is useful for queries that filter on a single column:

    -- Uses the index efficiently
    SELECT * FROM users WHERE email = 'user@example.com';
     
    -- Cannot use the index
    SELECT * FROM users WHERE email LIKE '%@example.com';

    Composite Indexes

    A composite index indexes multiple columns together. The order of columns matters:

    CREATE INDEX idx_users_created_email ON users(created_at, email);

    Composite indexes are powerful because they can satisfy multiple queries:

    -- Uses the index (first column matches)
    SELECT * FROM users WHERE created_at = '2024-01-01';
     
    -- Uses the index (both columns match)
    SELECT * FROM users WHERE created_at = '2024-01-01' AND email = 'user@example.com';
     
    -- Cannot use the index (second column matches, first doesn't)
    SELECT * FROM users WHERE email = 'user@example.com';

    Key rule: The leftmost prefix principle applies. If you have a composite index on (created_at, email), queries that filter on created_at alone can use the index, but queries that filter on email alone cannot.

    Unique Indexes

    A unique index ensures that all indexed values are distinct:

    CREATE UNIQUE INDEX idx_users_email ON users(email);

    This automatically enforces uniqueness and is often used for primary keys:

    CREATE TABLE users (
      id SERIAL PRIMARY KEY,
      email VARCHAR(255) UNIQUE NOT NULL
    );

    Partial Indexes

    A partial index indexes only a subset of rows based on a condition:

    CREATE INDEX idx_active_users_email ON users(email)
    WHERE status = 'active';

    This is more efficient than a full index because it's smaller and faster to maintain. Use partial indexes for:

    • Frequently filtered columns (e.g., only active users)
    • Low-cardinality columns (e.g., only status = 'pending')
    • Historical data (e.g., only records older than 1 year)

    Expression Indexes

    An expression index indexes the result of an expression rather than the column value itself:

    CREATE INDEX idx_users_lower_email ON users(LOWER(email));

    This enables efficient case-insensitive searches:

    -- Uses the expression index
    SELECT * FROM users WHERE LOWER(email) = 'user@example.com';

    Indexing Trade-offs

    Performance Benefits

    Proper indexing dramatically improves query performance:

    Query TypeWithout IndexWith Index
    Single row lookupFull table scan (O(n))B-tree traversal (O(log n))
    Range queryFull table scan (O(n))B-tree traversal (O(log n))
    InsertO(1)O(log n) for index update
    UpdateO(1)O(log n) for index update
    DeleteO(1)O(log n) for index update

    For a table with 1 million rows, a full table scan might take 500ms, while a B-tree index lookup takes 0.1ms—a 5000x improvement.

    Storage Overhead

    Indexes consume additional storage space. A typical rule of thumb is that indexes use 10-30% of the table's storage. For a 100GB table, you might need 10-30GB of additional disk space for indexes.

    Write Performance Impact

    Every INSERT, UPDATE, and DELETE operation must update all relevant indexes. This means:

    • INSERT: O(log n) index updates
    • UPDATE: O(log n) index updates for each indexed column modified
    • DELETE: O(log n) index updates

    For high-write workloads, excessive indexes can become a bottleneck. The key is to balance read performance with write overhead.

    Common Indexing Mistakes

    1. Indexing Columns Used in LIKE with Leading Wildcards

    -- This query CANNOT use the index
    SELECT * FROM users WHERE email LIKE '%@example.com';
     
    -- This query CAN use the index
    SELECT * FROM users WHERE email LIKE 'user%@example.com';

    Leading wildcards (%pattern) prevent index usage because the database cannot determine where to start searching in the index.

    2. Over-Indexing

    Creating too many indexes leads to:

    • Increased storage costs
    • Slower write operations
    • Index maintenance overhead
    • Confusion about which index to use

    Rule: Only index columns that are frequently used in WHERE, JOIN, and ORDER BY clauses.

    3. Ignoring Index Selectivity

    Index selectivity measures how many unique values exist in a column:

    -- High selectivity (good for indexing)
    SELECT COUNT(DISTINCT email) / COUNT(*) FROM users;
    -- Result: 0.95 (95% of emails are unique)
     
    -- Low selectivity (poor for indexing)
    SELECT COUNT(DISTINCT status) / COUNT(*) FROM users;
    -- Result: 0.33 (only 3 unique statuses)

    Columns with high selectivity (many unique values) make better indexes than columns with low selectivity (few unique values).

    4. Forgetting to Index Foreign Keys

    Foreign key columns should almost always be indexed:

    CREATE TABLE orders (
      id SERIAL PRIMARY KEY,
      user_id INTEGER REFERENCES users(id),
      created_at TIMESTAMP DEFAULT NOW()
    );
     
    -- Index the foreign key column
    CREATE INDEX idx_orders_user_id ON orders(user_id);

    Without this index, JOIN operations between tables become full table scans.

    Practical Indexing Strategy

    Step 1: Identify Slow Queries

    Start by identifying queries that are slow:

    -- PostgreSQL: Enable query logging
    ALTER DATABASE your_database SET log_min_duration_statement = 1000;
     
    -- MySQL: Enable slow query log
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 1;

    Step 2: Analyze Query Plans

    Use the EXPLAIN command to see how queries use indexes:

    EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'user@example.com';

    Look for:

    • Index Scan or Index Only Scan: Good, uses the index
    • Seq Scan: Bad, full table scan
    • Bitmap Index Scan: Acceptable for large tables

    Step 3: Create Appropriate Indexes

    Based on your analysis, create indexes:

    -- Single-column index for simple lookups
    CREATE INDEX idx_users_email ON users(email);
     
    -- Composite index for common query patterns
    CREATE INDEX idx_orders_user_created ON orders(user_id, created_at);
     
    -- Partial index for frequently filtered data
    CREATE INDEX idx_active_orders ON orders(id)
    WHERE status = 'pending';

    Step 4: Monitor and Maintain

    Regularly monitor index usage and performance:

    -- PostgreSQL: Check index usage statistics
    SELECT
      schemaname,
      tablename,
      indexname,
      idx_scan as index_scans,
      idx_tup_read as tuples_read,
      idx_tup_fetch as tuples_fetched
    FROM pg_stat_user_indexes
    ORDER BY idx_scan ASC;
     
    -- MySQL: Check index usage
    SELECT
      object_schema,
      object_name,
      index_name,
      count_star as index_scans
    FROM performance_schema.table_io_waits_summary_by_index_usage
    WHERE index_name IS NOT NULL
    ORDER BY count_star DESC;

    Drop indexes that are rarely used:

    DROP INDEX IF EXISTS idx_unused_column;

    Advanced Indexing Techniques

    Covering Indexes

    A covering index includes all columns needed for a query, allowing the database to satisfy the query entirely from the index without accessing the table:

    -- Table definition
    CREATE TABLE users (
      id SERIAL PRIMARY KEY,
      email VARCHAR(255) NOT NULL,
      name VARCHAR(255),
      created_at TIMESTAMP DEFAULT NOW()
    );
     
    -- Covering index for SELECT queries
    CREATE INDEX idx_users_email_name ON users(email, name);
     
    -- This query uses the covering index
    SELECT email, name FROM users WHERE email = 'user@example.com';

    Covering indexes are extremely efficient because they avoid table lookups entirely.

    Indexes for Sorting

    Index columns used in ORDER BY clauses can dramatically improve performance:

    -- Without index: Full table scan + sort (O(n log n))
    SELECT * FROM users ORDER BY created_at DESC;
     
    -- With index: Index scan (O(log n))
    SELECT * FROM users ORDER BY created_at DESC;

    Indexes for Joins

    For JOIN operations, index the columns used in the JOIN condition:

    -- Table A
    CREATE TABLE orders (
      id SERIAL PRIMARY KEY,
      user_id INTEGER NOT NULL,
      amount DECIMAL(10, 2)
    );
     
    -- Table B
    CREATE TABLE users (
      id SERIAL PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      email VARCHAR(255) NOT NULL
    );
     
    -- Index foreign key columns for efficient JOINs
    CREATE INDEX idx_orders_user_id ON orders(user_id);
    CREATE INDEX idx_users_id ON users(id);

    Indexing Best Practices

    1. Start with the most selective columns: Index columns with high selectivity first.
    2. Use composite indexes for common query patterns: Analyze your actual queries, not hypothetical ones.
    3. Monitor index usage: Drop indexes that aren't being used.
    4. Consider partial indexes: Filter indexes for frequently accessed subsets of data.
    5. Balance read and write performance: More indexes improve reads but slow writes.
    6. Test index changes: Always test indexing changes in a staging environment.
    7. Use appropriate index types: Choose the right index type for your use case (B-tree, hash, GiST, GIN, etc.).
    8. Avoid over-indexing: Every index adds overhead.

    Conclusion

    Database indexing is a fundamental optimization technique that can transform slow queries into fast ones. By understanding how indexes work—particularly B-tree structures—and applying the principles outlined above, you can significantly improve your database performance.

    The key is to be intentional about indexing: analyze your actual query patterns, create indexes that address real performance issues, and regularly monitor and maintain your indexes. Remember that indexing is not a set-it-and-forget-it solution—it requires ongoing attention and optimization as your data and queries evolve.

    For production applications, platforms like ServerlessBase can help manage database deployments and monitoring, making it easier to implement and maintain effective indexing strategies across your infrastructure.

    Next Steps

    Now that you understand database indexing, consider exploring related topics:

    • Query optimization: Learn how to optimize individual queries beyond indexing
    • Database normalization: Understand how schema design affects indexing efficiency
    • Database performance monitoring: Set up tools to track query performance and index usage
    • Database backup strategies: Ensure your indexes are included in backup and recovery procedures

    Leave comment