Guide

Database indexing explained

A missing index is one of the cheapest performance bugs to fix and one of the most expensive to discover in production. Your query planner can scan millions of rows to find ten matches — or jump straight to them in a few page reads if the right B-tree exists. Indexes are not free: every insert and update must maintain them, and a pile of redundant indexes will slow writes and bloat backups. This guide explains how relational databases use indexes, which index shapes match which queries, how to read EXPLAIN plans, and the checklist teams use before adding another key — whether you are tuning a REST API backed by PostgreSQL or debugging a slow report that only shows up after normalization split one wide table into five.

What an index actually is

An index is a separate data structure the database maintains alongside your table rows. Instead of reading every row to satisfy a WHERE clause, the planner walks the index — usually a B-tree (balanced tree of sorted keys pointing to row locations) — and retrieves only matching entries. Think of a phone book sorted by last name: finding "Nguyen" does not require reading every page from A to Z.

Most OLTP workloads (Postgres, MySQL InnoDB, SQL Server) default to B-trees for equality and range queries on scalar columns. Other structures appear for specialized cases: hash indexes for exact-match-only lookups (limited in Postgres), GIN/GiST for full-text search and JSON containment, and BRIN for very large append-only tables with natural physical ordering. If you are new to SQL itself, start with SQL fundamentals before diving into planner details.

The critical mental model: an index trades read speed for write cost and storage. Every INSERT, UPDATE on indexed columns, and DELETE must update each relevant index tree. That is why "index everything" is as bad as "index nothing."

Primary, unique, and foreign-key indexes

Primary key indexes are created automatically when you declare PRIMARY KEY. They uniquely identify rows and are the default join target in most schemas. Use stable surrogate keys (UUID or bigint sequences) unless you have a strong natural key — email addresses change; order numbers might collide across shards.

Unique indexes enforce one value per key (or per key combination) and accelerate lookups on those columns. A common pattern: CREATE UNIQUE INDEX users_email_idx ON users (lower(email)); — functional indexes let you enforce case-insensitive uniqueness without storing duplicate normalized columns.

Foreign-key indexes are not always created automatically. If orders.user_id references users.id but only users.id is indexed, every join from orders to users and every DELETE FROM users WHERE id = ? may scan the entire orders table to check referential integrity. Index the referencing column: CREATE INDEX orders_user_id_idx ON orders (user_id);. This single addition often fixes slow dashboards and is essential when you batch-load children in an N+1 fix with WHERE user_id IN (...).

Compound indexes and the left-prefix rule

A compound (composite) index spans multiple columns in a fixed order: CREATE INDEX orders_status_created_idx ON orders (status, created_at DESC);. The order matters because B-trees sort lexicographically by the first column, then break ties with the second, and so on.

The left-prefix rule: this index efficiently serves queries that filter on status alone, or on status plus created_at, but not queries that filter only on created_at without status. If both patterns are hot, you may need two indexes — or redesign the query.

Put the most selective column first when equality filters precede range filters. A typical pattern for "active orders in a date range":

WHERE status = 'open' AND created_at > '2026-01-01' — index (status, created_at) works well because status is equality and created_at is a range on the remaining tail.
WHERE created_at > '2026-01-01' alone — needs (created_at) or a different index; the compound index above will not help.

Avoid duplicating single-column indexes when a compound index already covers the left prefix — extra indexes slow writes without helping reads.

Covering indexes and index-only scans

A covering index includes every column the query needs, so the database never touches the heap (table) at all — an index-only scan. In PostgreSQL:

CREATE INDEX orders_covering_idx
  ON orders (user_id, created_at DESC)
  INCLUDE (total_cents, currency);

A query like SELECT total_cents, currency FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20 can be satisfied entirely from the index pages. INCLUDE columns (Postgres 11+) are stored in leaf nodes but are not part of the search key — useful for wide tables where you only filter on a few columns but select a handful more.

MySQL offers similar behavior when all selected columns exist in a secondary index; the planner reports Using index in EXPLAIN. Covering indexes are powerful for read-heavy list endpoints but enlarge index size — measure before adding INCLUDE on large text columns.

Partial and expression indexes

A partial index indexes only rows matching a predicate:

CREATE INDEX orders_open_idx
  ON orders (created_at)
  WHERE status = 'open';

If 95% of orders are archived and queries only touch open ones, this index is smaller and faster than indexing all rows. Partial indexes shine for soft-delete patterns (WHERE deleted_at IS NULL), status queues, and boolean flags that are skewed.

Expression indexes index computed values — lowercase email, JSON fields, date truncations: CREATE INDEX events_day_idx ON events ((created_at::date));. The query must use the same expression in the WHERE clause for the planner to match it.

Reading EXPLAIN plans

Never guess — run EXPLAIN (ANALYZE, BUFFERS) on Postgres or EXPLAIN FORMAT=JSON on MySQL for the actual query with production-like parameters. Red flags:

Seq Scan / ALL on large tables — full table scan. Acceptable for tiny tables; catastrophic at millions of rows unless you need most rows anyway.
Rows estimate vs actual wildly off — stale statistics; run ANALYZE or increase stats targets on skewed columns.
Nested Loop with huge inner loops — missing join index on the inner table's filter column.
Sort on large row sets — consider an index that matches ORDER BY to avoid an in-memory or disk sort.
Bitmap Heap Scan followed by recheck — index used but table visits still costly; covering index may help.

Compare execution time and shared buffers hit before and after index changes. A plan that uses an index but is slower than a seq scan on a 200-row table is normal — the planner chose correctly.

Pair EXPLAIN work with connection pool metrics: faster queries release connections sooner under load, but pathological lock contention from hot index pages is a separate problem addressed by transaction isolation tuning, not more indexes.

Write overhead, bloat, and maintenance

Each secondary index duplicates work on every row change. High-write tables (event logs, ledger entries, click streams) should carry fewer indexes than read-heavy dimension tables. Bulk loads often drop non-essential indexes, load data, then recreate indexes — faster than maintaining trees row by row.

Index bloat accumulates when many updates leave dead leaf pages; REINDEX or pg_repack (Postgres) and OPTIMIZE TABLE (MySQL) reclaim space. Monitor index size vs table size; bloated indexes waste cache.

Unused indexes still cost writes. Postgres exposes pg_stat_user_indexes.idx_scan; zero scans over weeks means candidates for removal after confirming no rare reporting job depends on them.

When not to index

Low-cardinality columns alone — boolean is_active on a balanced table often yields a seq scan anyway; combine with selective columns or use partial indexes.
Tiny tables — under a few thousand rows, seq scans are fine.
Columns you never filter or join on — indexes on unreferenced display-only fields waste space.
Leading wildcard LIKE — LIKE '%foo' cannot use a standard B-tree; use trigram (pg_trgm) or full-text indexes.
Over-indexing write-heavy paths — payment and inventory tables need restraint; cache hot reads in Redis after indexing the underlying lookup keys.

Production checklist

Index every foreign-key referencing column used in joins or cascades.
Match compound index column order to equality filters before range/sort columns.
Run EXPLAIN ANALYZE on top ten slow queries from APM logs weekly.
Keep table statistics fresh — automate ANALYZE after large ETL jobs.
Prefer partial indexes for skewed status or soft-delete filters.
Evaluate covering indexes for hot list endpoints with narrow SELECT lists.
Track index scan counts; drop indexes with zero use after a full business cycle.
Load-test write throughput after adding indexes to high-insert tables.
Document index rationale in migration comments — future you will forget why (status, created_at) exists.
Revisit indexes after schema normalization or sharding — join patterns change.

Key takeaways

Indexes are sorted lookup structures — usually B-trees — that avoid full table scans.
Foreign-key columns need indexes on the referencing side for fast joins and integrity checks.
Compound indexes follow the left-prefix rule; column order must match query patterns.
Covering and partial indexes are high-leverage tools for read-heavy, skewed workloads.
EXPLAIN ANALYZE is mandatory; estimates, scan types, and buffer hits tell the real story.
Every index taxes writes — add deliberately, measure, and prune unused keys.