Guide
Database indexing explained
A missing index is one of the cheapest performance bugs to fix and one of the most expensive to discover in production. Your query planner can scan millions of rows to find ten matches — or jump straight to them in a few page reads if the right B-tree exists. Indexes are not free: every insert and update must maintain them, and a pile of redundant indexes will slow writes and bloat backups. This guide explains how relational databases use indexes, which index shapes match which queries, how to read EXPLAIN plans, and the checklist teams use before adding another key — whether you are tuning a REST API backed by PostgreSQL or debugging a slow report that only shows up after normalization split one wide table into five.
What an index actually is
An index is a separate data structure the database maintains alongside your table rows.
Instead of reading every row to satisfy a WHERE clause, the planner walks
the index — usually a B-tree (balanced tree of sorted keys pointing to
row locations) — and retrieves only matching entries. Think of a phone book sorted by
last name: finding "Nguyen" does not require reading every page from A to Z.
Most OLTP workloads (Postgres, MySQL InnoDB, SQL Server) default to B-trees for equality and range queries on scalar columns. Other structures appear for specialized cases: hash indexes for exact-match-only lookups (limited in Postgres), GIN/GiST for full-text search and JSON containment, and BRIN for very large append-only tables with natural physical ordering. If you are new to SQL itself, start with SQL fundamentals before diving into planner details.
The critical mental model: an index trades read speed for
write cost and storage. Every INSERT, UPDATE
on indexed columns, and DELETE must update each relevant index tree. That
is why "index everything" is as bad as "index nothing."
Primary, unique, and foreign-key indexes
Primary key indexes are created automatically when you declare
PRIMARY KEY. They uniquely identify rows and are the default join target
in most schemas. Use stable surrogate keys (UUID or bigint sequences) unless you have
a strong natural key — email addresses change; order numbers might collide across shards.
Unique indexes enforce one value per key (or per key combination) and
accelerate lookups on those columns. A common pattern:
CREATE UNIQUE INDEX users_email_idx ON users (lower(email)); — functional
indexes let you enforce case-insensitive uniqueness without storing duplicate
normalized columns.
Foreign-key indexes are not always created automatically. If
orders.user_id references users.id but only
users.id is indexed, every join from orders to users and every
DELETE FROM users WHERE id = ? may scan the entire orders table to check
referential integrity. Index the referencing column:
CREATE INDEX orders_user_id_idx ON orders (user_id);. This single addition
often fixes slow dashboards and is essential when you batch-load children in an
N+1 fix
with WHERE user_id IN (...).
Compound indexes and the left-prefix rule
A compound (composite) index spans multiple columns in a fixed order:
CREATE INDEX orders_status_created_idx ON orders (status, created_at DESC);.
The order matters because B-trees sort lexicographically by the first column, then
break ties with the second, and so on.
The left-prefix rule: this index efficiently serves queries that filter
on status alone, or on status plus created_at,
but not queries that filter only on created_at without
status. If both patterns are hot, you may need two indexes — or redesign
the query.
Put the most selective column first when equality filters precede range filters. A typical pattern for "active orders in a date range":
WHERE status = 'open' AND created_at > '2026-01-01'— index(status, created_at)works well becausestatusis equality andcreated_atis a range on the remaining tail.WHERE created_at > '2026-01-01'alone — needs(created_at)or a different index; the compound index above will not help.
Avoid duplicating single-column indexes when a compound index already covers the left prefix — extra indexes slow writes without helping reads.
Covering indexes and index-only scans
A covering index includes every column the query needs, so the database never touches the heap (table) at all — an index-only scan. In PostgreSQL:
CREATE INDEX orders_covering_idx
ON orders (user_id, created_at DESC)
INCLUDE (total_cents, currency);
A query like SELECT total_cents, currency FROM orders WHERE user_id = $1 ORDER BY
created_at DESC LIMIT 20 can be satisfied entirely from the index pages. INCLUDE
columns (Postgres 11+) are stored in leaf nodes but are not part of the search key —
useful for wide tables where you only filter on a few columns but select a handful more.
MySQL offers similar behavior when all selected columns exist in a secondary index; the
planner reports Using index in EXPLAIN. Covering indexes are powerful for
read-heavy list endpoints but enlarge index size — measure before adding INCLUDE on
large text columns.
Partial and expression indexes
A partial index indexes only rows matching a predicate:
CREATE INDEX orders_open_idx
ON orders (created_at)
WHERE status = 'open';
If 95% of orders are archived and queries only touch open ones, this index is smaller
and faster than indexing all rows. Partial indexes shine for soft-delete patterns
(WHERE deleted_at IS NULL), status queues, and boolean flags that are
skewed.
Expression indexes index computed values — lowercase email, JSON
fields, date truncations:
CREATE INDEX events_day_idx ON events ((created_at::date));. The query
must use the same expression in the WHERE clause for the planner to match it.
Reading EXPLAIN plans
Never guess — run EXPLAIN (ANALYZE, BUFFERS) on Postgres or
EXPLAIN FORMAT=JSON on MySQL for the actual query with production-like
parameters. Red flags:
- Seq Scan / ALL on large tables — full table scan. Acceptable for tiny tables; catastrophic at millions of rows unless you need most rows anyway.
- Rows estimate vs actual wildly off — stale statistics; run
ANALYZEor increase stats targets on skewed columns. - Nested Loop with huge inner loops — missing join index on the inner table's filter column.
- Sort on large row sets — consider an index that matches
ORDER BYto avoid an in-memory or disk sort. - Bitmap Heap Scan followed by recheck — index used but table visits still costly; covering index may help.
Compare execution time and shared buffers hit before and after index changes. A plan that uses an index but is slower than a seq scan on a 200-row table is normal — the planner chose correctly.
Pair EXPLAIN work with connection pool metrics: faster queries release connections sooner under load, but pathological lock contention from hot index pages is a separate problem addressed by transaction isolation tuning, not more indexes.
Write overhead, bloat, and maintenance
Each secondary index duplicates work on every row change. High-write tables (event logs, ledger entries, click streams) should carry fewer indexes than read-heavy dimension tables. Bulk loads often drop non-essential indexes, load data, then recreate indexes — faster than maintaining trees row by row.
Index bloat accumulates when many updates leave dead leaf pages;
REINDEX or pg_repack (Postgres) and OPTIMIZE TABLE
(MySQL) reclaim space. Monitor index size vs table size; bloated indexes waste cache.
Unused indexes still cost writes. Postgres exposes
pg_stat_user_indexes.idx_scan; zero scans over weeks means candidates for
removal after confirming no rare reporting job depends on them.
When not to index
- Low-cardinality columns alone — boolean
is_activeon a balanced table often yields a seq scan anyway; combine with selective columns or use partial indexes. - Tiny tables — under a few thousand rows, seq scans are fine.
- Columns you never filter or join on — indexes on unreferenced display-only fields waste space.
- Leading wildcard LIKE —
LIKE '%foo'cannot use a standard B-tree; use trigram (pg_trgm) or full-text indexes. - Over-indexing write-heavy paths — payment and inventory tables need restraint; cache hot reads in Redis after indexing the underlying lookup keys.
Production checklist
- Index every foreign-key referencing column used in joins or cascades.
- Match compound index column order to equality filters before range/sort columns.
- Run EXPLAIN ANALYZE on top ten slow queries from APM logs weekly.
- Keep table statistics fresh — automate ANALYZE after large ETL jobs.
- Prefer partial indexes for skewed status or soft-delete filters.
- Evaluate covering indexes for hot list endpoints with narrow SELECT lists.
- Track index scan counts; drop indexes with zero use after a full business cycle.
- Load-test write throughput after adding indexes to high-insert tables.
- Document index rationale in migration comments — future you will forget why
(status, created_at)exists. - Revisit indexes after schema normalization or sharding — join patterns change.
Key takeaways
- Indexes are sorted lookup structures — usually B-trees — that avoid full table scans.
- Foreign-key columns need indexes on the referencing side for fast joins and integrity checks.
- Compound indexes follow the left-prefix rule; column order must match query patterns.
- Covering and partial indexes are high-leverage tools for read-heavy, skewed workloads.
- EXPLAIN ANALYZE is mandatory; estimates, scan types, and buffer hits tell the real story.
- Every index taxes writes — add deliberately, measure, and prune unused keys.
Related reading
- SQL fundamentals explained — SELECT, JOIN, and WHERE before tuning plans
- N+1 query problem explained — batch loaders need indexes on IN-clause columns
- Database normalization explained — more joins after normalization means more index discipline
- SQL joins explained — how join order and keys interact with the planner