Recurrent neural networks (RNN, LSTM, GRU) explained

A support agent reads a ticket thread from top to bottom, updating her mental model with each new message. The order matters: “refund approved” followed by “chargeback filed” tells a different story than the reverse. Feed-forward networks treat each input independently; they have no memory of what came before. Recurrent neural networks (RNNs) address that by maintaining a hidden state that flows from timestep to timestep, letting the model condition predictions on prior context. Vanilla RNNs are elegant but struggle with long dependencies because gradients shrink across many steps, a failure known as the vanishing gradient problem. LSTM and GRU cells add gating mechanisms that selectively remember and forget; they dominated sequence modeling from roughly 2014 until transformers took over large-scale language modeling. RNNs remain valuable for streaming sensor data, lightweight on-device models, and problems where sequence length is modest and latency budgets are tight. This guide covers recurrence mechanics, backpropagation through time (BPTT), LSTM and GRU internals, bidirectional and stacked architectures, sequence I/O patterns, a worked example (the Harbor Support escalation forecaster), an architecture decision table, pitfalls, and a production checklist.

Why sequences need recurrence

Many real inputs are ordered sequences: words in a sentence, clicks in a session, heartbeats in an ECG, price ticks in a trading day. Shuffling the order destroys meaning. A bag-of-words representation loses “not good” vs “good” and cannot capture temporal patterns like “user browsed three pages then abandoned cart.”

RNNs process one element x_t per timestep t and update a hidden vector h_t that summarizes everything seen so far. The network can emit an output y_t at every step (many-to-many), emit a single output at the end (many-to-one), or generate a whole sequence from a single seed input (one-to-many). That flexibility made RNNs the default for machine translation, speech recognition, and time-series forecasting before attention-based models scaled to billions of parameters.

Sequence I/O patterns

Many-to-one — read a full review, output sentiment score; read a sensor window, output failure probability.
One-to-many — image captioning: one CNN embedding seeds an RNN that generates word by word.
Many-to-many (aligned) — part-of-speech tagging: one label per input token.
Many-to-many (encoder-decoder) — translation: encoder RNN compresses source, decoder RNN generates target (often with attention bridging the gap).

Vanilla RNN: hidden state recurrence

The simplest recurrent cell combines the current input with the previous hidden state through shared weight matrices:

h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)
y_t = W_hy · h_t + b_y

The same weights are applied at every timestep: weight sharing across time, analogous to kernel sharing across space in CNNs. The hidden state h_t is a fixed-size summary of the prefix x_1..x_t. During training, the network unrolls the recurrence for T steps and applies backpropagation through time (BPTT): gradients flow from the loss backward through each unrolled copy of the cell, and because the weights are shared, the gradient for each matrix is the sum of its contributions across all T copies.
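The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration of the two equations, not a training-ready implementation; the dimensions, the random initialization scale, and the function name rnn_forward are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y, h0=None):
    """Run a vanilla RNN over xs of shape (T, input_dim).

    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim) if h0 is None else h0
    hs, ys = [], []
    for x_t in xs:  # the same weights are reused at every timestep
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
        ys.append(W_hy @ h + b_y)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2
params = dict(
    W_xh=rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
    W_hh=rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
    b_h=np.zeros(hidden_dim),
    W_hy=rng.normal(scale=0.5, size=(output_dim, hidden_dim)),
    b_y=np.zeros(output_dim),
)
xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, **params)

many_to_many = ys      # one output per timestep, shape (T, output_dim)
many_to_one = ys[-1]   # only the final output, shape (output_dim,)
```

The same forward pass serves both the aligned many-to-many and the many-to-one pattern; the only difference is which outputs feed the loss.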

Vanishing and exploding gradients

Because h_t depends on h_{t-1} through repeated multiplication by W_hh, gradients with respect to early timesteps involve products of Jacobian matrices, each of the form diag(1 - h_t²) · W_hh for a tanh cell. If the largest singular value of W_hh is below 1, every factor shrinks the gradient, so it vanishes exponentially and dependencies on early tokens cannot be learned. A largest singular value above 1 is necessary, though not sufficient, for gradients to explode and training to diverge. Sigmoid and tanh saturation pushes the derivative term toward zero and makes vanishing worse. Practical mitigations include gradient clipping, careful initialization (orthogonal W_hh), truncated BPTT (backprop only through the last k steps), and, most importantly, gated architectures.
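A small numerical experiment makes the effect concrete. The sketch below assumes a tanh cell driven by random inputs and a recurrent matrix built as a scaled orthogonal matrix, so its singular values are all equal to the chosen gain. The helper names and sizes are illustrative, and the exploding case is shown for the linear worst case (tanh derivative ignored).

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 16, 30
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # orthogonal: all singular values are 1

def tanh_jacobian_norm(W_hh, drive):
    """Spectral norm of d h_T / d h_0 for h_t = tanh(W_hh @ h_{t-1} + drive[t])."""
    h = np.zeros(W_hh.shape[0])
    J = np.eye(W_hh.shape[0])
    for d_t in drive:
        h = np.tanh(W_hh @ h + d_t)
        J = np.diag(1.0 - h**2) @ W_hh @ J  # one chain-rule factor per step
    return np.linalg.norm(J, 2)

drive = rng.normal(size=(steps, n))

# Gain 0.5: the product is bounded by 0.5**30, roughly 9e-10.
vanishing = tanh_jacobian_norm(0.5 * Q, drive)

# Gain 1.5 without the tanh derivative: the norm grows as 1.5**30, roughly 1.9e5.
linear_growth = np.linalg.norm(np.linalg.matrix_power(1.5 * Q, steps), 2)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g**2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

clipped, norm_before = clip_by_global_norm([np.full(3, 3.0), np.full(4, 4.0)], max_norm=1.0)
```

Clipping limits the damage from exploding gradients but does nothing for vanishing ones, which is why gating is the more fundamental fix.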

LSTM: long short-term memory with gates

Hochreiter and Schmidhuber’s LSTM (1997, popularized ~2014) introduces a separate cell state c_t that acts like a conveyor belt. In its now-standard form (the forget gate was added later by Gers et al., 2000), three gates control information flow; biases are omitted below for brevity:

Forget gate f_t = σ(W_f · [h_{t-1}, x_t]): what to erase from c_{t-1}.
Input gate i_t = σ(W_i · [h_{t-1}, x_t]) and candidate c̃_t = tanh(W_c · [h_{t-1}, x_t]): what new information to write.
Output gate o_t = σ(W_o · [h_{t-1}, x_t]): what part of the cell state becomes the visible hidden state h_t = o_t ⊙ tanh(c_t).

Cell update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t. Along this additive path the gradient from c_t to c_{t-1} is scaled elementwise by f_t rather than passed through W_hh and a tanh derivative, so when the forget gate stays near 1 the signal survives across many timesteps. This is the core fix for long-range dependencies. LSTMs became the workhorse for speech recognition, early neural MT, and time-series baselines.
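The gate equations translate almost directly into code. The sketch below is a minimal NumPy version of one LSTM step with the four weight blocks stacked into a single matrix W. That stacking order, the bias vector b, and the saturated bias values used in the demo are assumptions chosen to show the conveyor-belt behaviour, not a recommended initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to four stacked pre-activations (f, i, o, candidate)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b  # shape (4H,)
    f = sigmoid(z[0:H])              # forget gate
    i = sigmoid(z[H:2 * H])          # input gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    c_tilde = np.tanh(z[3 * H:4 * H])
    c = f * c_prev + i * c_tilde     # additive cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
b[0:H] = 20.0       # forget gate saturated near 1: keep the old cell state
b[H:2 * H] = -20.0  # input gate saturated near 0: write almost nothing

c0 = np.array([1.0, -2.0, 0.5, 3.0])
h, c = np.zeros(H), c0.copy()
for _ in range(100):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
# After 100 noisy inputs, c is still essentially c0.
```

A trained network learns when to open and close these gates; the demo only pins them to show that the cell state can carry information unchanged over many steps.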

Peephole and layer-normalized variants

Peephole connections let gates peek at the cell state directly. Layer normalization inside recurrent cells stabilizes training on long sequences. PyTorch’s nn.LSTM implements neither variant, but its defaults are sufficient for most projects; set batch_first=True when your tensors are (batch, seq, feature) instead of the default (seq, batch, feature).

GRU: a lighter gated recurrent unit

The GRU (Cho et al., 2014) merges forget and input gates into an update gate z_t and adds a reset gate r_t that controls how much past hidden state contributes to the candidate:

z_t = σ(W_z · [h_{t-1}, x_t])          # update: blend old vs new
r_t = σ(W_r · [h_{t-1}, x_t])          # reset: drop past when computing candidate
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

At equal hidden size a GRU has about 25% fewer parameters than an LSTM (three weight blocks instead of four, and no separate cell state), and it often matches LSTM accuracy on medium-length tasks while training faster. On very long sequences or noisy financial data, LSTMs sometimes edge ahead; the only reliable answer is cross-validation on your metric. A bidirectional GRU is a strong, compact baseline for tabular-sequence models before reaching for transformers.
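As a rough sketch, the four GRU equations above map onto a short NumPy function. The parameter count that follows assumes one weight block of shape (hidden, hidden + input) plus one bias vector per gate or candidate; library implementations may count differently (PyTorch, for instance, keeps two bias vectors per block), and the 64/128 sizes are only an example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations in the text (biases omitted)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)  # update gate
    r = sigmoid(W_r @ hx)  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
H, D = 4, 3
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(H, H + D)) for _ in range(3))
h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):
    h = gru_step(x_t, h, W_z, W_r, W_h)

def gated_param_count(n_blocks, input_dim, hidden_dim):
    """Parameters for n_blocks weight matrices plus one bias each (an assumed convention)."""
    return n_blocks * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)

lstm_params = gated_param_count(4, 64, 128)  # 98816
gru_params = gated_param_count(3, 64, 128)   # 74112
```

Under this counting convention the GRU needs exactly three quarters of the LSTM's parameters for a single layer of the same width.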

Stacked, bidirectional, and dropout in depth

Stacked RNNs feed the hidden outputs of layer l as inputs to layer l+1, letting higher layers learn increasingly abstract temporal features — analogous to CNN depth. Two to three layers are typical; deeper stacks need residual connections or they become hard to train.

Bidirectional RNNs run one RNN forward and another backward over the sequence, concatenating (or summing) their hidden states at each position. Context from both past and future helps tagging and encoding tasks. They are unsuitable for real-time streaming prediction where future tokens are unavailable unless you accept latency buffering.

Apply dropout between stacked layers. Avoid sampling a fresh dropout mask on the recurrent connection at every timestep; if you do regularize the recurrent path, use variational dropout, which reuses one mask across all timesteps of a sequence. Pack padded sequences with pack_padded_sequence in PyTorch so padding tokens do not waste compute or bias the hidden state.
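A minimal PyTorch sketch of packing with a stacked GRU is shown below. The tensor sizes, the dropout rate of 0.2, and the three toy sequences are assumptions for illustration only.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)

# Three variable-length sequences with 8 features per step (toy data).
seqs = [torch.randn(5, 8), torch.randn(3, 8), torch.randn(2, 8)]
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)  # (3, 5, 8), zero-padded

# The dropout argument applies between stacked layers, not after the last one.
gru = nn.GRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True, dropout=0.2)
gru.eval()  # disable dropout so the outputs below are deterministic

packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n[-1] holds the top layer's state at each sequence's last real timestep,
# so padding never touches the summary used by a many-to-one head.
top_final = h_n[-1]  # (3, 16)
```

With enforce_sorted=False the batch does not need to be pre-sorted by length, at the cost of an internal sort.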

Training practicalities: BPTT, truncation, and regularization

Full BPTT through 10,000 timesteps is memory-prohibitive. Truncated BPTT splits the sequence into chunks of length k. The hidden state is carried forward from one chunk to the next so the model still sees earlier context, but it is detached from the computation graph at each chunk boundary so gradients never flow back more than k steps. This trades some long-range gradient signal for feasible GPU memory.
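In PyTorch, truncated BPTT usually comes down to one detach call per chunk. The loop below is a sketch under assumed settings (a single synthetic sequence of 256 steps, k = 64, a unidirectional GRU with a per-step regression head); the data and hyperparameters are placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)
T, k, input_dim, hidden_dim = 256, 64, 4, 16
x = torch.randn(1, T, input_dim)  # one long synthetic sequence
y = torch.randn(1, T, 1)          # synthetic per-step targets

gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
head = nn.Linear(hidden_dim, 1)
params = list(gru.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

h = None  # the GRU starts from zeros when no state is passed
chunk_losses = []
for start in range(0, T, k):
    xb, yb = x[:, start:start + k], y[:, start:start + k]
    out, h = gru(xb, h)
    loss = nn.functional.mse_loss(head(out), yb)

    opt.zero_grad()
    loss.backward()  # gradients reach back at most k steps
    torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
    opt.step()

    h = h.detach()   # keep the state's value, cut the graph at the chunk boundary
    chunk_losses.append(loss.item())
```

Without the detach, the second backward call would try to traverse the first chunk's already-freed graph and fail, or, with retained graphs, memory would grow with sequence length.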

Teacher forcing during seq2seq training feeds the decoder the ground-truth previous token instead of its own prediction. This accelerates convergence but creates exposure bias: at inference the decoder must condition on its own, possibly wrong, outputs, which it never saw during training. Scheduled sampling gradually replaces gold tokens with model outputs. Modern MT uses transformers with teacher forcing applied to the full target in parallel; RNN decoders depended on these tricks more heavily.

Normalize inputs (zero mean, unit variance per feature channel). Use learning-rate warmup and gradient clipping (e.g. max norm 1.0). Monitor validation loss on held-out time ranges, not random row splits — temporal leakage inflates scores on financial and IoT data.
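The split and the normalization interact: scaler statistics should come from the training window only, or the holdout leaks into training. A minimal NumPy sketch follows, assuming fixed-length windows of shape (N, T, F) with one timestamp per window; the function names and the synthetic data are illustrative.

```python
import numpy as np

def temporal_split(timestamps, holdout_start):
    """Indices of samples before and from holdout_start onward."""
    ts = np.asarray(timestamps)
    train_idx = np.flatnonzero(ts < holdout_start)
    test_idx = np.flatnonzero(ts >= holdout_start)
    return train_idx, test_idx

def fit_scaler(X_train):
    """Per-feature mean and std over all samples and timesteps of X_train (N, T, F)."""
    mean = X_train.mean(axis=(0, 1))
    std = X_train.std(axis=(0, 1)) + 1e-8  # guard against constant channels
    return mean, std

rng = np.random.default_rng(0)
timestamps = np.arange(100)                           # e.g. day index per window
X = rng.normal(loc=5.0, scale=3.0, size=(100, 10, 4))

train_idx, test_idx = temporal_split(timestamps, holdout_start=80)
mean, std = fit_scaler(X[train_idx])                  # statistics from training data only
X_train = (X[train_idx] - mean) / std
X_test = (X[test_idx] - mean) / std                   # reuse the training statistics
```

For padded variable-length sequences the statistics would additionally need to mask out padding positions, which this sketch does not handle.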

Worked example: Harbor Support ticket escalation forecaster

Harbor Support routes enterprise tickets to Tier 1 or auto-escalates to Tier 3 engineers. Each ticket is a sequence of events: status changes, agent replies, customer messages, SLA timer ticks, and integration webhooks. A flat feature vector per ticket (last message sentiment, total reply count) missed escalation triggers buried early in the thread.

The team built a many-to-one bidirectional GRU:

Embed each event type (12 categories) and scalar features (hours since open, priority code).
Concatenate embeddings per timestep into a 64-dim input vector.
Two-layer BiGRU, 128 hidden units per direction; take final forward and backward states, concatenate.
Linear head outputs escalation probability within the next 4 hours.
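The four steps above could be sketched in PyTorch roughly as follows. This is not Harbor's actual code: the class name EscalationGRU, the 62 + 2 split of the 64-dim input, the reserved padding index 0, and the dropout rate are all assumptions made to produce a runnable example.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

class EscalationGRU(nn.Module):
    """Many-to-one BiGRU over ticket events (hypothetical sketch)."""

    def __init__(self, n_event_types=12, embed_dim=62, n_scalars=2,
                 hidden=128, layers=2, dropout=0.2):
        super().__init__()
        # Index 0 is reserved for padding, so real event types are 1..12 (assumption).
        self.embed = nn.Embedding(n_event_types + 1, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim + n_scalars, hidden, num_layers=layers,
                          batch_first=True, bidirectional=True, dropout=dropout)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, event_ids, scalars, lengths):
        # 62-dim embedding + 2 scalar features = 64-dim input per timestep.
        x = torch.cat([self.embed(event_ids), scalars], dim=-1)
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, h_n = self.gru(packed)  # (layers * 2, batch, hidden)
        # Top layer: h_n[-2] is the forward direction, h_n[-1] the backward one.
        final = torch.cat([h_n[-2], h_n[-1]], dim=-1)
        return self.head(final).squeeze(-1)  # logits; apply sigmoid for probabilities

torch.manual_seed(0)
model = EscalationGRU().eval()
lengths = torch.tensor([10, 7, 3, 1])
event_ids = torch.randint(1, 13, (4, 10))
scalars = torch.randn(4, 10, 2)  # e.g. hours since open, priority code
with torch.no_grad():
    probs = torch.sigmoid(model(event_ids, scalars, lengths))
```

Because the sequences are packed, values in padded positions have no effect on the prediction, and the backward direction only ever reads events that have already occurred in the thread.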

Training data: 240k resolved tickets, time-based split (last 90 days holdout). Truncated BPTT with chunk length 64; median thread length 18 events, p95 = 112. Class weighting for rare escalations (3.2% positive rate). AUC 0.89 vs 0.81 for XGBoost on aggregated tabular features. Inference: 6 ms CPU per ticket on p95 length — fast enough for the routing webhook. They retrain weekly; hidden states are not persisted across tickets.
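The source does not say which weighting scheme was used. One common choice, shown here purely as an assumption, is to weight positives by the ratio of negatives to positives, which for a 3.2% positive rate gives roughly 30. The NumPy sketch below mirrors the semantics of a positive-class weight in a binary cross-entropy on logits; the function names are illustrative.

```python
import numpy as np

def balanced_pos_weight(positive_rate):
    """Weight for positives so both classes contribute equally in expectation."""
    return (1.0 - positive_rate) / positive_rate

def weighted_bce(logits, labels, pos_weight):
    """Mean binary cross-entropy on logits with positives scaled by pos_weight."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # log(sigmoid(z)) and log(1 - sigmoid(z)) in a numerically stable form
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    loss = -(pos_weight * labels * log_p + (1.0 - labels) * log_not_p)
    return loss.mean()

w = balanced_pos_weight(0.032)  # about 30.25 for a 3.2% positive rate

missed_escalation = weighted_bce([-2.0], [1.0], w)  # confident "no" on a true positive
false_alarm = weighted_bce([2.0], [0.0], w)         # confident "yes" on a true negative
```

With this weighting a missed escalation costs about thirty times as much as an equally confident false alarm, so the weight also shifts the calibration of the output probabilities and thresholds should be tuned on the holdout.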

When Harbor later experimented with a small fine-tuned encoder-only transformer on the same labels, AUC reached 0.91 but latency jumped to 180 ms without GPU. The GRU stays in production for real-time routing; the transformer handles overnight batch risk scoring.

Architecture decision table: RNN vs CNN vs transformer

Approach	Best when	Watch out for
Vanilla RNN	Teaching, tiny sequences (<10 steps), embedded devices with extreme parameter budgets	Long-range learning; almost always prefer GRU/LSTM
LSTM / GRU	Streaming sensors, modest text/speech, on-device seq models, <500 tokens, tight CPU latency	Sequential inference; limited parallelization vs transformers
1D CNN / TCN	Fixed receptive fields, local patterns, very fast GPU inference on uniform-length windows	Global context needs deep stacks or dilation
Transformer	Long context, web-scale pretraining, parallel training, SOTA NLP/vision-language	O(n²) attention memory; overkill for short tabular event streams
State-space (Mamba)	Very long sequences needing linear-time recurrence with modern quality	Ecosystem maturity vs transformers; see SSM guide

Practical rule: start with a BiGRU baseline on engineered per-step features when sequences are short, labels are scarce, and you need sub-10 ms CPU inference. Reach for transformers when you have large text corpora, need transfer learning from foundation models, or context exceeds a few hundred tokens. See the guides on attention mechanisms and state space models for the post-RNN landscape.

Common pitfalls

Random train/test splits on time series: future data leaks into training; split by time or entity cohort.
Padding without packing: padded zeros pollute final hidden states; use pack_padded_sequence.
Using final hidden state on padded batches: index the last real timestep per sequence.
Expecting vanilla RNN to learn 500-step dependencies: use LSTM/GRU or shorten the horizon.
Bidirectional models for live forecasting: future data is not available at serve time, so the backward pass may only cover events already observed (as in the Harbor example, which encodes the thread so far).
Ignoring class imbalance: rare events need weighted loss or oversampling; accuracy misleads.
Teacher forcing without exposure-bias mitigation: seq2seq decoders can drift once they condition on their own mistakes at inference.
Replacing RNNs with transformers by default: higher latency, data hunger, and ops cost for marginal gains on short streams.
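For the padded-batch pitfall, selecting the last real timestep is a one-line gather once the true lengths are known. The NumPy sketch below assumes per-step hidden states from a unidirectional model in (batch, time, hidden) layout; the function name is illustrative.

```python
import numpy as np

def last_real_states(outputs, lengths):
    """Pick each sequence's hidden state at its last real timestep.

    outputs: (B, T, H) padded hidden states; lengths: (B,) true lengths, each >= 1.
    """
    outputs = np.asarray(outputs)
    idx = np.asarray(lengths) - 1
    return outputs[np.arange(outputs.shape[0]), idx]

# Toy batch: 2 sequences, 4 timesteps, 3 hidden units; the second has only 2 real steps.
outputs = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
final = last_real_states(outputs, lengths=[4, 2])
naive = outputs[:, -1]  # wrong for the second sequence: reads a padded position
```

For the backward direction of a bidirectional model the summary state sits at timestep 0 instead, so the two directions need different indices.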

Production checklist

Map the problem to a sequence I/O pattern (many-to-one, seq2seq, etc.).
Engineer per-timestep features; verify order is preserved end-to-end.
Split data temporally; document the holdout window.
Baseline with BiGRU (2 layers, 128 units) before transformers.
Use packed sequences, dropout between layers, gradient clipping.
Tune hidden size and layers with early stopping on domain metric (AUC, MAE).
Benchmark CPU p95 latency on longest expected sequences.
Version embedding tables and scalers with the checkpoint.
Monitor concept drift; schedule retrains when event vocabulary shifts.
Document why recurrence was chosen over attention for auditability.
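For the latency item, a small timing harness is often enough to get a first p95 estimate. The sketch below uses only the standard library; fn stands for whatever callable wraps a single prediction on a longest-expected sequence, and the workload in the usage line is a placeholder rather than a real model.

```python
import time

def p95_latency_ms(fn, n_runs=200, warmup=20):
    """Approximate 95th-percentile wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # let caches and lazy initialization settle
        fn()
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[int(0.95 * (len(samples) - 1))]  # simple rank-based estimate

# Placeholder workload; replace with a call into the model under test.
latency = p95_latency_ms(lambda: sum(range(1000)))
```

Run the benchmark on the production CPU type with realistic thread settings, since laptop numbers rarely transfer.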

Key takeaways

RNNs maintain hidden state across timesteps, the natural fit for ordered data where context accumulates.
Vanilla RNNs suffer vanishing gradients; LSTM and GRU gates enable learning over longer spans.
BPTT trains unrolled graphs; truncation and clipping make long sequences feasible.
Bidirectional encoding helps labeling but not live forecasting of unseen steps; match the architecture to the information available at serve time.
Transformers dominate large-scale NLP, but gated RNNs remain fast, data-efficient baselines on short event streams.