Recurrent neural networks (RNN, LSTM, GRU) explained
A support agent reads a ticket thread from top to bottom, updating her mental model with each new message. The order matters: “refund approved” followed by “chargeback filed” tells a different story than the reverse. Feed-forward networks treat each input independently; they have no memory of what came before. Recurrent neural networks (RNNs) solve that by maintaining a hidden state that flows from timestep to timestep, letting the model condition predictions on prior context. Vanilla RNNs are elegant but struggle with long dependencies because gradients shrink across many steps — the vanishing gradient problem. LSTM and GRU cells add gating mechanisms that selectively remember and forget, dominating sequence modeling from 2014 until transformers took over large-scale language modeling. RNNs remain valuable for streaming sensor data, lightweight on-device models, and problems where sequence length is modest and latency budgets are tight. This guide covers recurrence mechanics, backpropagation through time (BPTT), LSTM and GRU internals, bidirectional and stacked architectures, sequence I/O patterns, a Harbor Support escalation forecaster worked example, an architecture decision table, pitfalls, and a production checklist.
Why sequences need recurrence
Many real inputs are ordered sequences: words in a sentence, clicks in a session, heartbeats in an ECG, price ticks in a trading day. Shuffling the order destroys meaning. A bag-of-words representation loses “not good” vs “good” and cannot capture temporal patterns like “user browsed three pages then abandoned cart.”
RNNs process one element x_t per timestep t and update a hidden vector h_t that summarizes everything seen so far. The network can emit an output y_t at every step (many-to-many), only at the end (many-to-one), or as a whole sequence generated from a single seed input (one-to-many). That flexibility made RNNs the default for machine translation, speech recognition, and time-series forecasting before attention-based models scaled to billions of parameters.
Sequence I/O patterns
- Many-to-one — read a full review, output sentiment score; read a sensor window, output failure probability.
- One-to-many — image captioning: one CNN embedding seeds an RNN that generates word by word.
- Many-to-many (aligned) — part-of-speech tagging: one label per input token.
- Many-to-many (encoder-decoder) — translation: encoder RNN compresses source, decoder RNN generates target (often with attention bridging the gap).
Vanilla RNN: hidden state recurrence
The simplest recurrent cell combines the current input with the previous hidden state through shared weight matrices:
h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)
y_t = W_hy · h_t + b_y
The same weights are applied at every timestep: weight sharing across time, analogous to kernel sharing in CNNs across space. The hidden state h_t is a fixed-size summary of the prefix x_1..x_t. During training, the network unrolls the recurrence for T steps and applies backpropagation through time (BPTT): gradients flow from the loss at step T backward through each unrolled copy of the cell.
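The recurrence can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the dimensions, the random weights, and the zero initial state are all assumptions chosen for the example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a vanilla RNN over xs of shape (T, input_dim).

    Returns the per-step outputs (T, output_dim) and the final hidden state.
    """
    h = np.zeros(W_hh.shape[0])           # h_0 = 0 (an assumption; it can also be learned)
    ys = []
    for x_t in xs:                         # the same weights are reused at every timestep
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        ys.append(W_hy @ h + b_y)
    return np.stack(ys), h

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, output_dim = 5, 3, 4, 2   # illustrative sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_h, b_y = np.zeros(hidden_dim), np.zeros(output_dim)

xs = rng.normal(size=(T, input_dim))
ys, h_T = rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y)

many_to_many = ys         # one output per input step, shape (T, output_dim)
many_to_one = ys[-1]      # only the output at the final step, shape (output_dim,)
```

Feeding the same inputs in reverse order yields a different final state, which is the order sensitivity that a bag-of-features model lacks.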
Vanishing and exploding gradients
Because h_t depends on h_{t-1} through repeated multiplication by W_hh, gradients with respect to early timesteps involve products of Jacobian matrices, each of the form diag(tanh'(·)) · W_hh. If the largest singular value of W_hh is consistently below 1, these products shrink exponentially: gradients vanish and dependencies on early tokens cannot be learned. Above 1, the products can grow exponentially and training diverges. Saturated sigmoid and tanh units have near-zero derivatives, which shrinks the Jacobians further and makes vanishing worse. Practical mitigations include gradient clipping, careful initialization (orthogonal W_hh), truncated BPTT (backpropagating only through the last k steps), and, most importantly, gated architectures.
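A small numerical sketch makes the exponential behavior concrete. It assumes a linear recurrence (the tanh derivative is ignored, and it could only shrink gradients further) and a W_hh built as a scaled orthogonal matrix, so every singular value equals the scale. The helper names are hypothetical.

```python
import numpy as np

def backprop_gain(scale, steps, hidden_dim=8, seed=0):
    """Spectral norm of a product of `steps` identical Jacobians W_hh^T,
    where W_hh = scale * Q and Q is a random orthogonal matrix."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(hidden_dim, hidden_dim)))
    W_hh = scale * q
    J = np.eye(hidden_dim)
    for _ in range(steps):
        J = W_hh.T @ J                    # one step of backprop through time
    return np.linalg.norm(J, ord=2)

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    factor = min(1.0, max_norm / (total + 1e-12))
    return [g * factor for g in grads]

vanishing = backprop_gain(0.9, steps=50)   # equals 0.9**50, about 0.0052
exploding = backprop_gain(1.1, steps=50)   # equals 1.1**50, about 117.4
clipped = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)   # norm 5 -> norm 1
```

Clipping bounds the update size when gradients explode, but it cannot restore a signal that has already vanished, which is why gating matters more for long-range learning.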
LSTM: long short-term memory with gates
Hochreiter and Schmidhuber’s LSTM (1997, popularized ~2014) introduces a separate cell state c_t that acts like a conveyor belt, plus three gates controlling information flow:
- Forget gate f_t = σ(W_f · [h_{t-1}, x_t]): what to erase from c_{t-1}.
- Input gate i_t = σ(W_i · [h_{t-1}, x_t]) and candidate c̃_t = tanh(W_c · [h_{t-1}, x_t]): what new information to write.
- Output gate o_t = σ(W_o · [h_{t-1}, x_t]): what part of the cell state becomes the visible hidden state h_t = o_t ⊙ tanh(c_t).
Cell update: c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t.
The additive path through c_t lets gradients flow across many timesteps without repeated destructive multiplication, which is the core fix for long-range dependencies. LSTMs became the workhorse for speech (Deep Speech), early neural MT, and time-series baselines.
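The gate equations translate almost line for line into NumPy. The sketch below follows the formulas as written in this section (bias terms omitted); the sizes and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One LSTM step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ hx)                # forget gate
    i_t = sigmoid(W_i @ hx)                # input gate
    c_tilde = np.tanh(W_c @ hx)            # candidate cell content
    o_t = sigmoid(W_o @ hx)                # output gate
    c_t = f_t * c_prev + i_t * c_tilde     # additive cell update
    h_t = o_t * np.tanh(c_t)               # visible hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4               # illustrative sizes
W_f, W_i, W_c, W_o = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4))

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):   # a 6-step synthetic sequence
    h, c = lstm_step(x_t, h, c, W_f, W_i, W_c, W_o)
```

Note that c_t is never squashed between steps: only the gated read-out h_t passes through tanh, so the cell state can carry a value across many steps when f_t stays near 1.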
Peephole and layer-normalized variants
Peephole connections let gates peek at the cell state directly.
Layer normalization inside recurrent cells stabilizes training on
long sequences. PyTorch’s nn.LSTM defaults are sufficient for most
projects; enable batch_first=True when your tensors are
(batch, seq, feature) instead of (seq, batch, feature).
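A quick shape check shows what batch_first=True changes and what it does not. The sizes here are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=1, batch_first=True)

x = torch.randn(4, 10, 16)                 # (batch, seq, feature)
output, (h_n, c_n) = lstm(x)

# output follows batch_first: (batch, seq, hidden) = (4, 10, 32).
# h_n and c_n do not: they stay (num_layers * num_directions, batch, hidden) = (1, 4, 32).
last_step = output[:, -1]                  # equals h_n[0] for a unidirectional, unpadded batch
```

The final states keeping their layer-first layout is a common source of shape bugs when a classifier head is attached.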
GRU: a lighter gated recurrent unit
The GRU (Cho et al., 2014) merges forget and input gates into an update gate z_t and adds a reset gate r_t that controls how much past hidden state contributes to the candidate:
z_t = σ(W_z · [h_{t-1}, x_t]) # update: blend old vs new
r_t = σ(W_r · [h_{t-1}, x_t]) # reset: drop past when computing candidate
h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t])
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
GRUs have fewer parameters than LSTMs (three sets of gate weights instead of four, and no separate cell state) and often match LSTM accuracy on medium-length tasks while training faster. On very long sequences or noisy financial data, LSTMs sometimes edge ahead; the only reliable way to choose is to validate both on your own metric and data. Many production tabular-sequence models use a bidirectional GRU as a strong, compact baseline before reaching for transformers.
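The GRU equations and the parameter comparison can be sketched as follows. The step function mirrors the formulas in this section (bias terms omitted), and the weight count covers only the gate matrices for one layer; the 64 and 128 sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                   # update gate
    r_t = sigmoid(W_r @ hx)                                   # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return (1.0 - z_t) * h_prev + z_t * h_tilde               # blend old and new

def gate_weight_count(num_blocks, input_dim, hidden_dim):
    """Weights in `num_blocks` matrices of shape (hidden, hidden + input)."""
    return num_blocks * hidden_dim * (hidden_dim + input_dim)

gru_weights = gate_weight_count(3, input_dim=64, hidden_dim=128)    # 73,728
lstm_weights = gate_weight_count(4, input_dim=64, hidden_dim=128)   # 98,304
```

At equal hidden size the GRU carries three quarters of the LSTM's gate weights, which is where most of its speed and memory advantage comes from.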
Stacked, bidirectional, and dropout in depth
Stacked RNNs feed the hidden outputs of layer l as inputs
to layer l+1, letting higher layers learn increasingly abstract temporal
features — analogous to CNN depth. Two to three layers are typical; deeper stacks
need residual connections or they become hard to train.
Bidirectional RNNs run one RNN forward and another backward over the sequence, concatenating (or summing) their hidden states at each position. Context from both past and future helps tagging and encoding tasks. They are unsuitable for real-time streaming prediction where future tokens are unavailable unless you accept latency buffering.
Apply dropout between stacked layers rather than on the recurrent connections within a layer; if you do want recurrent dropout, use variational dropout, which reuses one mask across all timesteps. Pack padded sequences with pack_padded_sequence in PyTorch so padding tokens do not waste compute or bias the hidden state.
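Stacking, bidirectionality, inter-layer dropout, and packing can all be combined in one nn.GRU call. The sketch below uses small, arbitrary sizes and synthetic data, so treat the numbers as placeholders.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# dropout here is applied between the two layers, not inside the recurrence.
gru = nn.GRU(input_size=8, hidden_size=16, num_layers=2,
             dropout=0.2, bidirectional=True, batch_first=True)

lengths = torch.tensor([5, 3, 2])                  # real length of each sequence
x = torch.randn(3, 5, 8)                           # padded to the max length, 5
for i, n in enumerate(lengths.tolist()):
    x[i, n:] = 0.0                                 # zero out the padding positions

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n has shape (num_layers * 2, batch, hidden). Its last two entries are the
# top layer's final forward state and final backward state.
summary = torch.cat([h_n[-2], h_n[-1]], dim=1)     # (batch, 2 * hidden)
```

With packing, the forward final state comes from each sequence's last real step and the backward final state from its first step, so padding never touches either one.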
Training practicalities: BPTT, truncation, and regularization
Full BPTT through 10,000 timesteps is memory-prohibitive. Truncated BPTT splits the sequence into chunks of length k and detaches the hidden state between chunks, so gradients stop at each chunk boundary while the state values still carry forward and every chunk starts from where the previous one ended. This trades some long-range gradient signal for feasible GPU memory.
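A truncated BPTT loop in PyTorch might look like the sketch below. The model, the synthetic data, the chunk length of 25, and the SGD settings are all assumptions for illustration; the essential line is the detach() call.

```python
import torch
from torch import nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(gru.parameters()) + list(head.parameters())
opt = torch.optim.SGD(params, lr=0.01)

x = torch.randn(2, 100, 4)          # two long synthetic sequences
y = torch.randn(2, 100, 1)          # one regression target per step
k = 25                              # chunk length
h = None                            # nn.GRU treats None as a zero initial state
losses = []

for start in range(0, x.size(1), k):
    out, h = gru(x[:, start:start + k], h)
    loss = nn.functional.mse_loss(head(out), y[:, start:start + k])
    opt.zero_grad()
    loss.backward()                 # gradients stop at the chunk boundary
    nn.utils.clip_grad_norm_(params, max_norm=1.0)
    opt.step()
    h = h.detach()                  # keep the state values, drop the autograd graph
    losses.append(loss.item())
```

Without the detach, the second backward() call would try to traverse the previous chunk's already-freed graph and raise an error, and memory would grow with the full sequence length.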
Teacher forcing during seq2seq training feeds the decoder the ground-truth previous token instead of its own prediction, accelerating convergence but creating exposure bias at inference. Scheduled sampling gradually replaces gold tokens with model outputs. Modern MT uses transformers with teacher forcing on the full target in parallel; RNN decoders needed these tricks more desperately.
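Scheduled sampling comes down to a per-step coin flip whose bias changes over training. The linear decay below is one hypothetical schedule among several (inverse-sigmoid and exponential decays are also used), and both function names are invented for this sketch.

```python
import random

def teacher_forcing_ratio(epoch, total_epochs, floor=0.0):
    """Probability of feeding the gold token, decayed linearly from 1.0 to `floor`."""
    frac = min(1.0, epoch / max(1, total_epochs - 1))
    return max(floor, 1.0 - frac)

def next_decoder_input(gold_prev, predicted_prev, ratio, rng=random):
    """Pick the decoder's next input: gold token with probability `ratio`."""
    return gold_prev if rng.random() < ratio else predicted_prev

rng = random.Random(0)
ratio = teacher_forcing_ratio(epoch=5, total_epochs=11)        # halfway: 0.5
token = next_decoder_input("gold", "predicted", ratio, rng)    # either one, by chance
```

A ratio of 1.0 is pure teacher forcing and 0.0 is fully free-running decoding, which matches inference conditions.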
Normalize inputs (zero mean, unit variance per feature channel). Use learning-rate warmup and gradient clipping (e.g. max norm 1.0). Monitor validation loss on held-out time ranges, not random row splits — temporal leakage inflates scores on financial and IoT data.
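The split and the normalization interact: scaler statistics must come from the training window only. A minimal NumPy sketch, using synthetic data and hypothetical helper names:

```python
import numpy as np

def temporal_split(timestamps, holdout_fraction=0.2):
    """Return (train_idx, val_idx) where validation is the most recent slice."""
    order = np.argsort(timestamps, kind="stable")
    n_val = int(round(len(order) * holdout_fraction))
    cut = len(order) - n_val
    return order[:cut], order[cut:]

def fit_scaler(train_x):
    """Per-channel mean and std over all training sequences and steps; train_x is (N, T, F)."""
    flat = train_x.reshape(-1, train_x.shape[-1])
    return flat.mean(axis=0), flat.std(axis=0) + 1e-8

rng = np.random.default_rng(0)
timestamps = rng.permutation(100)                    # synthetic, shuffled event times
x = rng.normal(loc=5.0, scale=3.0, size=(100, 10, 4))

train_idx, val_idx = temporal_split(timestamps)
mean, std = fit_scaler(x[train_idx])
x_train = (x[train_idx] - mean) / std
x_val = (x[val_idx] - mean) / std                    # reuse training statistics, never refit
```

Fitting the scaler on the full dataset would leak information about the holdout period into training, the same kind of leakage a random row split causes.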
Worked example: Harbor Support ticket escalation forecaster
Harbor Support routes enterprise tickets to Tier 1 or auto-escalates to Tier 3 engineers. Each ticket is a sequence of events: status changes, agent replies, customer messages, SLA timer ticks, and integration webhooks. A flat feature vector per ticket (last message sentiment, total reply count) missed escalation triggers buried early in the thread.
The team built a many-to-one bidirectional GRU:
- Embed each event type (12 categories) and scalar features (hours since open, priority code).
- Concatenate embeddings per timestep into a 64-dim input vector.
- Two-layer BiGRU, 128 hidden units per direction; take final forward and backward states, concatenate.
- Linear head outputs escalation probability within the next 4 hours.
Training data: 240k resolved tickets, time-based split (last 90 days holdout). Truncated BPTT with chunk length 64; median thread length 18 events, p95 = 112. Class weighting for rare escalations (3.2% positive rate). AUC 0.89 vs 0.81 for XGBoost on aggregated tabular features. Inference: 6 ms CPU per ticket on p95 length — fast enough for the routing webhook. They retrain weekly; hidden states are not persisted across tickets.
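One way such a model could be laid out in PyTorch is sketched below. Only the 12 event types, the 64-dim step input, the two-layer BiGRU with 128 units per direction, and the 3.2% positive rate come from the description above. The class name, the 48/16 split of the input vector, the padding id, and the dropout rate are assumptions, not Harbor's actual code.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence

class EscalationForecaster(nn.Module):      # hypothetical name
    def __init__(self, num_event_types=12, num_scalars=2,
                 event_dim=48, scalar_dim=16, hidden=128):
        super().__init__()
        # Id 0 is reserved for padding, so real event ids run 1..12 (assumption).
        self.event_embed = nn.Embedding(num_event_types + 1, event_dim, padding_idx=0)
        self.scalar_proj = nn.Linear(num_scalars, scalar_dim)
        self.gru = nn.GRU(event_dim + scalar_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True, dropout=0.1)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, event_ids, scalars, lengths):
        # 48 + 16 = 64-dim input per timestep
        x = torch.cat([self.event_embed(event_ids), self.scalar_proj(scalars)], dim=-1)
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, h_n = self.gru(packed)
        summary = torch.cat([h_n[-2], h_n[-1]], dim=1)   # top-layer forward + backward
        return self.head(summary).squeeze(-1)            # logits, one per ticket

# Weight positives by the negative/positive ratio: 0.968 / 0.032 = 30.25.
pos_weight = torch.tensor([(1 - 0.032) / 0.032])
loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

torch.manual_seed(0)
model = EscalationForecaster()
lengths = torch.tensor([18, 112, 5])                     # synthetic thread lengths
max_len = int(lengths.max())
event_ids = torch.randint(1, 13, (3, max_len))
scalars = torch.randn(3, max_len, 2)                     # hours since open, priority code
for i, n in enumerate(lengths.tolist()):
    event_ids[i, n:] = 0                                 # padding id
    scalars[i, n:] = 0.0

logits = model(event_ids, scalars, lengths)
loss = loss_fn(logits, torch.tensor([0.0, 1.0, 0.0]))
probs = torch.sigmoid(logits)                            # escalation probabilities
```

The model is bidirectional over the events observed so far, which is compatible with real-time routing because the prediction is made once per ticket snapshot rather than per streamed event.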
When Harbor later experimented with a small fine-tuned encoder-only transformer on the same labels, AUC reached 0.91 but latency jumped to 180 ms without GPU. The GRU stays in production for real-time routing; the transformer handles overnight batch risk scoring.
Architecture decision table: RNN vs CNN vs transformer
| Approach | Best when | Watch out for |
|---|---|---|
| Vanilla RNN | Teaching, tiny sequences (<10 steps), embedded devices with extreme parameter budgets | Long-range learning; almost always prefer GRU/LSTM |
| LSTM / GRU | Streaming sensors, modest text/speech, on-device seq models, <500 tokens, tight CPU latency | Sequential inference; limited parallelization vs transformers |
| 1D CNN / TCN | Fixed receptive fields, local patterns, very fast GPU inference on uniform-length windows | Global context needs deep stacks or dilation |
| Transformer | Long context, web-scale pretraining, parallel training, SOTA NLP/vision-language | O(n²) attention memory; overkill for short tabular event streams |
| State-space (Mamba) | Very long sequences needing linear-time recurrence with modern quality | Ecosystem maturity vs transformers; see SSM guide |
Practical rule: start with a BiGRU baseline on engineered per-step features when sequences are short, labels are scarce, and you need sub-10 ms CPU inference. Reach for transformers when you have large text corpora, need transfer learning from foundation models, or context exceeds a few hundred tokens. See attention mechanism and state space models for the post-RNN landscape.
Common pitfalls
- Random train/test splits on time series — future leakage; split by time or entity cohort.
- Padding without packing — padded zeros pollute final hidden states; use pack_padded_sequence.
- Using final hidden state on padded batches — index the last real timestep per sequence.
- Expecting vanilla RNN to learn 500-step dependencies — use LSTM/GRU or shorten the horizon.
- Bidirectional models for live forecasting — future data is not available at serve time.
- Ignoring class imbalance — rare events need weighted loss or oversampling; accuracy misleads.
- Teacher forcing without exposure-bias mitigation — seq2seq decoders degenerate at inference.
- Replacing RNNs with transformers by default — higher latency, data hunger, and ops cost for marginal gains on short streams.
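For the padded-batch pitfall, selecting the last real timestep is a single indexing operation. The sketch below assumes padded per-step outputs from a unidirectional model (or the forward half of a bidirectional one) and uses a toy array so the selected rows are easy to verify by eye.

```python
import numpy as np

def last_real_step(outputs, lengths):
    """outputs: (batch, max_len, hidden), padded; lengths: (batch,) real lengths."""
    batch_idx = np.arange(outputs.shape[0])
    return outputs[batch_idx, lengths - 1]

outputs = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)   # toy "hidden states"
lengths = np.array([4, 2])                                      # second sequence is padded

picked = last_real_step(outputs, lengths)
# picked[0] is outputs[0, 3]; picked[1] is outputs[1, 1], not the padded outputs[1, 3]
```

Taking outputs[:, -1] instead would read a padding position for every sequence shorter than the batch maximum.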
Production checklist
- Map the problem to a sequence I/O pattern (many-to-one, seq2seq, etc.).
- Engineer per-timestep features; verify order is preserved end-to-end.
- Split data temporally; document the holdout window.
- Baseline with BiGRU (2 layers, 128 units) before transformers.
- Use packed sequences, dropout between layers, gradient clipping.
- Tune hidden size and layers with early stopping on domain metric (AUC, MAE).
- Benchmark CPU p95 latency on longest expected sequences.
- Version embedding tables and scalers with the checkpoint.
- Monitor concept drift; schedule retrains when event vocabulary shifts.
- Document why recurrence was chosen over attention for auditability.
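The latency item in the checklist can be measured with a small standard-library harness. This is a rough sketch: the function name is hypothetical, and the stand-in workload should be replaced by a call to the real model on the longest expected sequences.

```python
import statistics
import time

def p95_latency_ms(fn, inputs, warmup=5):
    """Time fn(x) for each input and return the 95th-percentile latency in ms."""
    for x in inputs[:warmup]:              # warm caches before measuring
        fn(x)
    samples = []
    for x in inputs:
        t0 = time.perf_counter()
        fn(x)
        samples.append((time.perf_counter() - t0) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(samples, n=20)[-1]

# Stand-in workload; swap in a model forward pass on worst-case inputs.
latency = p95_latency_ms(lambda n: sum(range(n)), [10_000] * 50)
```

Run the benchmark on the same hardware class and thread settings as production, since CPU inference times vary widely between machines.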
Key takeaways
- RNNs maintain hidden state across timesteps — the natural fit for ordered data where context accumulates.
- Vanilla RNNs suffer vanishing gradients — LSTM and GRU gates enable learning over longer spans.
- BPTT trains unrolled graphs — truncation and clipping make long sequences feasible.
- Bidirectional encoding helps labeling; not live forecasting — match architecture to serve-time information.
- Transformers dominate large-scale NLP — but gated RNNs remain fast, data-efficient baselines on short event streams.
Related reading
- Transformer architecture explained — how self-attention replaced recurrence at scale
- Backpropagation explained — gradient flow through unrolled recurrent graphs
- Vanishing and exploding gradients explained — why depth and recurrence need careful training
- PyTorch fundamentals explained — nn.LSTM, nn.GRU, and packed sequences