Mixture of experts (MoE) explained
Harbor Support routed tier-2 tickets through a 70B-class MoE model with 8 experts per layer and top-2 activation. On paper that is 4× the FFN capacity of a dense 18B model at similar per-token FLOPs, but p95 latency spiked when 60% of tokens landed on the same two experts while four GPUs sat idle. Tuning the load-balancing aux loss, capacity factor, and expert-parallel sharding cut tail latency by 38% without retraining from scratch.
A mixture of experts (MoE) replaces each transformer FFN block with a bank of parallel “expert” MLPs and a router that activates only the top-k experts per token. Total parameters scale up while active parameters per forward pass stay bounded, which is the core trick behind Mixtral, DeepSeek-V2/V3, and Switch-style sparse transformers.
This guide covers router math, training stability, expert parallelism in distributed training, the Harbor Support gateway refactor, a decision table against dense stacks and external model routing, pitfalls, and a production checklist, alongside our transformer architecture guide and LLM scaling laws guide.
What MoE changes in a transformer block
A standard decoder block is: self-attention → residual → FFN (two linear layers with activation) → residual. In MoE, the single FFN becomes N expert FFNs plus a gating network (often a learned linear map from hidden state to N logits). For each token's hidden vector h:
- Compute router logits g = W_r h (sometimes with noise during training).
- Select top-k experts (typically k = 1 or 2).
- Softmax only over selected experts (or full softmax with masking) to get weights α_i.
- Output Σ_{i ∈ top-k} α_i · Expert_i(h).
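The four routing steps above can be sketched for a single token in plain Python. This is a minimal illustration, not any library's implementation: `route_token`, the toy scaling "experts", and the 4×2 router matrix are all made up for the example, and the softmax is taken over the selected logits only.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(h, router_weights, experts, k=2):
    """Route one token: logits g = W_r h, keep top-k, renormalize, mix expert outputs."""
    # Router logits: one dot product per expert row of W_r.
    logits = [sum(w * x for w, x in zip(row, h)) for row in router_weights]
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    alphas = softmax([logits[i] for i in top])
    out = [0.0] * len(h)
    for alpha, i in zip(alphas, top):
        y = experts[i](h)  # only the selected experts are evaluated
        out = [o + alpha * yi for o, yi in zip(out, y)]
    return out, dict(zip(top, alphas))

# Toy setup (hypothetical): 4 "experts" that just scale the input.
experts = [lambda h, s=s: [s * x for x in h] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
out, weights = route_token([0.5, 1.5], router_weights, experts, k=2)
print(sorted(weights))  # ids of the two selected experts
```

Only the two selected experts are called, which is the whole point: the other expert weights exist but cost no FLOPs for this token.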
Active parameters per token are roughly k/N of the total expert FFN parameters plus the attention weights (unchanged). Total parameters scale with N. That is why a “70B MoE” headline often means ~14B active: marketing counts total weights, while inference cost tracks active FLOPs and memory touched.
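The total-versus-active arithmetic can be written out directly. The split below is hypothetical, chosen only so the numbers land on a 70B total and 14B active; it assumes 16 experts with top-2 routing and does not describe any particular checkpoint.

```python
def moe_param_counts(always_on_params, params_per_expert, n_experts, top_k):
    """Return (total, active) parameter counts for a model with MoE FFNs.

    always_on_params: parameters every token uses (attention, embeddings, norms)
    params_per_expert: parameters of one expert, summed over all MoE layers
    """
    total = always_on_params + n_experts * params_per_expert
    active = always_on_params + top_k * params_per_expert
    return total, active

# Hypothetical split, for illustration only.
total, active = moe_param_counts(
    always_on_params=6_000_000_000,
    params_per_expert=4_000_000_000,
    n_experts=16,
    top_k=2,
)
print(f"total={total / 1e9:.0f}B active={active / 1e9:.0f}B")  # total=70B active=14B
```

Because the always-on parameters count in full, the active share is always somewhat larger than k/N of the total.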
Dense FFN vs sparse MoE
Dense models use every parameter on every token. MoE trades conditional computation for capacity: different experts specialize (syntax vs numbers vs tool JSON, informally) without running all MLP weights each step. Attention layers usually stay dense; MoE almost always targets the FFN because it dominates parameter count in large transformers.
Routing, load balancing, and training stability
Without constraints, routers collapse: one expert wins every token, others never train (“dead experts”). Papers and production stacks add:
- Auxiliary load-balancing loss: encourage uniform expert utilization. Switch Transformer (and Mixtral-style implementations that follow it) penalizes N · Σ_i f_i · P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is its mean router probability; the earlier sparsely-gated MoE formulation instead penalizes the squared coefficient of variation of per-expert importance and load per batch.
- Router z-loss: penalize large router logit magnitudes so softmax does not saturate (introduced for MoE routers in ST-MoE; PaLM applies a similar z-loss to its output logits).
- Capacity factor: each expert accepts at most capacity_factor × (tokens / N) tokens per step; overflow tokens are dropped or routed to a default expert (training throughput vs quality trade-off).
- Noise on router logits: Gumbel or Gaussian noise during training for exploration; removed or reduced at inference.
- Expert dropout: randomly mask experts so backups learn.
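Two of the regularizers in the list above are short enough to sketch. The first function follows one common form of the load-balancing loss, the Switch Transformer formulation N · Σ f_i · P_i with top-1 assignments; the second is a router z-loss written as the mean squared log-sum-exp of the logits. Function names and the toy batches are assumptions for illustration, and real training code would compute these on tensors and scale each by a small coefficient.

```python
import math

def load_balance_loss(router_probs, assignments, n_experts):
    """Switch-style auxiliary loss: N * sum_i f_i * P_i.

    f_i = fraction of tokens dispatched to expert i (top-1 assignment)
    P_i = mean router probability for expert i over the batch
    Equals 1.0 when both are uniform and grows as routing concentrates.
    """
    n_tokens = len(assignments)
    f = [0.0] * n_experts
    for e in assignments:
        f[e] += 1.0 / n_tokens
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

def router_z_loss(logits_batch):
    """Mean squared log-sum-exp of router logits; penalizes large magnitudes."""
    total = 0.0
    for logits in logits_batch:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(logits_batch)

# Toy batches (made up): 4 tokens, 4 experts.
balanced = load_balance_loss([[0.25] * 4 for _ in range(4)], [0, 1, 2, 3], 4)
collapsed = load_balance_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], 4)
z_small = router_z_loss([[0.0, 0.0, 0.0, 0.0]])
z_large = router_z_loss([[10.0, 0.0, 0.0, 0.0]])
print(balanced, collapsed, z_small < z_large)
```

The collapsed batch scores several times higher than the balanced one, which is the gradient signal that pushes the router back toward even utilization.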
Top-1 vs top-2: top-1 minimizes FLOPs and simplifies serving; top-2 (Mixtral default) smooths gradients and reduces brittle routing at ~2× expert compute. DeepSeek-V2/V3 use fine-grained experts (many small experts, top-k with shared experts) to improve specialization without exploding per-token cost.
What the router actually learns
Experts are not hand-labeled. Specialization emerges from data: one expert may dominate code tokens, another long-form prose. Do not assume interpretable clusters without measurement — log per-expert token histograms by domain (support tickets, JSON, markdown) during eval.
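A per-domain histogram of the kind described above needs little more than a counter. The sketch below assumes the eval harness can emit one `(domain, expert_ids)` record per token; that record shape, the function names, and the 1% dead-expert threshold are all hypothetical.

```python
from collections import Counter, defaultdict

def expert_share_by_domain(records, n_experts):
    """records: iterable of (domain, expert_ids) pairs, one per token.

    Returns {domain: [share of routed slots going to each expert]}.
    """
    counts = defaultdict(Counter)
    for domain, expert_ids in records:
        counts[domain].update(expert_ids)
    shares = {}
    for domain, c in counts.items():
        total = sum(c.values())
        shares[domain] = [c[e] / total for e in range(n_experts)]
    return shares

def dead_experts(shares, threshold=0.01):
    """Experts whose share is below the threshold in every domain."""
    n = len(next(iter(shares.values())))
    return [e for e in range(n) if all(s[e] < threshold for s in shares.values())]

# Made-up routing records for a 4-expert, top-2 layer.
records = [
    ("json", (0, 1)), ("json", (0, 1)),
    ("markdown", (2, 0)), ("markdown", (2, 1)),
]
shares = expert_share_by_domain(records, n_experts=4)
print(shares["json"], shares["markdown"], dead_experts(shares))
```

Comparing the rows across domains shows whether any specialization is measurable, and the dead-expert list doubles as an alert condition.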
Expert parallelism and inference serving
Training and serving MoE require expert parallelism (EP) distinct from tensor parallel (TP) and pipeline parallel (PP). Each GPU holds a subset of experts; tokens are dispatched to the GPU owning the selected expert, computed, then combined. All-to-all communication dominates at scale — a reason MoE shines on well-connected GPU pods and hurts on latency-sensitive single-node CPU inference.
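The dispatch step can be simulated without GPUs to see where all-to-all traffic comes from. This sketch only does the bookkeeping: it groups (token, expert) pairs by the device that owns the expert and counts copies that must leave the token's home device. The function name, the placement map, and the sample routing are assumptions, not a framework API.

```python
def dispatch_plan(token_experts, expert_to_gpu, token_home_gpu):
    """Build per-GPU work lists and count token copies that cross GPUs.

    token_experts: list of expert-id tuples, one per token (its top-k choices)
    expert_to_gpu: dict mapping expert id -> index of the GPU holding it
    token_home_gpu: list giving the GPU where each token's hidden state lives
    """
    n_gpus = max(expert_to_gpu.values()) + 1
    work = [[] for _ in range(n_gpus)]
    cross_gpu = 0
    for tok, chosen in enumerate(token_experts):
        for e in chosen:
            dst = expert_to_gpu[e]
            work[dst].append((tok, e))
            if dst != token_home_gpu[tok]:
                cross_gpu += 1  # this copy travels in the all-to-all
    return work, cross_gpu

# Hypothetical layout: 8 experts, 2 per GPU on 4 GPUs.
expert_to_gpu = {e: e // 2 for e in range(8)}
token_experts = [(0, 1), (0, 5), (2, 3), (0, 1)]
token_home_gpu = [0, 0, 1, 3]
work, cross_gpu = dispatch_plan(token_experts, expert_to_gpu, token_home_gpu)
print([len(w) for w in work], cross_gpu)  # [5, 2, 1, 0] 3
```

Uneven work-list lengths are the idle-GPU symptom, and the cross-GPU count is a rough proxy for all-to-all volume at a given batch.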
Production serving patterns (2026):
- Colocated experts — all experts on one node if N is small and VRAM fits; simplest routing.
- EP sharding — experts spread across GPUs; batch tokens to amortize all-to-all (vLLM, SGLang MoE paths).
- Shared experts — always-on MLP plus sparse experts (DeepSeek pattern) guarantees a stable baseline path.
- Quantized experts — INT8/FP8 expert weights with FP16 attention; pairs with LLM quantization guides.
MoE does not shrink KV cache: attention is unchanged. Pair with GQA or MLA for decode memory, not MoE alone.
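A quick size estimate makes the point concrete: the KV cache formula has no expert count or top-k term in it. The layer count, head counts, and sequence length below are hypothetical, and the formula assumes standard per-head key and value caching at 2 bytes per value.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Keys plus values for every layer, KV head, and position.

    Note that neither the number of experts nor top-k appears here.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical shapes: same model with full multi-head KV vs 8 grouped KV heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(mha / 2**30, gqa / 2**30)  # 4.0 1.0 (GiB per sequence)
```

Cutting KV heads from 32 to 8 shrinks the cache fourfold, while swapping the FFN for any number of experts leaves it unchanged.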
Harbor Support MoE gateway refactor
Harbor's tier-2 support bot ran Mixtral-8x7B-class weights (8 experts, top-2) behind an OpenAI-compatible proxy. Symptoms: uneven GPU utilization, 2.1× p99 latency vs dense 34B on short replies, occasional nonsense when overflow dropped tokens during high-traffic windows.
- Re-tuned aux loss coefficient from paper default: Harbor's ticket vocabulary skewed routing, so the aux weight was increased 1.5× in a 2-day continued pretrain on 40M internal tokens.
- Raised capacity factor from 1.0 to 1.25 on training replay to match production burstiness.
- EP layout change: 2 experts per GPU on a 4-GPU node instead of all 8 experts on 2 GPUs (reduced all-to-all contention).
- Shared expert lane: added a dense 0.5× FFN bypass whose output is always summed into the MoE output, so tokens still get an FFN path when routing fails (inspired by DeepSeek shared-expert design).
- Domain-aware batching: batch tickets by language and length so routers see homogeneous groups; cut expert thrashing 22%.
Outcome: GPU idle dropped from 40% to 11%, p95 latency −38%, human escalation rate unchanged. Lesson: MoE serving is a scheduling problem as much as a modeling one.
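The domain-aware batching step from the refactor can be approximated with a simple bucketing pass. This is a sketch of the idea only, not Harbor's gateway code: the ticket fields (`id`, `lang`, `n_tokens`), the length edges, and the batch size are all hypothetical.

```python
from collections import defaultdict

def bucket_tickets(tickets, length_edges=(128, 512), max_batch=8):
    """Group tickets by (language, length bucket), then cut groups into batches.

    tickets: list of dicts with hypothetical keys 'id', 'lang', 'n_tokens'.
    """
    groups = defaultdict(list)
    for t in tickets:
        # Length bucket = number of edges the ticket length meets or exceeds.
        bucket = sum(t["n_tokens"] >= edge for edge in length_edges)
        groups[(t["lang"], bucket)].append(t["id"])
    batches = []
    for key in sorted(groups):
        ids = groups[key]
        for i in range(0, len(ids), max_batch):
            batches.append((key, ids[i:i + max_batch]))
    return batches

# Made-up tickets.
tickets = [
    {"id": 1, "lang": "en", "n_tokens": 40},
    {"id": 2, "lang": "de", "n_tokens": 700},
    {"id": 3, "lang": "en", "n_tokens": 90},
    {"id": 4, "lang": "en", "n_tokens": 300},
]
batches = bucket_tickets(tickets)
print(batches)
```

Each batch then contains one language and one length band, so the set of experts it touches tends to be narrower than in a mixed batch.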
Architecture decision table
| Approach | Capacity vs cost | Best when | Watch out for |
|---|---|---|---|
| Dense transformer | All params active; predictable FLOPs | Single-GPU deploy, strict latency SLOs, smaller models | Parameter scaling hits VRAM and train cost linearly |
| MoE (top-2, 8–64 experts) | High total params, ~2/N active FFN FLOPs | Multi-GPU training, need capacity without 70B dense cost | Load balance, all-to-all, dead experts |
| Fine-grained MoE + shared expert | Many small experts, stable shared path | Long-context chat, code+prose mix (DeepSeek-class) | Complex serving stack; fewer canned checkpoints |
| Multi-model router (external) | Separate models per task | Clear task boundaries (OCR vs chat vs embed) | Ops overhead; no shared representation learning |
| Distillation to dense student | One-time train cost; dense deploy | Edge or CPU after MoE teacher training | Quality gap; needs teacher access |
Common pitfalls
- Comparing total params to dense baselines — benchmark active FLOPs and VRAM for your batch size, not headline “456B” counts.
- Ignoring expert collapse in fine-tunes — small domain fine-tunes can re-specialize two experts and starve the rest; monitor histograms every epoch.
- Capacity factor too low at inference — training drops overflow; serving must never drop tokens — use padding, second-choice routing, or shared expert fallback.
- MoE on a single consumer GPU — all experts may fit, but you pay routing overhead without EP benefits; often worse than a smaller dense model.
- Skipping communication profiling — all-to-all can exceed matmul time on small batches; batch size 1 chat is painful for MoE.
- Assuming MoE fixes context length — it scales FFN capacity, not attention quadratic cost; use sliding window, MLA, or linear attention separately.
- Router dtype mismatches — FP16 router logits on large expert counts overflow; keep router in FP32 or apply z-loss.
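For the capacity-factor pitfall above, one way to guarantee that serving never drops a token is to fall through to the next-ranked expert and then to a shared path. The sketch below is a simplification under stated assumptions: it assigns one expert per token, processes tokens greedily in order, and uses a hypothetical `SHARED` sentinel for the always-on fallback. Production kernels do this in parallel on tensors.

```python
import math

SHARED = -1  # sentinel: token falls back to the always-on shared/dense path

def assign_with_capacity(ranked_choices, n_experts, capacity_factor=1.25):
    """Capacity-limited assignment that never drops a token.

    ranked_choices: per token, expert ids ordered best-first.
    Each expert takes at most ceil(capacity_factor * tokens / n_experts) tokens.
    Overflow tries the next-ranked expert, then the shared path.
    """
    capacity = math.ceil(capacity_factor * len(ranked_choices) / n_experts)
    load = [0] * n_experts
    assignment = []
    for choices in ranked_choices:
        chosen = SHARED
        for e in choices:
            if load[e] < capacity:
                load[e] += 1
                chosen = e
                break
        assignment.append(chosen)
    return assignment, capacity

# Made-up skewed batch: 5 of 8 tokens prefer expert 0.
ranked = [(0, 1)] * 5 + [(2, 3)] * 3
tight, cap_tight = assign_with_capacity(ranked, n_experts=4, capacity_factor=1.0)
loose, cap_loose = assign_with_capacity(ranked, n_experts=4, capacity_factor=1.25)
print(cap_tight, tight)  # 2 [0, 0, 1, 1, -1, 2, 2, 3]
print(cap_loose, loose)  # 3 [0, 0, 0, 1, 1, 2, 2, 2]
```

At capacity factor 1.0 one token lands on the shared path; at 1.25 every token reaches a routed expert, which mirrors why a little headroom matters for bursty traffic.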
Production checklist
- Report active parameter count and top-k FFN FLOPs per token alongside total weights.
- Log per-expert token share, overflow drops, and dead-expert alerts in training and serving.
- Sweep aux loss and capacity factor on a representative slice before full fine-tune.
- Profile all-to-all vs compute at production batch sizes (not only micro-benchmark matmuls).
- Plan EP GPU topology: experts per device vs network bisection bandwidth.
- Implement shared-expert or dense fallback for routing failures at inference.
- Pair MoE with GQA/MLA if decode KV cache is the bottleneck.
- Validate quality on skewed domains (JSON, non-English, short prompts) where routers fail silently.
- Document whether checkpoints expect top-1 or top-2 and matching noise settings.
- Consider distilling to dense if final deploy target is edge or CPU-only.
Key takeaways
- MoE replaces dense FFNs with routed expert banks, activating top-k experts per token so total capacity grows faster than per-token compute.
- Training needs load-balancing aux loss, capacity limits, and monitoring to prevent expert collapse and dead experts.
- Serving MoE is an expert-parallel scheduling problem — all-to-all communication and batching dominate tail latency.
- Mixtral-style top-2 and DeepSeek-style fine-grained plus shared experts are the dominant 2026 patterns; pick by GPU topology and traffic shape.
- MoE complements but does not replace attention optimizations (GQA, MLA) or scaling-law trade-offs for data and compute budgets.
Related reading
- Distributed LLM training explained — DDP, TP, PP, and expert parallel axes
- LLM scaling laws explained — how data, compute, and params interact
- Transformer architecture explained — where MoE blocks sit in the stack
- Group Query Attention (GQA) explained — KV cache savings MoE does not provide