
Multi-agent orchestration explained

A single AI agent with access to every tool in your company sounds elegant until the model confuses billing APIs with deployment runbooks, burns half your token budget on irrelevant retrieval, and loops on the wrong subtask. Multi-agent orchestration splits work across specialist agents, each with a narrow role, tool set, and system prompt, coordinated by an orchestrator that decomposes tasks, routes subtasks, merges results, and enforces guardrails. The pattern mirrors how human teams operate: a triage lead assigns tickets, a researcher gathers facts, a writer drafts the reply, and a reviewer checks policy before anything reaches a customer. This guide covers when multiple agents beat one generalist; common topologies (orchestrator-worker, hierarchical, pipeline, parallel fan-out, debate); communication, shared state, and handoff protocols; cost and safety controls; how orchestration pairs with single-agent tool loops, agent memory, MCP tool servers, and function calling; a Harbor Support tiered-escalation worked example; a topology decision table; common pitfalls; and a production checklist.

When one agent is not enough

A lone agent works when the task is bounded, tools are few, and context fits comfortably in one window. Multi-agent systems earn their complexity when:

  • Tool surface is large — dozens of APIs; exposing all schemas in one prompt degrades tool-selection accuracy.
  • Domains differ sharply — legal review, code execution, and customer tone need incompatible system prompts.
  • Tasks decompose naturally — research then write then verify is three roles, not one marathon loop.
  • Parallelism saves latency — independent sub-queries (price check + inventory check + shipping estimate) can run concurrently.
  • Quality needs a second opinion — a critic agent catches hallucinations a generator agent will not self-flag.

The cost is operational: more LLM calls, more state to track, more failure modes. Default to a single agent or a fixed workflow; add orchestration when measurement shows tool-selection errors, context overflow, or quality gaps you cannot fix with better prompts alone.

Common orchestration topologies

Orchestrator-worker (supervisor)

A supervisor agent reads the user goal, emits a plan as structured subtasks, and dispatches each to a worker. Workers return compact results; the supervisor synthesizes the final answer. Workers never talk to the user directly unless the supervisor delegates that channel. This is the default pattern in many frameworks (LangGraph supervisor nodes, CrewAI managers).
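The supervisor loop described above can be sketched in a few lines. This is a minimal illustration, not any framework's API: the worker functions, the `WORKERS` registry, and the fixed two-step `plan` are hypothetical stand-ins for model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Subtask:
    role: str      # which worker should handle the subtask
    payload: str   # compact input for that worker


# Hypothetical workers: in a real system each would wrap an LLM call
# with its own system prompt and tool subset.
def research_worker(payload: str) -> str:
    return f"facts about {payload}"


def writer_worker(payload: str) -> str:
    return f"draft covering {payload}"


WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_worker,
    "writer": writer_worker,
}


def plan(goal: str) -> list[Subtask]:
    # Stand-in for the supervisor's planning call: a fixed two-step plan.
    return [Subtask("research", goal), Subtask("writer", goal)]


def supervise(goal: str) -> str:
    # Dispatch each subtask to its worker and collect compact results.
    results = [WORKERS[task.role](task.payload) for task in plan(goal)]
    # Synthesis is a plain join here; a real supervisor would likely
    # make another model call to merge the results.
    return " | ".join(results)


print(supervise("refund policy"))
```

Only `supervise` produces user-facing output, which matches the rule that workers do not talk to the user directly.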

Hierarchical teams

Supervisors can manage other supervisors: a project lead delegates to a research lead and an engineering lead, each running their own worker pool. Depth adds routing flexibility but increases latency and debugging difficulty. Cap hierarchy at two or three levels unless you have strong observability.

Sequential pipeline

Agents pass artifacts in fixed order: extract → transform → validate → publish. No central planner; the graph topology is the plan. Pipelines are predictable, cheap to test, and ideal when steps rarely skip or reorder.
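A sequential pipeline can be as simple as a list of functions applied in order. The sketch below assumes each stage takes and returns a plain dict; the stage bodies are illustrative placeholders, not real agents.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[dict], dict]


def extract(doc: dict) -> dict:
    return {**doc, "fields": doc["raw"].split(",")}


def transform(doc: dict) -> dict:
    return {**doc, "fields": [f.strip().lower() for f in doc["fields"]]}


def validate(doc: dict) -> dict:
    # Fail fast: later stages never see an invalid artifact.
    if not all(doc["fields"]):
        raise ValueError("empty field")
    return {**doc, "valid": True}


def publish(doc: dict) -> dict:
    return {**doc, "published": True}


# The list order is the plan; there is no central planner.
PIPELINE: list[Stage] = [extract, transform, validate, publish]


def run_pipeline(doc: dict, stages: list[Stage] = PIPELINE) -> dict:
    return reduce(lambda acc, stage: stage(acc), stages, doc)


print(run_pipeline({"raw": "A, B"}))
```

Because each stage is a pure function of its input, every step can be unit tested in isolation.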

Parallel fan-out / fan-in

The orchestrator spawns N workers on independent subtasks, waits for all (or k-of-n), then merges. Use timeouts and partial-result policies — one slow worker should not block the entire response if an approximate answer is acceptable.
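One way to express the timeout and partial-result policy is with `asyncio.wait`, which returns the finished and unfinished tasks separately. The worker names and delays below are made up for illustration; `worker` stands in for a real agent call.

```python
import asyncio


async def worker(name: str, delay: float) -> str:
    # Stand-in for an agent call; delay simulates model and tool latency.
    await asyncio.sleep(delay)
    return f"{name}:done"


async def fan_out(jobs: dict[str, float], timeout: float) -> dict[str, str]:
    tasks = {
        name: asyncio.create_task(worker(name, delay))
        for name, delay in jobs.items()
    }
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout)
    # Partial-result policy: cancel stragglers and merge what finished.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return {name: t.result() for name, t in tasks.items() if t in done}


results = asyncio.run(
    fan_out({"price": 0.01, "inventory": 0.01, "shipping": 5.0}, timeout=0.5)
)
print(results)
```

Here the slow `shipping` worker is dropped and the response proceeds with the two results that arrived in time. A k-of-n policy could instead loop on `asyncio.wait(..., return_when=asyncio.FIRST_COMPLETED)` until k results are in.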

Debate and critique

Two or more agents propose solutions; a judge or iterative critique loop selects or refines. This is effective for high-stakes reasoning (policy interpretation, security review) at the cost of 2–5× token spend. Stop after a fixed round count so agents cannot keep trading revisions indefinitely.
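A bounded generator-critic loop might look like the sketch below. The `generator` and `critic` functions are deterministic stubs standing in for model calls, and the `returns-policy` doc slug is hypothetical.

```python
MAX_ROUNDS = 2


def generator(task: str, feedback: list[str]) -> str:
    # Stub generator: adds a citation only after the critic asks for one.
    draft = f"Answer to {task}"
    if "missing citation" in feedback:
        draft += " [doc:returns-policy]"
    return draft


def critic(draft: str) -> list[str]:
    # Stub critic: an empty list means the draft is accepted.
    return [] if "[doc:" in draft else ["missing citation"]


def critique_loop(task: str, max_rounds: int = MAX_ROUNDS) -> tuple[str, int, bool]:
    feedback: list[str] = []
    draft = generator(task, feedback)
    for round_no in range(1, max_rounds + 1):
        feedback = critic(draft)
        if not feedback:
            return draft, round_no, True
        draft = generator(task, feedback)
    # Round budget exhausted: return the last draft, flagged as unapproved.
    return draft, max_rounds, False


print(critique_loop("refund window"))
```

The third element of the result tells the caller whether the critic approved the draft or the loop simply ran out of rounds, so an unapproved draft can be escalated rather than delivered.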

Communication, state, and handoffs

Agents coordinate through messages and shared state. Pick one primary pattern per system; mixing ad hoc copies of context across agents causes drift.

Message passing

Each handoff is a structured payload: { task_id, agent_role, input_summary, artifacts[] }. Summarize prior steps instead of forwarding full chat logs — workers need conclusions and citations, not every failed tool attempt.
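The payload shape above maps naturally onto a small dataclass. In this sketch the field names come from the text; the `summarize` helper and its `CONCLUSION:` prefix convention are assumptions used only to show the idea of forwarding conclusions instead of the full log.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Handoff:
    task_id: str
    agent_role: str
    input_summary: str
    artifacts: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


def summarize(transcript: list[str], max_items: int = 2) -> str:
    # Stand-in for a summarization call: keep only lines marked as
    # conclusions and drop failed tool attempts and other noise.
    conclusions = [line for line in transcript if line.startswith("CONCLUSION:")]
    return " ".join(conclusions[-max_items:])


transcript = [
    "tool call failed: timeout",
    "CONCLUSION: order is within return window",
    "retry succeeded",
]
handoff = Handoff("t-1", "writer", summarize(transcript), ["doc:returns-policy"])
print(handoff.to_json())
```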

Shared blackboard

A central store (Redis, Postgres JSON column, or in-memory dict in dev) holds facts the team agrees on: customer ID, retrieved policy clauses, open blockers. Agents read and write keyed fields with schema validation. The blackboard is your episodic memory for the current run.
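For development, the in-memory variant could be a thin class that rejects unknown keys and wrong types. The schema keys below are taken from the examples in the text; the class itself is a hypothetical sketch, not a library API.

```python
class Blackboard:
    # Allowed keys and their expected types (illustrative schema).
    SCHEMA = {"customer_id": str, "policy_clauses": list, "open_blockers": list}

    def __init__(self) -> None:
        self._facts: dict[str, dict] = {}

    def write(self, key: str, value, author: str) -> None:
        expected = self.SCHEMA.get(key)
        if expected is None:
            raise KeyError(f"unknown blackboard key: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
        # Record who wrote the fact so traces can attribute it later.
        self._facts[key] = {"value": value, "author": author}

    def read(self, key: str):
        return self._facts[key]["value"]


board = Blackboard()
board.write("customer_id", "c-42", author="triage")
print(board.read("customer_id"))
```

Swapping the dict for Redis or a Postgres JSON column would keep the same `write` and `read` surface while moving validation to the boundary.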

Handoff contracts

Define what each role must produce before the next agent runs: the researcher outputs sourced bullet points with doc IDs; the writer outputs draft text only; the reviewer outputs pass/fail plus edit list. Ambiguous handoffs produce agents that re-do each other’s work.
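A handoff contract can be checked at runtime before the next agent starts. The sketch below assumes each role returns a dict; the field names in `CONTRACTS` are illustrative choices for the three roles named above.

```python
# Required output fields per role (hypothetical field names).
CONTRACTS: dict[str, dict[str, type]] = {
    "researcher": {"bullets": list, "doc_ids": list},
    "writer": {"draft": str},
    "reviewer": {"passed": bool, "edits": list},
}


def check_contract(role: str, output: dict) -> list[str]:
    """Return a list of contract violations; empty means the handoff is valid."""
    contract = CONTRACTS[role]
    problems = []
    for name, expected in contract.items():
        if name not in output:
            problems.append(f"missing {name}")
        elif not isinstance(output[name], expected):
            problems.append(f"{name} must be {expected.__name__}")
    # Extra fields signal a role doing another role's work.
    problems += [f"unexpected {name}" for name in sorted(set(output) - set(contract))]
    return problems


print(check_contract("writer", {"draft": "Hello", "doc_ids": []}))
```

Rejecting unexpected fields is a design choice: it catches a writer that starts doing research before the overlap becomes habitual.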

Human-in-the-loop gates

Insert approval steps before irreversible actions (refunds, deploys, outbound email). The orchestrator pauses the graph, surfaces a structured summary, and resumes on human signal.
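The pause-and-resume behavior can be modeled as a small state machine. This is a sketch under simple assumptions: one action per run, and a hypothetical `IRREVERSIBLE` set naming the actions that need approval.

```python
from dataclasses import dataclass
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    COMPLETED = "completed"
    REJECTED = "rejected"


@dataclass
class Run:
    action: str
    state: RunState = RunState.RUNNING
    executed: bool = False


IRREVERSIBLE = {"refund", "deploy", "send_email"}


def step(run: Run) -> Run:
    if run.state is not RunState.RUNNING:
        raise RuntimeError("run is not runnable")
    if run.action in IRREVERSIBLE:
        # Pause before the side effect, never after it.
        run.state = RunState.AWAITING_APPROVAL
        return run
    run.executed = True
    run.state = RunState.COMPLETED
    return run


def resume(run: Run, approved: bool) -> Run:
    if run.state is not RunState.AWAITING_APPROVAL:
        raise RuntimeError("run is not waiting for approval")
    run.executed = approved
    run.state = RunState.COMPLETED if approved else RunState.REJECTED
    return run


run = step(Run("refund"))
print(run.state, run.executed)
```

In production the paused `Run` would be persisted so that the approval can arrive minutes or hours later in a different process.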

Task decomposition and routing

The orchestrator’s core job is turning a fuzzy user request into routable subtasks.

Planning styles

  • Upfront plan — supervisor emits full task list before execution; easy to audit, brittle when the world changes mid-run.
  • Reactive replanning — supervisor revises after each worker result; handles surprises, costs more tokens.
  • Router-only — classify intent and send to one specialist; not full multi-step orchestration but often sufficient.
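The router-only style is the cheapest to prototype. The sketch below uses keyword matching as a stand-in for an intent classifier; the specialist names, keywords, and the `human` fallback are all assumptions.

```python
# Hypothetical keyword routes; a real router would likely use a small
# classification model instead of substring checks.
ROUTES: dict[str, tuple[str, ...]] = {
    "billing": ("refund", "invoice", "charge"),
    "shipping": ("delivery", "tracking", "package"),
    "security": ("password", "login", "2fa"),
}


def route(message: str, default: str = "human") -> str:
    text = message.lower()
    for specialist, keywords in ROUTES.items():
        if any(keyword in text for keyword in keywords):
            return specialist
    # No confident match: escalate instead of guessing.
    return default


print(route("Where is my package?"))
```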

Specialist design

Each specialist gets: a role name, a tight system prompt, an allowlisted tool subset, and optional output schema. A BillingAgent should not see DeployAgent tools. Use MCP servers or namespaced function groups to enforce boundaries in code, not just in prompts.
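Enforcing boundaries in code can be as small as a lookup before every tool call. In this sketch the tool functions and their return values are hypothetical; only the role names come from the text.

```python
class ToolNotAllowed(PermissionError):
    pass


# Hypothetical tool implementations.
TOOLS = {
    "get_invoice": lambda invoice_id: {"invoice_id": invoice_id, "status": "paid"},
    "trigger_deploy": lambda service: {"service": service, "status": "queued"},
}

# Per-role allowlists: a role can call only what is listed here.
ALLOWLIST: dict[str, set[str]] = {
    "BillingAgent": {"get_invoice"},
    "DeployAgent": {"trigger_deploy"},
}


def call_tool(role: str, tool: str, **kwargs):
    # The check runs in the dispatcher, so a prompt injection that
    # convinces the model to request another tool still fails here.
    if tool not in ALLOWLIST.get(role, set()):
        raise ToolNotAllowed(f"{role} may not call {tool}")
    return TOOLS[tool](**kwargs)


print(call_tool("BillingAgent", "get_invoice", invoice_id="inv-7"))
```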

Termination conditions

Set max steps per run, max spend in USD, max wall-clock time, and explicit FINISH signals. Orchestrators that never declare done will loop until budget exhaustion.
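These limits can live in one budget object that the orchestrator consults before every step. The sketch assumes each agent step returns a `(signal, cost_usd)` tuple, which is an illustrative convention rather than a standard interface.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Budget:
    max_steps: int
    max_usd: float
    max_seconds: float
    steps: int = 0
    spent_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, usd: float) -> None:
        self.steps += 1
        self.spent_usd += usd

    def stop_reason(self) -> Optional[str]:
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.spent_usd >= self.max_usd:
            return "max_usd"
        if time.monotonic() - self.started >= self.max_seconds:
            return "max_seconds"
        return None


def run(agent_step: Callable[[], tuple[str, float]], budget: Budget) -> str:
    while True:
        reason = budget.stop_reason()
        if reason:
            return reason
        signal, cost = agent_step()
        budget.charge(cost)
        if signal == "FINISH":
            return "finished"


# An agent that never declares done is stopped by the step cap.
print(run(lambda: ("CONTINUE", 0.01), Budget(max_steps=5, max_usd=1.0, max_seconds=60)))
```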

Runtime, observability, and cost control

Multi-agent runs are distributed systems with nondeterministic components. Treat them accordingly.

Tracing

Assign a run_id and log every agent transition: who ran, input token count, tools called, output summary, latency, and model ID. Visualize as a DAG so failures map to a specific node. OpenTelemetry spans per agent step are worth the setup at production scale.
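Before adopting a tracing backend, the same fields can be captured with the standard library. The sketch below is a stand-in for real spans, not the OpenTelemetry API; the `agent_span` helper, the in-memory `TRACE` list, and the model ID string are assumptions.

```python
import json
import time
import uuid
from contextlib import contextmanager

TRACE: list[dict] = []  # in-memory sink; production would export to a backend


@contextmanager
def agent_span(run_id: str, agent: str, model_id: str, input_tokens: int):
    record = {
        "run_id": run_id,
        "agent": agent,
        "model_id": model_id,
        "input_tokens": input_tokens,
        "tools": [],
        "output_summary": None,
    }
    start = time.perf_counter()
    try:
        yield record
    finally:
        # Latency is recorded even if the agent step raises.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        TRACE.append(record)


run_id = uuid.uuid4().hex
with agent_span(run_id, "TriageAgent", "small-router-model", 120) as span:
    span["tools"].append("ticket_metadata")
    span["output_summary"] = "intent=refund_missing_item"

print(json.dumps(TRACE[-1]))
```

Every record shares the same `run_id`, so a failed run can be reassembled into its sequence of agent transitions.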

Idempotency and retries

Workers retry on transient API errors with exponential backoff. Side-effecting tools (charges, tickets) require idempotency keys so a retry does not double-apply.
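The interaction between retries and idempotency keys is easiest to see with a fake API that applies the side effect but loses the response. Everything here is simulated: `FakePayments`, `TransientError`, and the key format are hypothetical.

```python
import time


class TransientError(Exception):
    pass


class FakePayments:
    """Simulated API: applies the refund, then fails the first N responses."""

    def __init__(self, fail_times: int) -> None:
        self.fail_times = fail_times
        self.applied: dict[str, float] = {}
        self.calls = 0

    def refund(self, idempotency_key: str, amount: float) -> float:
        self.calls += 1
        # The key makes repeat calls a no-op instead of a second refund.
        if idempotency_key not in self.applied:
            self.applied[idempotency_key] = amount
        if self.calls <= self.fail_times:
            raise TransientError("response lost after apply")
        return self.applied[idempotency_key]


def with_retries(fn, attempts: int = 4, base_delay: float = 0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff


api = FakePayments(fail_times=2)
result = with_retries(lambda: api.refund("refund-order-1001", 25.0))
print(result, api.calls, len(api.applied))
```

Three calls reach the API, but only one refund is recorded because every attempt carries the same key.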

Cost budgeting

Track spend per run and per agent role. Debate patterns and deep hierarchies multiply calls — cap parallel workers and use smaller models for routing and summarization, reserving frontier models for final synthesis or high-risk steps.
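Per-role spend tracking needs little more than a ledger keyed by role. The prices and the role-to-model mapping below are hypothetical numbers chosen for easy arithmetic, not real vendor pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K_TOKENS = {"small": 0.001, "frontier": 0.01}

# Cheap model for routing and summarization, frontier model for synthesis.
MODEL_FOR_ROLE = {"router": "small", "summarizer": "small", "synthesizer": "frontier"}


class CostLedger:
    def __init__(self) -> None:
        self.by_role: dict[str, float] = defaultdict(float)

    def record(self, role: str, tokens: int) -> float:
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[MODEL_FOR_ROLE[role]]
        self.by_role[role] += cost
        return cost

    def total(self) -> float:
        return sum(self.by_role.values())


ledger = CostLedger()
ledger.record("router", 2000)       # 2 x 0.001
ledger.record("synthesizer", 3000)  # 3 x 0.01
print(dict(ledger.by_role), round(ledger.total(), 4))
```

With the per-role breakdown in hand, an alert on one role's spend is a comparison against `ledger.by_role[role]`.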

Safety

Apply guardrails at orchestrator boundaries: PII redaction before cross-agent messages, tool sandboxing per role, and output filters before user delivery. A compromised worker should not inherit another role’s privileges.
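As one example of a boundary guardrail, the orchestrator can redact obvious PII before a message crosses to another agent. The two regular expressions below are deliberately simple assumptions; real PII detection needs broader patterns and usually a dedicated service.

```python
import re

# Simplified patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)


def send_to_agent(role: str, message: str) -> dict:
    # Redaction happens at the boundary, so no worker has to remember it.
    return {"to": role, "body": redact(message)}


print(send_to_agent("ComposerAgent", "Contact jo@example.com, card 4111 1111 1111 1111"))
```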

Worked example: Harbor Support tiered escalation

Harbor Support handles 12,000 monthly tickets across billing, shipping, and account security. A monolithic agent hallucinated refund policies and called the wrong Stripe endpoints. The team rebuilt with orchestrated specialists.

Roles

  • TriageAgent — classifies intent, extracts order ID, sets urgency; tools: ticket metadata only.
  • PolicyAgent — RAG over policy docs; outputs cited clauses with doc slugs; no customer write tools.
  • ActionAgent — executes refunds and reships via idempotent APIs when PolicyAgent citations support the action.
  • ComposerAgent — drafts customer-facing reply from blackboard facts; no direct API access.
  • Supervisor — plans steps, routes, and halts for human approval on refunds above $500.

Flow

User message → TriageAgent writes intent=refund_missing_item and order_id to the blackboard → PolicyAgent retrieves the return-window policy → if eligible, Supervisor requests human approval or auto-approves under the threshold → ActionAgent calls the refund API with an idempotency key → ComposerAgent drafts a reply citing the policy slug → Supervisor delivers. Average steps dropped from 14 (monolith) to 6; policy citation rate rose from 61% to 94%; erroneous refund attempts fell to near zero.
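The flow above can be traced end to end with stubbed agents. This is a sketch, not Harbor Support's implementation: each function stands in for an agent, and the order ID, policy slug, and reply wording are hypothetical. Only the $500 approval threshold comes from the role list.

```python
APPROVAL_THRESHOLD_USD = 500


def triage(message: str, board: dict) -> None:
    board["intent"] = "refund_missing_item"
    board["order_id"] = "order-1001"  # hypothetical extracted ID


def policy(board: dict) -> None:
    board["policy_slug"] = "returns-window"  # hypothetical doc slug
    board["eligible"] = True


def action(board: dict, amount: float) -> None:
    key = f"refund-{board['order_id']}"
    board["refund"] = {"idempotency_key": key, "amount": amount}


def compose(board: dict) -> str:
    amount = board["refund"]["amount"]
    return f"Refund of ${amount:.2f} issued per policy {board['policy_slug']}."


def supervise(message: str, amount: float, human_approves=None):
    board: dict = {}
    triage(message, board)
    policy(board)
    if not board["eligible"]:
        return "not_eligible", board
    if amount > APPROVAL_THRESHOLD_USD:
        # Gate sits before the refund call, not after it.
        if human_approves is None:
            return "awaiting_approval", board
        if not human_approves:
            return "rejected", board
    action(board, amount)
    board["reply"] = compose(board)
    return "delivered", board


status, board = supervise("My item never arrived", 40.0)
print(status, board["reply"])
```

A refund under the threshold is delivered in one pass, while a larger one returns `awaiting_approval` with no `refund` entry on the blackboard until a human signal arrives.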

What failed first

The initial debate topology (PolicyAgent and ActionAgent arguing for three rounds) looked rigorous but added 40 seconds and $0.18 per ticket. The team replaced it with a single PolicyAgent pass plus a deterministic rule check on cited clause IDs: same safety, one-third the cost.

Topology decision table

| Scenario | Recommended topology | Why |
|---|---|---|
| Large tool catalog, varied intents | Orchestrator-worker with router | Specialists see only relevant tools; supervisor plans steps |
| Fixed ETL-style workflow | Sequential pipeline | Predictable, testable, minimal planner tokens |
| Independent data sources | Parallel fan-out / fan-in | Latency bounded by slowest worker, not sum of steps |
| High-stakes policy or security | Generator + critic (1–2 rounds) | Second pass catches unsupported claims |
| Simple FAQ, one domain | Single agent or router-only | Multi-agent overhead not justified |
| Cross-department enterprise tasks | Hierarchical (2 levels max) | Department leads scope tools; avoid deep trees |

Common pitfalls

  • Agents without boundaries — every agent has every tool; behavior collapses to a confused monolith with extra latency.
  • Full transcript forwarding — context windows fill with noise; workers lose the actual task.
  • No termination policy — supervisor replans forever on ambiguous tickets.
  • Debate by default — multi-round critique on low-risk tasks burns budget without measurable quality gain.
  • Implicit shared state — agents assume peers read chat history that was never written to the blackboard.
  • Missing run traces — impossible to debug which specialist hallucinated or called the wrong API.
  • Human gates too late — approval after irreversible API calls instead of before.

Production checklist

  • Document topology diagram (agents, edges, termination) before implementation.
  • Per-role tool allowlists enforced in code, not prompts alone.
  • Handoff schemas with required fields validated at runtime.
  • Shared blackboard with typed keys and TTL per run.
  • Max steps, max cost, and max wall-clock enforced by orchestrator.
  • Structured tracing with run_id across all agent steps.
  • Idempotency keys on all side-effecting tools.
  • Human approval gates on high-risk actions with clear resume semantics.
  • Regression tests per specialist in isolation plus end-to-end golden runs.
  • Cost dashboard per role; alert on p95 spend anomalies.
  • Fallback to single-agent or human when orchestration exhausts budget.
  • Quarterly review: collapse roles if two agents always run sequentially with no routing benefit.

Key takeaways

  • Multi-agent orchestration trades complexity for specialization, parallelism, and review — not every product needs it.
  • Supervisor-worker is the default; pipelines and parallel fan-out cover most production shapes.
  • Shared state and explicit handoff contracts matter more than clever prompts.
  • Observability and cost caps are non-negotiable when multiple LLM calls chain per user request.
  • Start with a monolith or fixed workflow; split agents only when metrics justify the split.
