Guide
Multi-agent orchestration explained
A single AI agent with access to every tool in your company sounds elegant until the model confuses billing APIs with deployment runbooks, burns half your token budget on irrelevant retrieval, and loops on the wrong subtask. Multi-agent orchestration splits work across specialist agents, each with a narrow role, tool set, and system prompt, coordinated by an orchestrator that decomposes tasks, routes subtasks, merges results, and enforces guardrails. The pattern mirrors how human teams operate: a triage lead assigns tickets, a researcher gathers facts, a writer drafts the reply, and a reviewer checks policy before anything reaches a customer. This guide covers: when multiple agents beat one generalist; common topologies (orchestrator-worker, hierarchical, pipeline, debate); communication and shared state; handoff protocols; cost and safety controls; how orchestration pairs with single-agent tool loops, agent memory, MCP tool servers, and function calling; a Harbor Support tiered-escalation worked example; a topology decision table; common pitfalls; and a production checklist.
When one agent is not enough
A lone agent works when the task is bounded, tools are few, and context fits comfortably in one window. Multi-agent systems earn their complexity when:
- Tool surface is large — dozens of APIs; exposing all schemas in one prompt degrades tool-selection accuracy.
- Domains differ sharply — legal review, code execution, and customer tone need incompatible system prompts.
- Tasks decompose naturally — research then write then verify is three roles, not one marathon loop.
- Parallelism saves latency — independent sub-queries (price check + inventory check + shipping estimate) can run concurrently.
- Quality needs a second opinion — a critic agent catches hallucinations a generator agent will not self-flag.
The cost is operational: more LLM calls, more state to track, more failure modes. Default to a single agent or a fixed workflow; add orchestration when measurement shows tool-selection errors, context overflow, or quality gaps you cannot fix with better prompts alone.
Common orchestration topologies
Orchestrator-worker (supervisor)
A supervisor agent reads the user goal, emits a plan as structured subtasks, and dispatches each to a worker. Workers return compact results; the supervisor synthesizes the final answer. Workers never talk to the user directly unless the supervisor delegates that channel. This is the default pattern in many frameworks (LangGraph supervisor nodes, CrewAI managers).
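The supervisor loop described above can be sketched in a few lines. This is a minimal illustration and not any framework's API: `Subtask`, `plan`, `supervise`, and the two worker functions are hypothetical stand-ins, and the plan is hard-coded where a real supervisor would ask a model for it.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Subtask:
    role: str      # which worker should handle this subtask
    payload: str   # compact instruction for that worker

# Hypothetical workers: plain functions standing in for LLM-backed agents.
def research_worker(payload: str) -> str:
    return f"facts for: {payload}"

def writer_worker(payload: str) -> str:
    return f"draft about: {payload}"

WORKERS: dict[str, Callable[[str], str]] = {
    "research": research_worker,
    "write": writer_worker,
}

def plan(goal: str) -> list[Subtask]:
    # Assumption: a real supervisor would generate this plan with a model.
    return [Subtask("research", goal), Subtask("write", goal)]

def supervise(goal: str) -> str:
    results = []
    for task in plan(goal):
        worker = WORKERS[task.role]            # dispatch by role
        results.append(worker(task.payload))   # workers return compact results
    return " | ".join(results)                 # supervisor merges the results

answer = supervise("late delivery refund")
```

Only `supervise` produces the final answer, which matches the rule that workers do not talk to the user directly.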
Hierarchical teams
Supervisors can manage other supervisors: a project lead delegates to a research lead and an engineering lead, each running their own worker pool. Depth adds routing flexibility but increases latency and debugging difficulty. Cap hierarchy at two or three levels unless you have strong observability.
Sequential pipeline
Agents pass artifacts in fixed order: extract → transform → validate → publish. No central planner; the graph topology is the plan. Pipelines are predictable, cheap to test, and ideal when steps rarely skip or reorder.
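As a rough sketch of "the graph topology is the plan", a pipeline can be an ordered list of stage functions folded over an artifact. The stage bodies and the `raw`/`fields` keys below are invented for illustration; in practice each stage would wrap an agent call.

```python
from functools import reduce
from typing import Callable

Stage = Callable[[dict], dict]

def extract(doc: dict) -> dict:
    return {**doc, "fields": doc["raw"].split(",")}

def transform(doc: dict) -> dict:
    return {**doc, "fields": [f.strip().lower() for f in doc["fields"]]}

def validate(doc: dict) -> dict:
    if not all(doc["fields"]):
        raise ValueError("empty field after transform")
    return {**doc, "valid": True}

# The fixed order of this list is the whole plan; there is no planner.
PIPELINE: list[Stage] = [extract, transform, validate]

def run_pipeline(doc: dict, stages: list[Stage] = PIPELINE) -> dict:
    return reduce(lambda artifact, stage: stage(artifact), stages, doc)

result = run_pipeline({"raw": "Order, Refund ,SKU-1"})
```

Because each stage is an ordinary function of its input artifact, every stage can be unit-tested in isolation.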
Parallel fan-out / fan-in
The orchestrator spawns N workers on independent subtasks, waits for all (or k-of-n), then merges. Use timeouts and partial-result policies — one slow worker should not block the entire response if an approximate answer is acceptable.
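One way to implement a timeout with a partial-result policy is `asyncio.wait`, which returns the finished and unfinished tasks separately. The sketch below assumes workers are coroutines; the `worker` function and its sleep-based latency are placeholders.

```python
import asyncio

async def worker(name: str, delay: float) -> str:
    # Stand-in for a worker agent; sleep simulates model and tool latency.
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def fan_out(jobs: dict[str, float], timeout: float) -> dict[str, str]:
    tasks = {name: asyncio.create_task(worker(name, delay))
             for name, delay in jobs.items()}
    done, pending = await asyncio.wait(tasks.values(), timeout=timeout)
    for task in pending:
        task.cancel()                                     # drop slow workers
    await asyncio.gather(*pending, return_exceptions=True)  # let cancellations settle
    return {name: (task.result() if task in done else "timed out")
            for name, task in tasks.items()}

results = asyncio.run(
    fan_out({"price": 0.01, "inventory": 0.01, "shipping": 5.0}, timeout=0.5)
)
```

Here the slow `shipping` worker is reported as timed out while the other two results are still returned. A k-of-n policy could use `return_when=asyncio.FIRST_COMPLETED` in a loop instead.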
Debate and critique
Two or more agents propose solutions; a judge or iterative critique loop selects or refines. Effective for high-stakes reasoning (policy interpretation, security review) at the cost of 2–5× token spend. Stop after a fixed round count so agents cannot keep critiquing or deferring to each other indefinitely.
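A bounded generator-critic loop might look like the following sketch. The `generate` and `critique` callables are assumptions standing in for model-backed agents; the stubs below exist only to make the example runnable.

```python
from typing import Callable, Optional

def critique_loop(
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 2,
) -> tuple[str, int]:
    """Return (draft, rounds_used); stop on approval or at max_rounds."""
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        feedback = critique(draft)
        if feedback is None:          # critic has no objections
            return draft, round_no
    return draft, max_rounds          # cap reached: return the last draft

# Stub agents for illustration only.
def stub_generate(task: str, feedback: Optional[str]) -> str:
    return f"{task} [cited]" if feedback else task

def stub_critique(draft: str) -> Optional[str]:
    return None if "[cited]" in draft else "add a citation"

draft, rounds = critique_loop(stub_generate, stub_critique,
                              "Refund allowed within 30 days")
```

When the cap is reached without approval, the caller still gets a draft and can decide whether to escalate.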
Communication, state, and handoffs
Agents coordinate through messages and shared state. Pick one primary pattern per system; mixing ad hoc copies of context across agents causes drift.
Message passing
Each handoff is a structured payload: { task_id, agent_role, input_summary, artifacts[] }. Summarize prior steps instead of forwarding full chat logs; workers need conclusions and citations, not every failed tool attempt.
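The payload shape above maps naturally onto a small dataclass. The field names come from the text; the 500-character limit on `input_summary` is an arbitrary assumption added to discourage forwarding full transcripts.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass(frozen=True)
class Handoff:
    task_id: str
    agent_role: str
    input_summary: str
    artifacts: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.task_id or not self.agent_role:
            raise ValueError("task_id and agent_role are required")
        if len(self.input_summary) > 500:   # assumed limit, tune per system
            raise ValueError("input_summary too long; summarize prior steps")

msg = Handoff(
    task_id="t-001",
    agent_role="writer",
    input_summary="Customer is eligible for a refund under the return-window policy.",
    artifacts=["policy/returns-window"],
)
wire = json.dumps(asdict(msg))   # what actually travels between agents
```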
Shared blackboard
A central store (Redis, Postgres JSON column, or in-memory dict in dev) holds facts the team agrees on: customer ID, retrieved policy clauses, open blockers. Agents read and write keyed fields with schema validation. The blackboard is your episodic memory for the current run.
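For development, the in-memory variant can be a thin wrapper around a dict that rejects undeclared keys and wrong types. The `Blackboard` class and its type-per-key schema are a hypothetical sketch; a production store would sit on Redis or Postgres as described above.

```python
class Blackboard:
    """Per-run fact store; every key must be declared with a type up front."""

    def __init__(self, schema: dict[str, type]):
        self._schema = schema
        self._facts: dict[str, object] = {}

    def write(self, key: str, value: object) -> None:
        if key not in self._schema:
            raise KeyError(f"undeclared key: {key}")
        if not isinstance(value, self._schema[key]):
            raise TypeError(f"{key} must be {self._schema[key].__name__}")
        self._facts[key] = value

    def read(self, key: str, default: object = None) -> object:
        return self._facts.get(key, default)

board = Blackboard({"customer_id": str, "policy_clauses": list, "open_blockers": list})
board.write("customer_id", "cus_42")
board.write("policy_clauses", ["returns-window"])
```

Rejecting unknown keys at write time surfaces the "implicit shared state" problem early, since an agent cannot quietly invent a field its peers do not know about.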
Handoff contracts
Define what each role must produce before the next agent runs: the researcher outputs sourced bullet points with doc IDs; the writer outputs draft text only; the reviewer outputs pass/fail plus edit list. Ambiguous handoffs produce agents that re-do each other’s work.
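A handoff contract can be checked mechanically before the next agent runs. In this sketch the field names (`bullets`, `doc_ids`, `draft`, `passed`, `edits`) are assumptions chosen to mirror the three roles described above.

```python
from typing import Any

# Hypothetical contracts: required output fields and their types, per role.
CONTRACTS: dict[str, dict[str, type]] = {
    "researcher": {"bullets": list, "doc_ids": list},
    "writer": {"draft": str},
    "reviewer": {"passed": bool, "edits": list},
}

def contract_violations(role: str, output: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the handoff may proceed."""
    problems = []
    for name, expected in CONTRACTS[role].items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    extra = set(output) - set(CONTRACTS[role])
    problems.extend(f"unexpected field: {name}" for name in sorted(extra))
    return problems

ok = contract_violations("reviewer", {"passed": True, "edits": []})
bad = contract_violations("writer", {"draft": "Hi", "refund_issued": True})
```

Flagging unexpected fields catches a writer that has started doing another role's job.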
Human-in-the-loop gates
Insert approval steps before irreversible actions (refunds, deploys, outbound email). The orchestrator pauses the graph, surfaces a structured summary, and resumes on human signal.
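The pause-and-resume behavior can be modeled as a small state machine. The `Status` values, the `IRREVERSIBLE` set, and the function names are illustrative assumptions; the key property is that nothing executes while the action is waiting.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    DONE = "done"
    REJECTED = "rejected"

IRREVERSIBLE = {"refund", "deploy", "send_email"}

@dataclass
class PendingAction:
    name: str
    summary: str                 # structured summary shown to the approver
    status: Status = Status.RUNNING
    executed: bool = False

def advance(action: PendingAction) -> PendingAction:
    """Run reversible actions immediately; park irreversible ones for review."""
    if action.name in IRREVERSIBLE:
        action.status = Status.AWAITING_APPROVAL   # pause before any side effect
    else:
        action.executed, action.status = True, Status.DONE
    return action

def resume(action: PendingAction, approved: bool) -> PendingAction:
    if action.status is not Status.AWAITING_APPROVAL:
        raise RuntimeError("action is not waiting for approval")
    if approved:
        action.executed, action.status = True, Status.DONE
    else:
        action.status = Status.REJECTED
    return action

refund = advance(PendingAction("refund", "Refund $620 on order ORD-1001"))
paused_before_execution = (
    refund.status is Status.AWAITING_APPROVAL and not refund.executed
)
resume(refund, approved=True)
```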
Task decomposition and routing
The orchestrator’s core job is turning a fuzzy user request into routable subtasks.
Planning styles
- Upfront plan — supervisor emits full task list before execution; easy to audit, brittle when the world changes mid-run.
- Reactive replanning — supervisor revises after each worker result; handles surprises, costs more tokens.
- Router-only — classify intent and send to one specialist; not full multi-step orchestration but often sufficient.
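The router-only style is the simplest of the three to sketch. The keyword table below is a toy stand-in for an intent classifier (in practice often a small model), and the role names are assumptions.

```python
# Hypothetical keyword table; a real router would usually use a classifier.
ROUTES: dict[str, tuple[str, ...]] = {
    "billing": ("refund", "invoice", "charge"),
    "shipping": ("tracking", "delivery", "package"),
    "security": ("password", "login", "2fa"),
}

def route(message: str, default: str = "human") -> str:
    """Pick the specialist with the most keyword hits, else fall back."""
    text = message.lower()
    scores = {role: sum(word in text for word in words)
              for role, words in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default

specialist = route("Where is my package? Tracking shows nothing.")
fallback = route("hello there")
```

The explicit fallback matters: a router that always picks some specialist will send unclear requests to the wrong one.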
Specialist design
Each specialist gets: a role name, a tight system prompt, an allowlisted tool subset, and optional output schema. A BillingAgent should not see DeployAgent tools. Use MCP servers or namespaced function groups to enforce boundaries in code, not just in prompts.
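Enforcing boundaries in code can be as simple as checking an allowlist at the single point where tools are invoked. The tool names and the `call_tool` helper are hypothetical; with MCP, the equivalent is deciding which servers each agent connects to.

```python
from typing import Any, Callable

# Hypothetical tool registry; real tools would call external APIs.
TOOLS: dict[str, Callable[..., Any]] = {
    "get_invoice": lambda invoice_id: {"invoice_id": invoice_id, "total": 42.0},
    "issue_refund": lambda invoice_id, amount: {"refunded": amount},
    "deploy_service": lambda service: {"deployed": service},
}

ALLOWLIST: dict[str, set[str]] = {
    "BillingAgent": {"get_invoice", "issue_refund"},
    "DeployAgent": {"deploy_service"},
}

def call_tool(role: str, tool: str, **kwargs: Any) -> Any:
    """Single choke point: refuse any tool not allowlisted for the role."""
    if tool not in ALLOWLIST.get(role, set()):
        raise PermissionError(f"{role} may not call {tool}")
    return TOOLS[tool](**kwargs)

invoice = call_tool("BillingAgent", "get_invoice", invoice_id="inv-1")
```

An unknown role gets an empty allowlist, so the default is to deny.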
Termination conditions
Set max steps per run, max spend in USD, max wall-clock time, and explicit FINISH signals. Orchestrators that never declare done will loop until budget exhaustion.
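The four termination conditions can be enforced by one loop that checks limits before every step. In this sketch `Budget`, `run`, and the `(signal, cost_usd)` return shape of `step` are assumptions, not an established interface.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Budget:
    max_steps: int
    max_usd: float
    max_seconds: float
    steps: int = 0
    spent_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def stop_reason(self) -> Optional[str]:
        if self.steps >= self.max_steps:
            return "max_steps"
        if self.spent_usd >= self.max_usd:
            return "max_usd"
        if time.monotonic() - self.started >= self.max_seconds:
            return "max_seconds"
        return None

def run(step: Callable[[], tuple[str, float]], budget: Budget) -> str:
    """Call step() until it signals FINISH or a limit trips."""
    while (reason := budget.stop_reason()) is None:
        signal, cost_usd = step()
        budget.steps += 1
        budget.spent_usd += cost_usd
        if signal == "FINISH":
            return "finished"
    return reason

# A step that never finishes is stopped by the step cap.
looping = run(lambda: ("CONTINUE", 0.01),
              Budget(max_steps=5, max_usd=1.0, max_seconds=30))
```

Returning the reason lets the caller distinguish a clean finish from a budget stop and choose a fallback.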
Runtime, observability, and cost control
Multi-agent runs are distributed systems with nondeterministic components. Treat them accordingly.
Tracing
Assign a run_id and log every agent transition: who ran, input token count, tools called, output summary, latency, and model ID. Visualize as a DAG so failures map to a specific node. OpenTelemetry spans per agent step are worth the setup at production scale.
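Before adopting OpenTelemetry, the same record shape can be prototyped with the standard library. This sketch is not the OpenTelemetry API; `Tracer`, its `step` context manager, and the `small-model` ID are hypothetical.

```python
import json
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Collects one record per agent step, all sharing a run_id."""

    def __init__(self) -> None:
        self.run_id = str(uuid.uuid4())
        self.records: list[dict] = []

    @contextmanager
    def step(self, agent: str, model_id: str, input_tokens: int):
        record = {"run_id": self.run_id, "agent": agent, "model_id": model_id,
                  "input_tokens": input_tokens, "tools": [], "output_summary": None}
        start = time.perf_counter()
        try:
            yield record   # the agent fills in tools and output_summary
        finally:
            # Recorded even if the step raises, so failures stay traceable.
            record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
            self.records.append(record)

tracer = Tracer()
with tracer.step("TriageAgent", "small-model", input_tokens=120) as rec:
    rec["tools"].append("ticket_metadata")
    rec["output_summary"] = "intent=refund_missing_item"

log_line = json.dumps(tracer.records[0])   # one JSON line per transition
```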
Idempotency and retries
Workers retry on transient API errors with exponential backoff. Side-effecting tools (charges, tickets) require idempotency keys so a retry does not double-apply.
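The interaction between retries and idempotency keys is easiest to see with a fake backend that applies the write and then loses the response. `FakeRefundAPI`, `TransientError`, and the key format are invented for this sketch; real payment APIs define their own idempotency mechanisms.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class TransientError(Exception):
    pass

def retry(fn: Callable[[], T], attempts: int = 4, base_delay: float = 0.1,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry fn on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)   # 0.1s, 0.2s, 0.4s, ...
    raise RuntimeError("unreachable")

class FakeRefundAPI:
    """Hypothetical backend that deduplicates on the idempotency key."""

    def __init__(self) -> None:
        self.applied: dict[str, float] = {}
        self.calls = 0

    def refund(self, key: str, amount: float) -> float:
        self.calls += 1
        if key not in self.applied:
            self.applied[key] = amount          # the side effect happens once
        if self.calls == 1:
            raise TransientError("response lost after the write succeeded")
        return self.applied[key]

api = FakeRefundAPI()
result = retry(lambda: api.refund("refund-ORD-1001", 25.0), sleep=lambda s: None)
```

The second call succeeds without applying a second refund because the key was already recorded.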
Cost budgeting
Track spend per run and per agent role. Debate patterns and deep hierarchies multiply calls — cap parallel workers and use smaller models for routing and summarization, reserving frontier models for final synthesis or high-risk steps.
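Per-role spend tracking needs little more than a ledger keyed by role. The prices and model tier names below are made-up placeholders; substitute your provider's actual rates.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"small": 0.001, "frontier": 0.01}

class CostLedger:
    def __init__(self, run_cap_usd: float) -> None:
        self.run_cap_usd = run_cap_usd
        self.by_role: dict[str, float] = defaultdict(float)

    def record(self, role: str, model: str, tokens: int) -> None:
        self.by_role[role] += PRICE_PER_1K[model] * tokens / 1000

    @property
    def total(self) -> float:
        return sum(self.by_role.values())

    def over_budget(self) -> bool:
        return self.total > self.run_cap_usd

ledger = CostLedger(run_cap_usd=0.05)
ledger.record("router", "small", 2_000)          # cheap model for routing
ledger.record("synthesizer", "frontier", 4_000)  # frontier model for synthesis
```

Breaking spend down by role makes it visible when one specialist, rather than the run as a whole, is driving cost.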
Safety
Apply guardrails at orchestrator boundaries: PII redaction before cross-agent messages, tool sandboxing per role, and output filters before user delivery. A compromised worker should not inherit another role’s privileges.
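Redaction before cross-agent messages can start as a filter applied to every outgoing payload. The two regular expressions below are deliberately rough assumptions for illustration; production systems usually rely on a dedicated PII detection service.

```python
import re

# Rough illustrative patterns, not production-grade detectors.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")   # 13 to 16 digits, optional separators

def redact(text: str) -> str:
    """Mask emails and card-like digit runs before a message leaves an agent."""
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)

clean = redact("Contact jane@example.com about card 4242 4242 4242 4242.")
```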
Worked example: Harbor Support tiered escalation
Harbor Support handles 12,000 monthly tickets across billing, shipping, and account security. A monolithic agent hallucinated refund policies and called the wrong Stripe endpoints. The team rebuilt with orchestrated specialists.
Roles
- TriageAgent — classifies intent, extracts order ID, sets urgency; tools: ticket metadata only.
- PolicyAgent — RAG over policy docs; outputs cited clauses with doc slugs; no customer write tools.
- ActionAgent — executes refunds and reships via idempotent APIs when PolicyAgent citations support the action.
- ComposerAgent — drafts customer-facing reply from blackboard facts; no direct API access.
- Supervisor — plans steps, routes, and halts for human approval on refunds above $500.
Flow
User message → Triage writes intent=refund_missing_item and order_id to blackboard → PolicyAgent retrieves return-window policy → if eligible, Supervisor requests human approval or auto-approves under threshold → ActionAgent calls refund API with idempotency key → ComposerAgent drafts reply citing policy slug → Supervisor delivers. Average steps dropped from 14 (monolith) to 6; policy citation rate rose from 61% to 94%; erroneous refund attempts fell to near zero.
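The flow can be sketched end to end with stubbed agents. Everything here is illustrative: the agent functions return canned values, `ORD-1001` and `returns-window` are placeholder identifiers, and only the $500 approval threshold comes from the role descriptions above.

```python
from typing import Callable

AUTO_APPROVE_LIMIT_USD = 500.0   # refunds above this wait for a human

def triage(message: str, board: dict) -> None:
    # Stub: a real TriageAgent would classify intent and extract the order ID.
    board["intent"] = "refund_missing_item"
    board["order_id"] = "ORD-1001"

def policy(board: dict) -> None:
    # Stub: a real PolicyAgent would retrieve and cite policy clauses.
    board["eligible"] = True
    board["policy_slug"] = "returns-window"

def act(board: dict) -> None:
    # Stub: a real ActionAgent would call the refund API with this key.
    board["idempotency_key"] = f"refund-{board['order_id']}"
    board["refunded"] = True

def compose(board: dict) -> str:
    return f"Refund issued for {board['order_id']} under policy {board['policy_slug']}."

def supervise(message: str, amount_usd: float,
              human_approves: Callable[[dict], bool]) -> str:
    board: dict = {"amount_usd": amount_usd}
    triage(message, board)
    policy(board)
    if not board["eligible"]:
        return "Not eligible under current policy."
    if amount_usd > AUTO_APPROVE_LIMIT_USD and not human_approves(board):
        return "Escalated: refund awaiting human decision."
    act(board)                    # runs only after the gate
    return compose(board)

small = supervise("My item never arrived", 40.0, human_approves=lambda b: False)
large = supervise("My item never arrived", 900.0, human_approves=lambda b: False)
```

The gate sits before `act`, so an unapproved large refund never reaches the refund call.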
What failed first
The initial debate topology (PolicyAgent and ActionAgent arguing for three rounds) looked rigorous but added 40 seconds and $0.18 per ticket. The team replaced it with a single PolicyAgent pass plus a deterministic rule check on cited clause IDs, which kept the same safety at one-third the cost.
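A deterministic check on cited clause IDs could be a plain lookup, with no model call involved. The clause IDs and the `CLAUSE_ALLOWS` registry are hypothetical; the source does not describe how Harbor Support's rule check was built.

```python
# Hypothetical clause registry: clause ID -> actions that clause authorizes.
CLAUSE_ALLOWS: dict[str, set[str]] = {
    "returns-3.2": {"refund"},
    "shipping-1.4": {"reship"},
}

def action_supported(action: str, cited_clause_ids: list[str]) -> bool:
    """True only if at least one cited clause exists and authorizes the action."""
    return any(action in CLAUSE_ALLOWS.get(clause_id, set())
               for clause_id in cited_clause_ids)

refund_ok = action_supported("refund", ["returns-3.2"])
refund_on_wrong_clause = action_supported("refund", ["shipping-1.4"])
refund_on_invented_clause = action_supported("refund", ["made-up-9.9"])
```

An invented clause ID simply fails the lookup, which is how a check like this blocks actions based on hallucinated citations.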
Topology decision table
| Scenario | Recommended topology | Why |
|---|---|---|
| Large tool catalog, varied intents | Orchestrator-worker with router | Specialists see only relevant tools; supervisor plans steps |
| Fixed ETL-style workflow | Sequential pipeline | Predictable, testable, minimal planner tokens |
| Independent data sources | Parallel fan-out / fan-in | Latency bounded by slowest worker, not sum of steps |
| High-stakes policy or security | Generator + critic (1–2 rounds) | Second pass catches unsupported claims |
| Simple FAQ, one domain | Single agent or router-only | Multi-agent overhead not justified |
| Cross-department enterprise tasks | Hierarchical (2 levels max) | Department leads scope tools; avoid deep trees |
Common pitfalls
- Agents without boundaries — every agent has every tool; behavior collapses to a confused monolith with extra latency.
- Full transcript forwarding — context windows fill with noise; workers lose the actual task.
- No termination policy — supervisor replans forever on ambiguous tickets.
- Debate by default — multi-round critique on low-risk tasks burns budget without measurable quality gain.
- Implicit shared state — agents assume peers read chat history that was never written to the blackboard.
- Missing run traces — impossible to debug which specialist hallucinated or called the wrong API.
- Human gates too late — approval after irreversible API calls instead of before.
Production checklist
- Document topology diagram (agents, edges, termination) before implementation.
- Per-role tool allowlists enforced in code, not prompts alone.
- Handoff schemas with required fields validated at runtime.
- Shared blackboard with typed keys and TTL per run.
- Max steps, max cost, and max wall-clock enforced by orchestrator.
- Structured tracing with run_id across all agent steps.
- Idempotency keys on all side-effecting tools.
- Human approval gates on high-risk actions with clear resume semantics.
- Regression tests per specialist in isolation plus end-to-end golden runs.
- Cost dashboard per role; alert on p95 spend anomalies.
- Fallback to single-agent or human when orchestration exhausts budget.
- Quarterly review: collapse roles if two agents always run sequentially with no routing benefit.
Key takeaways
- Multi-agent orchestration trades complexity for specialization, parallelism, and review — not every product needs it.
- Supervisor-worker is the default; pipelines and parallel fan-out cover most production shapes.
- Shared state and explicit handoff contracts matter more than clever prompts.
- Observability and cost caps are non-negotiable when multiple LLM calls chain per user request.
- Start with a monolith or fixed workflow; split agents only when metrics justify the split.
Related reading
- AI agents and tool use explained — single-agent ReAct loops and when they suffice
- LLM agent memory explained — working, episodic, and semantic memory tiers
- Model Context Protocol (MCP) explained — standardizing tool servers for agents
- LLM guardrails explained — safety filters at orchestration boundaries