Guide
LLM function calling explained
A customer asks Harbor Commerce support: “Where is order HC-88421 and can I change
the shipping address?” A plain chat model can only guess. Function calling
(also called tool use) lets the model emit a structured request —
lookup_order(order_id="HC-88421") — that your application executes and
feeds back as a tool result message. The model then composes a grounded answer from real
data. This pattern powers order lookup, calendar booking, database queries, code execution,
and the action layer behind
autonomous agents.
This guide covers JSON Schema tool definitions, the multi-turn call loop, parallel versus
sequential execution, provider API differences (OpenAI, Anthropic, Google Gemini), strict
mode and validation, security guardrails, a Harbor Commerce order assistant worked example,
a decision table versus
structured outputs
and ReAct prompting, common pitfalls, and a production checklist — assuming basic
familiarity with
prompt engineering.
What function calling is
Function calling is a contract between your application and an LLM: you declare a set of tools (functions the model may invoke), each with a name, natural-language description, and a JSON Schema describing its parameters. At inference time the model can either reply with user-facing text or return a tool call — a JSON object naming the function and supplying arguments that conform to the schema.
Your code — not the model — runs the actual function. You append a
tool role message (or provider-specific equivalent) with the result, then call
the model again. This loop continues until the model produces a final natural-language
answer or you hit a step limit. The model never directly touches your database; it only
proposes typed actions your server validates and executes.
Function calling vs structured outputs
Both patterns constrain model output to JSON. The difference is intent:
- Structured outputs — the model’s entire response is one JSON document matching a schema (classification labels, extracted entities, report fields). Best when you need a single parsed object per turn.
- Function calling — the model chooses which of several registered tools to invoke, with arguments, often across multiple turns. Best when the action space is dynamic and the model must decide what to do next.
Many production systems combine both: function calling for the agent loop, structured outputs inside individual tool handlers that return normalized data.
Designing tool schemas
Schema quality determines reliability. Treat each tool like a public API endpoint: narrow scope, explicit types, and descriptions written for the model, not just developers.
Schema anatomy
- name — snake_case identifier; verbs help (
get_order_status, notorder). - description — when to use this tool, what it returns, and when not to use it. Mention required ID formats.
- parameters — JSON Schema object with
type,properties,required, and per-field descriptions.
Example tool definition (OpenAI-compatible shape):
{
"type": "function",
"function": {
"name": "lookup_order",
"description": "Fetch order status, items, and shipping for a Harbor Commerce order ID (format HC-#####). Use when the user asks about a specific order.",
"parameters": {
"type": "object",
"properties": {
"order_id": {
"type": "string",
"pattern": "^HC-\\d{5}$",
"description": "Harbor order ID, e.g. HC-88421"
}
},
"required": ["order_id"],
"additionalProperties": false
}
}
}
Schema design rules
- Prefer fewer, focused tools over one mega-function with twenty optional parameters.
- Use
enumfor closed sets (shipping carriers, ticket priorities). - Set
additionalProperties: falsewhen providers support strict mode — reduces hallucinated fields. - Return compact tool results; truncate large lists and paginate via follow-up calls.
- Version tools in the name (
search_products_v2) when breaking changes ship.
The multi-turn call loop
A minimal production loop looks like this:
- Send system prompt + conversation history + tool definitions to the model.
- If the response contains tool calls, parse each call’s name and arguments.
- Validate arguments against your schema (and business rules) before execution.
- Execute handlers server-side with authenticated service credentials.
- Append tool result messages and call the model again.
- Repeat until the model returns text only, or cap iterations (typically 5–10).
Parallel vs sequential tool calls
Modern models can emit multiple tool calls in one turn when requests are
independent — e.g. lookup_order and get_shipping_options
for the same user message. Execute independent calls concurrently to reduce latency.
Dependent calls (create draft, then confirm) must stay sequential across turns.
Always enforce a max parallel fan-out and total step budget. Unbounded loops are a common source of runaway token bills and accidental API hammering.
Error handling in the loop
When a tool fails, return a structured error string in the tool message
({"error": "order_not_found", "order_id": "HC-99999"}). Models recover well
from explicit errors; silent empty results encourage fabrication. Log failures with
correlation IDs for debugging.
Provider API differences
The conceptual loop is the same; wire formats differ. Abstract behind a thin adapter if you swap providers.
OpenAI
Chat Completions and the newer Responses API accept a tools array. Set
tool_choice to auto, required, or a specific
function. Strict mode (strict: true on function definitions)
guarantees schema adherence via constrained decoding — use it for production when
available. Parallel tool calls are enabled by default; disable with
parallel_tool_calls: false when ordering matters within a single turn.
Anthropic
Claude uses a tools array with input_schema (JSON Schema).
Tool use blocks appear in the assistant message; you reply with tool_result
blocks. tool_choice supports auto, any, or a
named tool. Claude is strong at multi-step reasoning; still cap tool iterations.
Google Gemini
Gemini exposes functionDeclarations in tools. The model returns
functionCall parts; respond with functionResponse. Mode
ANY forces a tool call; AUTO lets the model choose. Validate
argument shapes — provider strictness varies by model version.
Frameworks like
LangChain
normalize these differences behind bind_tools() and agent executors, at the
cost of abstraction leakage when providers ship new features first.
Security and guardrails
Function calling turns natural language into executable actions. Treat it as a privilege boundary, not a convenience feature.
- Allowlist tools per user role — guests get
lookup_order; agents getissue_refundbehind approval workflows. - Never trust model-generated SQL or shell — wrap data access in parameterized handlers; reject raw query strings.
- Authenticate outbound calls with service accounts, not user-supplied tokens embedded in prompts.
- Rate-limit and audit destructive tools (refunds, deletes, sends).
- Human-in-the-loop for irreversible actions above a dollar threshold.
- Prompt injection defense — untrusted document text in RAG context can trick the model into calling tools; separate system instructions from retrieved content and scan for override attempts.
Pair function calling with input/output filters and structured logging. The model is an untrusted planner; your handlers are the trusted execution layer.
Worked example: Harbor Commerce order assistant
Harbor Commerce ships a support chatbot backed by three tools:
lookup_order(order_id)— returns status, line items, tracking URL.update_shipping_address(order_id, address)— allowed only beforeshippedstatus; requires address validation.create_return_ticket(order_id, reason)— opens a Zendesk ticket; idempotent on duplicate calls.
User: “HC-88421 hasn’t arrived — can you check and start a return if it’s lost?”
Turn 1: Model calls lookup_order("HC-88421"). Handler returns
{"status": "in_transit", "carrier": "UPS", "eta": "2026-06-11", "tracking": "..."}.
Turn 2: Model responds in natural language with ETA and tracking link;
does not open a return because status is not delivered/lost. If the user
insists, turn 3 may call create_return_ticket after confirming eligibility
rules in the system prompt.
Key design choices: shipping updates are blocked at the handler (not just the prompt);
return tickets log order_id + conversation_id; tool results
strip PII before logging. Average resolution: two model round-trips, ~1,200 input tokens
with trimmed order payloads.
Decision table: when to use what
| Need | Function calling | Structured outputs | ReAct / text tools |
|---|---|---|---|
| Dynamic tool selection from a registry | Best fit | Poor | OK |
| Single JSON extraction per turn | Overkill | Best fit | Poor |
| Multi-step workflows with branching | Best fit | Manual orchestration | OK for prototypes |
| Provider-native strict schema guarantees | OpenAI strict tools | Structured outputs API | None |
| Debugging transparency | Typed call logs | Single blob | Parse action lines |
| Latency-sensitive single lookup | One call OK | Often faster | Slower |
Common pitfalls
- Tool sprawl — twenty overlapping tools confuse the model; merge or namespace by domain.
- Vague descriptions — “search database” causes wrong-tool picks; specify inputs and scope.
- Returning huge JSON blobs — blows context and cost; summarize and offer pagination tools.
- No server-side validation — models invent plausible but invalid IDs; always validate before IO.
- Unbounded loops — model retries the same failed call forever; cap steps and detect repetition.
- Mixing user content into system prompts — injection via pasted order notes or email bodies.
- Ignoring latency — serial tool calls across five round-trips feel broken in chat; parallelize and stream interim status.
Production checklist
- Define tools with JSON Schema,
additionalProperties: false, and model-facing descriptions. - Implement validate-then-execute handlers; never eval model output.
- Cap tool iterations and parallel fan-out per request.
- Return structured errors in tool messages; log with correlation IDs.
- Enable provider strict mode where supported for argument conformance.
- Role-scope tools; require approval for destructive operations.
- Truncate and redact tool results before re-prompting and logging.
- Monitor tool call rates, failure types, and token cost per conversation.
- Integration-test golden paths (happy path, not found, permission denied).
- Document tool registry versions and deprecation windows for downstream agents.
Key takeaways
- Function calling lets models propose typed actions; your server executes and returns results.
- Invest in schema design and descriptions — they matter more than clever system prompts.
- The production loop is: call → validate → execute → append tool results → repeat until done.
- Use strict mode and server validation together; neither alone is sufficient for security.
- Pair tools with frameworks like LangChain or MCP when the action space grows beyond a handful of functions.
Related reading
- AI agents and tool use explained — ReAct loops, memory, and multi-agent handoffs
- LLM structured outputs explained — when to constrain full responses to JSON Schema
- LangChain fundamentals explained — bind_tools, agents, and LCEL composition
- RAG explained — retrieval as a tool alongside function calling