Tool calling is how a large language model (LLM) requests work it cannot do on its own, like fetching live data, running code, or hitting an API. The model does not execute anything itself. It returns a structured JSON object naming a tool and its arguments. Your application runs the tool and feeds the result back. This FAQ answers 15 of the most common engineering questions about how tool calling actually works at the API level, where it differs from MCP and structured outputs, and why it fails in production.
## What is LLM tool calling and how does it work?
Tool calling is a structured generation mode in which an LLM produces a JSON object describing a function to invoke and the arguments to pass. The runtime, not the model, executes the function. The result is sent back to the model as a new message. This loop is how agents do anything useful: search the web, query a database, run code, or call your own API.
### What is tool calling in LLMs?
Tool calling is a feature where an LLM, given a list of tool definitions in JSON Schema, decides whether a user request requires an external action and emits a structured call describing the tool name and arguments. Per the Anthropic tool use docs, the model returns a `tool_use` content block; per the OpenAI function calling guide, it returns a `tool_calls` array. Your code parses, executes, and returns the result. The model never runs the tool itself.
### How does tool calling actually work at the API level?
At the API level, tool calling is a four-step loop:
- Define tools. Pass a `tools` array of JSON Schema definitions in your request.
- Model decides. The LLM responds with either text or a structured tool-call block (`tool_use` for Claude, `tool_calls` for OpenAI).
- Execute locally. Your runtime parses the call, runs the function, and captures the result.
- Return the result. Send a follow-up message containing the tool result. The model continues the conversation with that data in context.
The Berkeley team that built BFCL calls this the canonical pattern.
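Here is a minimal sketch of that loop using the Anthropic Python SDK. The model name is a placeholder, and `run_tool` stands in for your own dispatcher; the SDK calls themselves follow the Claude Messages API:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]
messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# Steps 1-2: send tool definitions; the model replies with text or tool_use blocks.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use whatever model you run
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

if response.stop_reason == "tool_use":
    block = next(b for b in response.content if b.type == "tool_use")

    # Step 3: execute locally. run_tool is your own code, not an SDK call.
    result = run_tool(block.name, block.input)

    # Step 4: return the result; the model continues with it in context.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(result),
        }],
    })
    final = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
```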
### What schema format do tool definitions use?
Tool definitions are written in JSON Schema. A minimal Anthropic example from the Claude API docs:
```json
{
  "name": "get_weather",
  "description": "Get the current weather in a given location",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city and state, e.g. San Francisco, CA"
      }
    },
    "required": ["location"]
  }
}
```
OpenAI uses the same JSON Schema under a `parameters` key instead of `input_schema`. Descriptions matter: the model uses them to decide when and how to invoke the tool.
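For comparison, a sketch of the same tool in OpenAI's chat completions format, with the definition nested under `function` and the schema under `parameters`:

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        }
      },
      "required": ["location"]
    }
  }
}
```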
## How is tool calling different from related concepts?
Three terms get conflated constantly: function calling, tool calling, and MCP. Function calling and tool calling are the same primitive under different vendor names. MCP is a transport protocol that sits one layer up. Structured outputs are a separate feature that constrains the model's text response to a schema, with no external execution involved.
### What's the difference between function calling and tool calling?
Nothing meaningful. They describe the same primitive. OpenAI shipped the feature in June 2023 as function calling. Anthropic launched theirs in 2024 and named it tool use. Per Prefect's MCP vs Function Calling explainer, the two are interchangeable in practice. Most providers now standardize on `tools` and `tool_calls` in the API surface, even when their docs still say 'functions.' If you see the words 'function call,' 'tool call,' or 'tool use,' assume they mean the same thing unless the doc specifies a difference.
### Is MCP the same as tool calling?
No. The Model Context Protocol, released by Anthropic in November 2024 and donated to the Linux Foundation in December 2025, is a transport standard that exposes tools to any model. Function calling is the model's act of requesting a tool. MCP is how that tool is served. As Descope's MCP vs Function Calling guide puts it: function calling decides what to do, MCP makes the tool reliably available across vendors. They are complementary, not competing.
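To make the layering concrete, here is a minimal sketch of serving that same weather tool over MCP with the official Python SDK's `FastMCP` helper; the server name and the stub implementation are illustrative:

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")  # illustrative server name

@mcp.tool()
def get_weather(location: str) -> str:
    """Get the current weather in a given location."""
    return f"Sunny in {location}"  # stand-in for a real lookup

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

Any MCP-aware client can now discover and call this tool; the model's function-calling behavior is unchanged.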
### What's the difference between structured outputs and tool calls?
Structured outputs constrain the model's response to a JSON Schema. Tool calls constrain the model's request to call an external function. Per OpenAI's Structured Outputs guide, use `response_format` when you want clean JSON back from the model itself, like data extraction. Use `tools` when the model needs to invoke something external, like a database query. Both can use `strict: true` to guarantee schema compliance via constrained decoding. They are different features that solve different problems.
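A sketch of the structured-output side, using OpenAI's `response_format` with a JSON Schema (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No tool is executed here: the schema constrains the model's own reply.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "Extract the city from: 'I live in Oslo.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "strict": True,  # constrained decoding: output must match the schema
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"city": "Oslo"}
```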
## How do you control tool calling behavior?
Three knobs determine how the model uses tools: parallel execution (can it call several at once?), `tool_choice` (must it call one?), and strict mode (must arguments match the schema exactly?). Tuning these is the difference between a working agent and one that loops, stalls, or returns garbage.
### Can an LLM call multiple tools in parallel?
Yes, on modern frontier models. Per Anthropic's advanced tool use post, Claude 4 returns multiple `tool_use` blocks in a single assistant turn, and minor prompting pushes parallel-call success near 100%. OpenAI's GPT-4 and GPT-5 families return a `tool_calls` array with multiple entries by default. Your runtime should execute these concurrently (e.g., `asyncio.gather` in Python) rather than serially. Anthropic also offers Programmatic Tool Calling, which lets Claude orchestrate parallel tools through generated Python instead of round-trips.
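A sketch of the concurrent-execution side, assuming you have already parsed the assistant turn into blocks with `.name` and `.input` attributes (the handlers here are stand-ins for real tool implementations):

```python
import asyncio

async def fetch_weather(location: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real HTTP call
    return f"Sunny in {location}"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.1)
    return f"Headlines about {topic}"

TOOL_HANDLERS = {"get_weather": fetch_weather, "get_news": fetch_news}

async def execute_all(tool_blocks):
    """Run every tool call from one assistant turn concurrently."""
    async def run_one(block):
        return await TOOL_HANDLERS[block.name](**block.input)

    # return_exceptions=True keeps one failing tool from cancelling the rest;
    # results come back in the same order as tool_blocks.
    return await asyncio.gather(
        *(run_one(b) for b in tool_blocks),
        return_exceptions=True,
    )
```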
### How do you force an LLM to use a specific tool?
Use the `tool_choice` parameter. Both APIs support three modes:

| Mode | Anthropic | OpenAI |
|---|---|---|
| Model decides | `{"type": "auto"}` | `"auto"` |
| Must call any tool | `{"type": "any"}` | `"required"` |
| Must call specific tool | `{"type": "tool", "name": "X"}` | `{"type": "function", "function": {"name": "X"}}` |
Docs: the Claude `tool_choice` cookbook and OpenAI's `tool_choice: required` announcement. Forcing a tool is useful for routing, classification, and structured extraction where you want the model's analytical work but a guaranteed shape.
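A minimal sketch of forcing a specific tool on the Anthropic API (the model name is a placeholder, and `tools` is the weather definition from the schema example above):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder
    max_tokens=1024,
    tools=tools,  # the get_weather definition from earlier
    tool_choice={"type": "tool", "name": "get_weather"},  # must call this tool
    messages=[{"role": "user", "content": "Is it jacket weather in Oslo?"}],
)

# OpenAI equivalent:
# tool_choice={"type": "function", "function": {"name": "get_weather"}}
```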
### What is strict mode in tool calling?
Strict mode constrains the model's decoding so its arguments are guaranteed to match the JSON Schema. Per OpenAI's Structured Outputs announcement, enabling `strict: true` on a tool means every call passes schema validation, eliminating the retry loop you would otherwise need for malformed JSON. Anthropic supports a similar mode for output. The tradeoff: strict schemas must be 'closed' (no additional properties, all fields explicitly defined). OpenAI now recommends strict mode by default for production tool definitions.
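A sketch of what a strict OpenAI tool definition looks like; note the closed schema, with `additionalProperties: false` and every property listed in `required`:

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "strict": true,
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"}
      },
      "required": ["location"],
      "additionalProperties": false
    }
  }
}
```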
## Why do tool calls fail in production?
Tool calls fail in three predictable ways: the model invents arguments (hallucination), the JSON is malformed (schema validation), or the tool itself errors (execution). Each has a different fix. The first is a prompting and schema problem, the second is solved by strict mode, the third by graceful error returns to the model.
### Why do LLMs hallucinate tool calls?
LLMs hallucinate tool calls when the schema is ambiguous, the description is weak, or required arguments are not present in the conversation. Per OpenAI's research on why models hallucinate, training objectives reward confident guessing over calibrated uncertainty, so the model fills in plausible-looking arguments rather than asking. The HalluLens benchmark measured tool-relevant hallucination at 1.5% for GPT-4o and 4.6% for Claude-3.5-Sonnet. Mitigations: tighter schemas, explicit `required` fields, strict mode, and a clarifying-question pattern in the system prompt.
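For illustration, a hypothetical `set_thermostat` tool tightened against invented arguments: enums and numeric bounds shrink the space of plausible-looking guesses, and the description states the trigger condition explicitly:

```json
{
  "name": "set_thermostat",
  "description": "Set the thermostat. Only call this when the user states an explicit target temperature.",
  "input_schema": {
    "type": "object",
    "properties": {
      "temperature": {"type": "integer", "minimum": 10, "maximum": 30},
      "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
    },
    "required": ["temperature", "unit"]
  }
}
```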
### How do you handle tool call errors?
Return the error as a tool result, not as an exception. Anthropic's API accepts an `is_error: true` flag on the `tool_result` block, which signals to the model that the call failed and it should adapt; with OpenAI, you return the error text as the content of the tool message. Example pattern (Anthropic-style):
```python
import json

is_error = False  # must be initialized: the success path never sets it
try:
    result = run_tool(name, args)  # name/args come from the parsed tool_use block
    content = json.dumps(result)
except Exception as e:
    content = f"Error: {e}"
    is_error = True

# Feed the outcome back as a tool_result so the model can adapt.
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_id,  # id from the model's tool_use block
        "content": content,
        "is_error": is_error,
    }]
})
```
The model will read the error and either retry with corrected arguments or explain the failure to the user.
### How do you debug a broken tool call?
Start with the tool definition, not the model. Most broken tool calls trace to four causes:
- Vague description. The model does not know when to use the tool. Rewrite the description to specify the trigger condition.
- Loose schema. Missing `required` fields or generic `string` types let the model invent values. Add enums and constraints.
- No examples in the system prompt. A single shot of 'when the user asks X, call Y with these args' fixes most ambiguity.
- Wrong `tool_choice`. If the model is calling the wrong tool, try `tool_choice: required` for one turn to force any tool, then inspect.
The BFCL leaderboard is also a useful sanity check for whether your model handles your call complexity.
## Which models and frameworks support tool calling?
Tool calling is now a baseline feature across every major frontier provider and most open-source models. The differentiators are reliability under parallel calls, multi-turn coherence, and strict-mode support. The Berkeley Function Calling Leaderboard (BFCL) is the de facto benchmark.
### Which LLMs are best at tool calling?
As of mid-2026, the top-ranked models on the Berkeley Function Calling Leaderboard are GPT-5 and Claude 4 / 4.5, with Gemini 2.5 close behind. The benchmark covers 2,000+ question-function-answer pairs across simple calls, parallel calls, multi-turn dialogues, and 'should-not-call' cases. Per the BFCL paper at ICML 2025, top models ace one-shot calls yet still stumble on long-context multi-turn tasks and 'when not to act' decisions. Pick a model that matches your call complexity, not the leaderboard headline.
### Do all LLMs support tool calling?
No. Tool calling requires the model to be trained on structured tool-use traces. Most frontier models from OpenAI, Anthropic, Google, Meta, Mistral, and DeepSeek now support it natively. Smaller open-source models often expose tool calling through wrappers that prompt the model to emit JSON, with no decoding-level guarantee. Per BentoML's open-source LLM guide, reliability varies widely. If you run a local model, validate behavior on BFCL or your own eval set before shipping.
## What's the future of tool calling and MCP?
The trend is clear: tool calling is converging on MCP as the transport layer, and frontier models are getting trained directly on MCP traces. Per the November 2025 MCP spec, the protocol now supports asynchronous tasks, statelessness, and a community registry. ChatGPT, Claude, Gemini, and Copilot all support MCP servers natively. Expect 'function calling' as a vendor-specific term to fade. The model-to-tool API will be MCP, and function calling becomes the model's internal mechanism for emitting an MCP-compatible call.