Tool calling is how a large language model (LLM) requests work it cannot do on its own, like fetching live data, running code, or hitting an API. The model does not execute anything itself. It returns a structured JSON object naming a tool and its arguments. Your application runs the tool and feeds the result back. This FAQ answers 15 of the most common engineering questions about how tool calling actually works at the API level, where it differs from MCP and structured outputs, and why it fails in production.
## What is LLM tool calling and how does it work?
Tool calling is a structured generation mode in which an LLM produces a JSON object describing a function to invoke and the arguments to pass. The runtime, not the model, executes the function. The result is sent back to the model as a new message. This loop is how agents do anything useful: search the web, query a database, run code, or call your own API.
### What is tool calling in LLMs?
Tool calling is a feature where an LLM, given a list of tool definitions in JSON Schema, decides whether a user request requires an external action and emits a structured call describing the tool name and arguments. Per the Anthropic tool use docs, the model returns a `tool_use` content block; per the OpenAI function calling guide, it returns a `tool_calls` array. Your code parses, executes, and returns the result. The model never runs the tool itself.
### How does tool calling actually work at the API level?
At the API level, tool calling is a four-step loop:
- Define tools. Pass a `tools` array of JSON Schema definitions in your request.
- Model decides. The LLM responds with either text or a structured tool-call block (`tool_use` for Claude, `tool_calls` for OpenAI).
- Execute locally. Your runtime parses the call, runs the function, and captures the result.
- Return the result. Send a follow-up message containing the tool result. The model continues the conversation with that data in context.
The Berkeley team that built BFCL calls this the canonical pattern.
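Here is a minimal sketch of that loop using the Anthropic Python SDK. The model name is a placeholder, and `run_tool` stands in for your own dispatcher; the SDK calls themselves follow the Claude Messages API:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "input_schema": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
    },
}]
messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# Steps 1-2: send tool definitions; the model replies with text or tool_use blocks.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use whatever model you run
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

if response.stop_reason == "tool_use":
    block = next(b for b in response.content if b.type == "tool_use")

    # Step 3: execute locally. run_tool is your own code, not an SDK call.
    result = run_tool(block.name, block.input)

    # Step 4: return the result; the model continues with it in context.
    messages.append({"role": "assistant", "content": response.content})
    messages.append({
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(result),
        }],
    })
    final = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        tools=tools,
        messages=messages,
    )
```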
### What schema format do tool definitions use?
Tool definitions are written in JSON Schema. A minimal Anthropic example from the Claude API docs:
```json
{
  "name": "get_weather",
  "description": "Get the current weather in a given location",
  "input_schema": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "The city and state, e.g. San Francisco, CA"
      }
    },
    "required": ["location"]
  }
}
```
OpenAI uses the same JSON Schema under a `parameters` key instead of `input_schema`. Descriptions matter: the model uses them to decide when and how to invoke the tool.
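For comparison, a sketch of the same tool in OpenAI's chat completions format, with the definition nested under `function` and the schema under `parameters`:

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {
          "type": "string",
          "description": "The city and state, e.g. San Francisco, CA"
        }
      },
      "required": ["location"]
    }
  }
}
```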
## How is tool calling different from related concepts?
Three terms get conflated constantly: function calling, tool calling, and MCP. Function calling and tool calling are the same primitive under different vendor names. MCP is a transport protocol that sits one layer up. Structured outputs are a separate feature that constrains the model's text response to a schema, with no external execution involved.
### What's the difference between function calling and tool calling?
Nothing meaningful. They describe the same primitive. OpenAI shipped the feature in June 2023 as function calling. Anthropic launched theirs in 2024 and named it tool use. Per Prefect's MCP vs Function Calling explainer, the two are interchangeable in practice. Most providers now standardize on `tools` and `tool_calls` in the API surface, even when their docs still say 'functions.' If you see the words 'function call,' 'tool call,' or 'tool use,' assume they mean the same thing unless the doc specifies a difference.
### Is MCP the same as tool calling?
No. The Model Context Protocol, released by Anthropic in November 2024 and donated to the Linux Foundation in December 2025, is a transport standard that exposes tools to any model. Function calling is the model's act of requesting a tool. MCP is how that tool is served. As Descope's MCP vs Function Calling guide puts it: function calling decides what to do, MCP makes the tool reliably available across vendors. They are complementary, not competing.
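To make the layering concrete, here is a minimal sketch of serving that same weather tool over MCP with the official Python SDK's `FastMCP` helper; the server name and the stub implementation are illustrative:

```python
# pip install mcp
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("weather")  # illustrative server name

@mcp.tool()
def get_weather(location: str) -> str:
    """Get the current weather in a given location."""
    return f"Sunny in {location}"  # stand-in for a real lookup

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```

Any MCP-aware client can now discover and call this tool; the model's function-calling behavior is unchanged.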
### What's the difference between structured outputs and tool calls?
Structured outputs constrain the model's response to a JSON Schema. Tool calls constrain the model's request to call an external function. Per OpenAI's Structured Outputs guide, use `response_format` when you want clean JSON back from the model itself, like data extraction. Use `tools` when the model needs to invoke something external, like a database query. Both can use `strict: true` to guarantee schema compliance via constrained decoding. They are different features that solve different problems.
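A sketch of the structured-output side, using OpenAI's `response_format` with a JSON Schema (the model name is a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# No tool is executed here: the schema constrains the model's own reply.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "Extract the city from: 'I live in Oslo.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "strict": True,  # constrained decoding: output must match the schema
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    },
)
print(response.choices[0].message.content)  # e.g. {"city": "Oslo"}
```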
## How do you control tool calling behavior?
Three knobs determine how the model uses tools: parallel execution (can it call several at once?), `tool_choice` (must it call one?), and strict mode (must arguments match the schema exactly?). Tuning these is the difference between a working agent and one that loops, stalls, or returns garbage.
### Can an LLM call multiple tools in parallel?
Yes, on modern frontier models. Per Anthropic's advanced tool use post, Claude 4 returns multiple `tool_use` blocks in a single assistant turn, and minor prompting pushes parallel-call success near 100%. OpenAI's GPT-4 and GPT-5 families return a `tool_calls` array with multiple entries by default. Your runtime should execute these concurrently (e.g., `asyncio.gather` in Python) rather than serially. Anthropic also offers Programmatic Tool Calling, which lets Claude orchestrate parallel tools through generated Python instead of round-trips.
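A sketch of the concurrent-execution side, assuming you have already parsed the assistant turn into blocks with `.name` and `.input` attributes (the handlers here are stand-ins for real tool implementations):

```python
import asyncio

async def fetch_weather(location: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for a real HTTP call
    return f"Sunny in {location}"

async def fetch_news(topic: str) -> str:
    await asyncio.sleep(0.1)
    return f"Headlines about {topic}"

TOOL_HANDLERS = {"get_weather": fetch_weather, "get_news": fetch_news}

async def execute_all(tool_blocks):
    """Run every tool call from one assistant turn concurrently."""
    async def run_one(block):
        return await TOOL_HANDLERS[block.name](**block.input)

    # return_exceptions=True keeps one failing tool from cancelling the rest;
    # results come back in the same order as tool_blocks.
    return await asyncio.gather(
        *(run_one(b) for b in tool_blocks),
        return_exceptions=True,
    )
```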
### How do you force an LLM to use a specific tool?
Use the `tool_choice` parameter. Both APIs support three modes:

| Mode | Anthropic | OpenAI |
|---|---|---|
| Model decides | `{"type": "auto"}` | `"auto"` |
| Must call any tool | `{"type": "any"}` | `"required"` |
| Must call specific tool | `{"type": "tool", "name": "X"}` | `{"type": "function", "function": {"name": "X"}}` |
Docs: the Claude `tool_choice` cookbook and OpenAI's `tool_choice: required` announcement. Forcing a tool is useful for routing, classification, and structured extraction where you want the model's analytical work but a guaranteed shape.
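A minimal sketch of forcing a specific tool on the Anthropic API (the model name is a placeholder, and `tools` is the weather definition from the schema example above):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder
    max_tokens=1024,
    tools=tools,  # the get_weather definition from earlier
    tool_choice={"type": "tool", "name": "get_weather"},  # must call this tool
    messages=[{"role": "user", "content": "Is it jacket weather in Oslo?"}],
)

# OpenAI equivalent:
# tool_choice={"type": "function", "function": {"name": "get_weather"}}
```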
### What is strict mode in tool calling?
Strict mode constrains the model's decoding so its arguments are guaranteed to match the JSON Schema. Per OpenAI's Structured Outputs announcement, enabling `strict: true` on a tool means every call passes schema validation, eliminating the retry loop you would otherwise need for malformed JSON. Anthropic supports a similar mode for output. The tradeoff: strict schemas must be 'closed' (no additional properties, all fields explicitly defined). OpenAI now recommends strict mode by default for production tool definitions.
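A sketch of what a strict OpenAI tool definition looks like; note the closed schema, with `additionalProperties: false` and every property listed in `required`:

```json
{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get the current weather in a given location",
    "strict": true,
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"}
      },
      "required": ["location"],
      "additionalProperties": false
    }
  }
}
```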
## Why do tool calls fail in production?
Tool calls fail in three predictable ways: the model invents arguments (hallucination), the JSON is malformed (schema validation), or the tool itself errors (execution). Each has a different fix. The first is a prompting and schema problem, the second is solved by strict mode, the third by graceful error returns to the model.
### Why do LLMs hallucinate tool calls?
LLMs hallucinate tool calls when the schema is ambiguous, the description is weak, or required arguments are not present in the conversation. Per OpenAI's research on why models hallucinate, training objectives reward confident guessing over calibrated uncertainty, so the model fills in plausible-looking arguments rather than asking. The HalluLens benchmark measured tool-relevant hallucination at 1.5% for GPT-4o and 4.6% for Claude-3.5-Sonnet. Mitigations: tighter schemas, explicit `required` fields, strict mode, and a clarifying-question pattern in the system prompt.
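For illustration, a hypothetical `set_thermostat` tool tightened against invented arguments: enums and numeric bounds shrink the space of plausible-looking guesses, and the description states the trigger condition explicitly:

```json
{
  "name": "set_thermostat",
  "description": "Set the thermostat. Only call this when the user states an explicit target temperature.",
  "input_schema": {
    "type": "object",
    "properties": {
      "temperature": {"type": "integer", "minimum": 10, "maximum": 30},
      "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
    },
    "required": ["temperature", "unit"]
  }
}
```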
### How do you handle tool call errors?
Return the error as a tool result, not as an exception. Anthropic's API accepts an `is_error: true` flag on the `tool_result` block, which signals to the model that the call failed and it should adapt; with OpenAI, you return the error text as the content of the tool message. Example pattern (Anthropic-style):
```python
import json

is_error = False  # must be initialized: the success path never sets it
try:
    result = run_tool(name, args)  # name/args come from the parsed tool_use block
    content = json.dumps(result)
except Exception as e:
    content = f"Error: {e}"
    is_error = True

# Feed the outcome back as a tool_result so the model can adapt.
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_id,  # id from the model's tool_use block
        "content": content,
        "is_error": is_error,
    }]
})
```
The model will read the error and either retry with corrected arguments or explain the failure to the user.
### How do you debug a broken tool call?
Start with the tool definition, not the model. Most broken tool calls trace to four causes:
- Vague description. The model does not know when to use the tool. Rewrite the description to specify the trigger condition.
- Loose schema. Missing `required` fields or generic `string` types let the model invent values. Add enums and constraints.
- No examples in the system prompt. A single shot of 'when the user asks X, call Y with these args' fixes most ambiguity.
- Wrong `tool_choice`. If the model is calling the wrong tool, try `tool_choice: required` for one turn to force any tool, then inspect.
The BFCL leaderboard is also a useful sanity check for whether your model handles your call complexity.
## Which models and frameworks support tool calling?
Tool calling is now a baseline feature across every major frontier provider and most open-source models. The differentiators are reliability under parallel calls, multi-turn coherence, and strict-mode support. The Berkeley Function Calling Leaderboard (BFCL) is the de facto benchmark.
### Which LLMs are best at tool calling?
As of mid-2026, the top-ranked models on the Berkeley Function Calling Leaderboard are GPT-5 and Claude 4 / 4.5, with Gemini 2.5 close behind. The benchmark covers 2,000+ question-function-answer pairs across simple calls, parallel calls, multi-turn dialogues, and 'should-not-call' cases. Per the BFCL paper at ICML 2025, top models ace one-shot calls yet still stumble on long-context multi-turn tasks and 'when not to act' decisions. Pick a model that matches your call complexity, not the leaderboard headline.
### Do all LLMs support tool calling?
No. Tool calling requires the model to be trained on structured tool-use traces. Most frontier models from OpenAI, Anthropic, Google, Meta, Mistral, and DeepSeek now support it natively. Smaller open-source models often expose tool calling through wrappers that prompt the model to emit JSON, with no decoding-level guarantee. Per BentoML's open-source LLM guide, reliability varies widely. If you run a local model, validate behavior on BFCL or your own eval set before shipping.
## What's the future of tool calling and MCP?
The trend is clear: tool calling is converging on MCP as the transport layer, and frontier models are getting trained directly on MCP traces. Per the November 2025 MCP spec, the protocol now supports asynchronous tasks, statelessness, and a community registry. ChatGPT, Claude, Gemini, and Copilot all support MCP servers natively. Expect 'function calling' as a vendor-specific term to fade. The model-to-tool API will be MCP, and function calling becomes the model's internal mechanism for emitting an MCP-compatible call.