How LLMs call tools
An LLM never runs a tool itself - it only generates text that requests one, and the harness around it parses that request, runs the operation, and feeds the result back into the conversation.
The loop
Each tool call is a full round-trip:
- The harness sends a request with the conversation so far and the list of available tools.
- The model ends its turn with a structured block naming the tool and its arguments. Nothing has run yet - it has stopped and is waiting.
- The harness executes the tool and sends a new request: the whole conversation again, plus the model's tool-call turn and the result attached to it.
- The model reads the result and continues, either requesting another tool or answering.
This repeats until the model responds with no tool calls.
Most tool-calling APIs keep no state between requests, so the message list is the only memory and the harness resends it every turn.
This also makes tool calling turn-based: the model can't receive a result in the middle of a response. Streaming lets the harness show tokens as they're generated, but the model still finishes its turn before any tool runs.
Two layers of registration
The harness tells the model which tools exist, and it happens at two layers.
On the wire, tools are a JSON field in the request body, written to the provider's spec. Anthropic's schema differs from OpenAI's, so the harness serializes to whichever one it's calling. At this layer the tools are metadata, sitting outside the prompt.
Inside the model, the provider's server takes that JSON and renders it into the tokens the model actually reads. The model never sees a field - it sees text, placed in or near the system prompt in the format it was fine-tuned on.1
So the harness registers tools in the provider's format, and the server builds the real prompt from it. The format is a public contract; the prompt the server generates is internal, and the provider can change it without touching your API call.
Native calling and hand-rolled prompting
Before native tool calling, you could build the same thing by hand: write instructions telling the model to return JSON with an action and arguments, parse the response, run the action, and feed the result into the next prompt. The loop is identical to the native one.
Native tool calling is that pattern, standardized, and only two of its differences are real:
- The model was fine-tuned on the provider's tool format, so it's more reliable at deciding when to call, extracting arguments, and chaining steps than it is at following instructions you wrote.
- The schema is machine-readable, so the inference stack can constrain decoding to it and guarantee valid JSON - though only if the serving layer turns that on.
Everything else is plumbing the API handles for you - the schema lives in a field instead of your prompt, the response is a typed block instead of text you parse, a stop reason flags the call, parallel calls arrive with IDs to match their results. None of it changes what the model can do.
What decides and what coordinates
Whether to call a tool, which one, and with what arguments is the model's judgment, not something the harness or server grants at runtime.
That judgment is built during fine-tuning, on examples that show tool definitions, when a call is warranted, how to fill in the arguments, how to read a result, and when to stop. A model with thin tool training still emits calls, but it misjudges when to call and fills arguments poorly, falling back on general instruction-following. Because the skill sits in the weights, the same harness gets better tool use just by swapping in a better-trained model.
Coordination is a separate matter, split between the server and the harness. The server only translates a single request: JSON into prompt, model output into a typed block. It holds no state and runs nothing.
The harness does the real orchestration. It executes the tools, runs the loop, handles errors, enforces permissions, and manages the message history across turns.
When the structure comes back invalid
A malformed tool call isn't always a training problem. Training is the cause when the model is small, old, or lightly tuned for tools and falls back to general instruction-following, but more often the cause is elsewhere.
The usual culprit is the inference stack. If the serving layer doesn't constrain decoding to the schema, nothing forces valid JSON even when the model knows the format, and many self-hosted setups skip this step.
Two other factors make it worse. High temperature pushes the model toward lower-probability tokens and raises the rate of malformed output on the same weights. Deeply nested schemas with many required fields degrade structural accuracy, even on a model that handles flat schemas without trouble.
So before blaming the model, lower the temperature and check whether the serving stack constrains decoding. If validity still doesn't improve, the weights are the bottleneck and the answer is a different model.