The Model Combined Two Searches on Its Own
A tool is a function you expose to the model. You define what it does (via a name and description) and what arguments it accepts (via a JSON schema). The model doesn't execute the function — it reads the definition, decides whether to call it, and returns the function name and arguments. Your code runs the function and sends the result back. A tool call is one of these model-requested function executions.
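In the OpenAI-style function-calling format, a tool definition and the model's side of a call look roughly like this. A minimal sketch: the schema details and the example ID are illustrative, not the agent's actual code.

```python
# One tool definition: a name, a description, and a JSON schema for the arguments.
# The model never runs this function -- it only reads this JSON.
detail_tool = {
    "type": "function",
    "function": {
        "name": "get_candidate_detail",
        "description": "Return the full stored profile for one candidate by ID.",
        "parameters": {
            "type": "object",
            "properties": {"candidate_id": {"type": "string"}},
            "required": ["candidate_id"],
        },
    },
}

# When the model decides to use it, its response contains a tool call like:
#   function.name      -> "get_candidate_detail"
#   function.arguments -> '{"candidate_id": "c-042"}'
# Your code parses the arguments, runs the real function, and sends the result
# back as the next message in the conversation.
```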
I built a sourcing agent with four tools — keyword search, semantic search, candidate detail lookup, and a signal to flag when the candidate pool is too thin. No instructions on which tools to call or in what order. Just tool definitions and a job description.
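Wired up, that amounts to handing the model the four definitions plus the job description and nothing else. A sketch under those assumptions: the helper, the schemas, the model name, and all description wording except the keyword-search one (quoted later) are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def tool(name, description, properties, required):
    """Build one OpenAI-format tool definition."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }

TOOLS = [
    tool("search_candidates_by_skills",
         "Search the candidate pool by required skills. Returns candidates whose "
         "skill list contains ALL of the specified skills. Use this for precise "
         "technical stack matching.",
         {"skills": {"type": "array", "items": {"type": "string"}}}, ["skills"]),
    tool("search_candidates_semantic",
         "Find candidates whose overall profile is similar to a free-text query. "
         "Use this when keyword search is too narrow or returns too few results.",
         {"query": {"type": "string"}}, ["query"]),
    tool("get_candidate_detail",
         "Return the full profile for one candidate. Use this only to verify "
         "ambiguous or borderline candidates.",
         {"candidate_id": {"type": "string"}}, ["candidate_id"]),
    tool("flag_sourcing_gap",
         "Signal that the candidate pool is too thin for this role.",
         {"reason": {"type": "string"}}, ["reason"]),
]

job_description = "Backend engineer: Python, PostgreSQL, AWS, Docker, 5+ years."

# The entire request: tool definitions plus the job description.
# No required order, no scripted sequence -- the model chooses.
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder for any tool-capable model
    messages=[{"role": "user", "content": job_description}],
    tools=TOOLS,
)
```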
The model made four calls autonomously:
| Step | Tool Called | Why | Result |
|---|---|---|---|
| 1 | search_candidates_by_skills | Precise stack matching (Python, PostgreSQL, AWS, Docker, 5+ yr) | 2 candidates |
| 2 | search_candidates_semantic | Keyword results were thin — broaden with semantic search | 8 candidates (overlaps + new) |
| 3 | get_candidate_detail | Verify Principal Engineer's Docker gap | Kubernetes covers Docker — credited |
| 4 | get_candidate_detail | Verify a borderline candidate's fit | Confirmed as weak match |
Final shortlist: 4 candidates sourced — two exact skill matches (0.95, 0.93), one credited via Kubernetes-Docker overlap (0.82), one borderline (0.61).
Two things worth noting: the model excluded the junior frontend developer without being told to, and credited the Principal Engineer's Kubernetes experience against the Docker requirement after verifying via detail lookup. I also tested a niche JD (Ray, Triton Inference Server, CUDA) where keyword search returned zero hits — it extended to four steps: keyword search → semantic search → re-search with relaxed parameters → flag_sourcing_gap. Same code, different strategies.
| | Who decides | Sequence | Adapts to input? |
|---|---|---|---|
| Pipeline | Code | Step 1 → 2 → 3 → ... → N | No — same every time |
| Tool use | Model | Code offers tools → Model picks → Code executes | Yes — strategy varies per input |
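In code, the pipeline row is just a hard-coded sequence. A minimal sketch, assuming helper names that are not from the project:

```python
def source_candidates_pipeline(job_description: str) -> list[dict]:
    """Fixed pipeline: the code decides the sequence -- the same one every run."""
    required = extract_required_skills(job_description)     # step 1
    exact = search_candidates_by_skills(required)            # step 2
    broad = search_candidates_semantic(job_description)      # step 3
    return rank_and_dedupe(exact + broad)                    # step 4
```

Tool use keeps the same capabilities but hands the sequencing decision to the model: the code exposes them as tools and executes whichever ones the model asks for, in whatever order it asks.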
The Agentic Loop
The agentic loop is the mechanism that makes tool use work:
One detail the diagram doesn't show: the loop needs a `MAX_TOOL_CALLS` safety limit.¹ Without it, the model can loop indefinitely if it never converges on a final answer.
Tool Descriptions Are Decision-Making Instructions
This was the most important finding. The description field in a tool definition isn't documentation for humans — it's an instruction the model reads when deciding which tool to call and when to stop calling. Every word in the description shapes the model's strategy the same way a system prompt shapes a chatbot's behavior, except here it controls tool selection instead of conversation tone.
```python
# Bad: model doesn't know when to use this vs. semantic search
"description": "Search candidates"

# Good: model knows the precise use case and when to prefer alternatives
"description": (
    "Search the candidate pool by required skills. Returns candidates "
    "whose skill list contains ALL of the specified skills. Use this "
    "for precise technical stack matching."
)
```

I assumed the phrase "Use this for precise technical stack matching" was what caused the model to call keyword search first. So I tested it — swapped to vague descriptions ("Search candidates", "Find candidates using AI") and ran the agent 3 times. The model still called keyword search first every time. Routing order didn't change. What changed was stopping behavior:
| | Specific descriptions | Vague descriptions |
|---|---|---|
| Routing order | Keyword first | Keyword first (same) |
| Detail lookups | 2.0 avg | 5.3 avg (2.6x more) |
| Total tool calls | 4.3 avg | 8.0 avg (1.9x more) |
The model didn't know when to stop examining candidates because the vague descriptions gave no implicit stopping guidance. It verified every candidate it found — even clearly strong matches.
Where Tool Use Breaks Down
Over-calling tools. Without stopping criteria, the model verified every candidate it found. Adding one sentence to the system prompt made a measurable difference:
| | Without stopping criteria | With stopping criteria |
|---|---|---|
| Detail lookups | 6.3 avg | 3.0 avg (52% fewer) |
| Total tool calls | 8.3 avg | 4.7 avg (43% fewer) |
The sentence: "Use get_candidate_detail ONLY for ambiguous borderline candidates. Stop once you have 3-5 confident candidates." Each unnecessary call adds latency and cost — the difference between a 30-second sourcing step and a minute-plus one. The token overhead compounds it: tool definitions are sent on every iteration of the agentic loop, not once per request.
| Tools sent | Extra input tokens | At 10k requests/day (4 iterations each) |
|---|---|---|
| 1 tool | ~637 | ~25M tokens/day |
| 3 tools | ~900 | ~36M tokens/day |
| 4 tools | ~1,200 | ~48M tokens/day |
The first tool is expensive (~637 tokens) because the API injects scaffolding. Each additional tool adds ~240–290 tokens. A token/cost budget per request — short-circuiting the loop when exhausted and returning the best result so far — is the right fix.
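A sketch of what that budget check could look like inside the agentic loop, assuming the footnote's loop structure and the OpenAI usage fields; the budget value, model name, and helpers (`parse_result`, `finalize_with_current_evidence`, `execute_and_append`) are illustrative:

```python
TOKEN_BUDGET = 20_000   # illustrative per-request budget; tune for your workload
MAX_TOOL_CALLS = 8

def run_agent(messages: list, tools: list):
    tokens_used = 0
    for _ in range(MAX_TOOL_CALLS):
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model
            messages=messages,
            tools=tools,
            temperature=0,
        )
        tokens_used += response.usage.total_tokens   # prompt + completion this round
        msg = response.choices[0].message

        if not msg.tool_calls:
            return parse_result(msg.content)          # model finished on its own

        if tokens_used > TOKEN_BUDGET:
            # Budget exhausted: short-circuit instead of paying for another round
            # of tool definitions, and return the best result assembled so far.
            return finalize_with_current_evidence(messages)

        execute_and_append(msg, messages)             # run tools, append results
    return finalize_with_current_evidence(messages)
```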
No fallback on failure. If all tools return empty results, the model has no deterministic path to fall back on. In the fixed pipeline, sourcing always produces output — even if it's bad output. Tool use can produce nothing, which means downstream agents have nothing to work with. The fix: if the tool-use agent returns empty results or exceeds its cost budget, fall back to the fixed pipeline. The pipeline is less adaptive but always produces output — a reliable floor beneath the flexible ceiling.
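The floor itself is a thin wrapper: try the tool-use agent first, and if it comes back empty (or blew its budget without producing anything usable), run the fixed pipeline sketched earlier. A sketch with assumed names:

```python
def source_candidates(job_description: str) -> list[dict]:
    shortlist = run_agent(build_messages(job_description), TOOLS)

    if shortlist:                       # the agent converged within its budget
        return shortlist

    # Fallback: the fixed pipeline is less adaptive, but it always returns
    # something downstream agents can work with.
    return source_candidates_pipeline(job_description)
```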
Takeaways
- Write tool descriptions like you're briefing a colleague — specific about what the tool does, when to use it, and when to prefer alternatives. Vagueness doesn't break routing, it breaks stopping.
- Use `temperature=0` for tool selection — it stabilizes tool order but not tool count. For fully deterministic behavior, combine it with explicit stopping criteria in the system prompt.
- Keep fixed pipelines for steps that always run the same way. Tool use shines when the optimal sequence genuinely varies by input. The practical pattern is mixing both — fixed pipeline for the overall flow, tool use for the adaptive steps.
Footnotes
1. The agentic loop in code. The loop sends messages to the model, checks for tool calls, executes them, and repeats until the model returns a final answer. `MAX_TOOL_CALLS` caps the iterations to prevent runaway loops:

    ```python
    messages = [system_msg, user_msg]

    for _ in range(MAX_TOOL_CALLS):
        response = client.chat.completions.create(
            messages=messages,
            tools=TOOLS,
            temperature=0,
        )
        msg = response.choices[0].message

        if not msg.tool_calls:
            return parse_result(msg.content)  # done

        messages.append(msg)
        for tc in msg.tool_calls:
            result = execute_tool(tc.function.name, json.loads(tc.function.arguments))
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": f"Tool: {tc.function.name}\nResult:\n{result}",
            })
    ```

    Each iteration appends the model's response and tool results to `messages`, so the model sees the full conversation history — including its own prior tool calls and their results — on every round-trip. ↩