Error Recovery Starts with What Tools Return
The agent tried to fetch candidate 101. There was no candidate 101.
What's interesting isn't that it failed — it's why it thought 101 existed in the first place.
The previous article covered tool use — letting the model decide which functions to call and in what order. It worked well, but had a blind spot: when something went wrong, I could see what the model called but not why. An agent using the ReAct pattern fixes that. Before every tool call, the model writes a Thought: its current understanding and what it plans to do next. Those Thoughts are debugging logs written by the model itself — and they surfaced a failure I hadn't seen with plain tool use.
Halfway through a run, the agent searched for candidates matching a job description and got back two results:
[
{
"id": 10,
"name": "Candidate A",
...
},
{
"id": 14,
"name": "Candidate B",
...
}
]
The model doesn't parse JSON — it reads it as text. IDs 10 and 14 sit close together in adjacent objects, separated by nothing more than a closing } and an opening {. When the model composed its next action, it reconstructed the ID from memory and merged the two numbers: the digits of 10 followed by the first digit of 14, producing 101.
The agent called get_candidate(101). That candidate doesn't exist.
This is a known failure mode: digit concatenation from proximity in context. The fix is straightforward — label IDs unambiguously in tool output. candidate_10 or ID-10 breaks the digit-adjacency pattern. But the Thought trace is what told me which kind of error this was: the model mis-encoded an ID it actually saw — it didn't hallucinate one from nowhere. That's a transformation error, not a hallucination. The distinction matters because the fix is in the data format, not the prompt.
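A sketch of that data-format fix, assuming the search tool serializes its results before handing them back as the Observation (the helper name and field layout are illustrative, not the project's exact code):

import json

def format_search_results(candidates: list[dict]) -> str:
    # Prefix numeric IDs so adjacent digits can't merge when the model
    # reconstructs an ID from memory: "candidate_10" instead of 10.
    labeled = [{**c, "id": f"candidate_{c['id']}"} for c in candidates]
    return json.dumps(labeled, indent=2)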
The tool returned "Candidate ID 101 not found." — a plain string, not an exception. The agent's next Thought:
Thought: Candidate 101 was not found. Looking back at the search results,
the actual IDs are 10 (Candidate A) and 14 (Candidate B). Let me retrieve
candidate 10 first.
Action: get_candidate({"candidate_id": 10})
It recovered. Read the error, re-examined its previous observations, identified the mistake, and continued with the correct IDs. The rest of the run completed without issues.
Writing tests to validate this revealed exactly why the recovery worked — and it wasn't the model being clever. The recovery path in code:
get_candidate(101)
→ "Candidate ID 101 not found." # returns a string, not an exception
score_candidate(101, ...)
→ checks "not found" in result # guard prevents the sub-LLM call
→ returns error string without LLM call
loop receives observation string
→ appends to messages as Observation # loop never crashed
→ model reads it, writes next Thought
→ recovers and calls get_candidate(10)
Every failure path returns a string. The loop appends the error as an Observation — the result of the last action that the model reads before its next Thought. The model reads "not found", re-examines its earlier observations, and self-corrects. If get_candidate had raised a KeyError, the loop would have crashed — no recovery, no final recommendation. The guard in score_candidate is what prevented a wasted LLM call on a candidate that doesn't exist.
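A minimal sketch of that contract, with an in-memory CANDIDATES dict standing in for the real data source and run_scoring_llm as a placeholder for the scoring sub-agent described below:

import json

CANDIDATES = {10: {"name": "Candidate A"}, 14: {"name": "Candidate B"}}  # stand-in data

def get_candidate(candidate_id: int) -> str:
    # Return an error string the model can read as an Observation; never raise here.
    if candidate_id not in CANDIDATES:
        return f"Candidate ID {candidate_id} not found."
    return json.dumps(CANDIDATES[candidate_id])

def score_candidate(candidate_id: int, role_requirements: str) -> str:
    profile = get_candidate(candidate_id)
    if "not found" in profile:
        return profile  # guard: skip the scoring LLM call for a nonexistent candidate
    return run_scoring_llm(profile, role_requirements)  # placeholder, see the sub-agent section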
The Agent Spectrum
I've built three systems for the same screening pipeline, each at a different level of autonomy:
| System | Who Controls the Loop | Pattern |
|---|---|---|
| Original pipeline | Code — always runs agents 1 through 7 | Fixed pipeline |
| Tool-use sourcing | Model picks tools, loop runs until done | Reactive agent |
| ReAct agent | Model reasons, acts, observes, adapts | ReAct agent |
The difference is who controls the loop. In the pipeline, code decides. In the agent, the model decides. The previous article showed how tool use lets the model choose which actions to take. ReAct adds reasoning on top — making those decisions explicit, traceable, and debuggable. It's the difference between seeing what the model did and understanding why it did it.
ReAct: Reason + Act
ReAct forces the model to write a Thought before every Action. This isn't just a formatting convention — it fundamentally changes how the agent works.
Thought: I found 2 candidates via keyword search. Candidate A looks strong — fintech
background, all 4 required skills. Let me score them before Candidate B.
Action: score_candidate({"candidate_id": 10, "role_requirements": "..."})
Observation: {"technical": 0.92, "experience": 0.82, "overall": 0.88}
Thought: Candidate A scores 0.88 — very strong. Now I need to score Candidate B
before I can compare and give a final recommendation.
Action: score_candidate({"candidate_id": 14, "role_requirements": "..."})
Without Thoughts, the model jumps straight to actions. I saw this in an early run — the model called search_candidates three times with overlapping queries because it didn't reason about what the first result already told it.
With Thoughts, the model acknowledges what it learned from the last Observation, states its current hypothesis, and identifies the minimum next action needed. Redundant tool calls — the kind I saw with plain tool use — stopped appearing once Thoughts were in place.
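The format itself comes from the prompt. A sketch of the kind of system prompt that elicits it (the wording here is illustrative, not the exact prompt used in this project):

REACT_SYSTEM_PROMPT = """You screen candidates against a job description using tools.
Follow this format exactly:

Thought: what the last Observation told you and the minimum next step.
Action: tool_name({"arg": "value"})
Observation: the tool result (written by the system, never by you)

Repeat Thought / Action until you can conclude, then finish with:
Answer: your final recommendation.

Tools: search_candidates, get_candidate, score_candidate."""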
The stop Sequence Prevents Hallucinated Results
Without a stop sequence, the model simulates tool results instead of waiting for execution — writing what it thinks the tool would return rather than letting your code run it.
response = client.chat.completions.create(
    model=MODEL,               # whichever chat model the agent runs on
    messages=messages,
    max_tokens=400,            # low cap for intermediate steps
    stop=["Observation:"],     # model stops after Action, your code runs the tool
)
The stop sequence forces generation to halt at exactly the right point. Also keep max_tokens low (400) for intermediate steps — too high and the model front-runs the loop, hallucinating observations before your code runs the tool. For runs that take several minutes, use stream=True — API gateways often have timeout limits on non-streaming requests.
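Putting the pieces together, a minimal sketch of the loop. It assumes tools is a dict mapping tool names to Python functions that return strings; the regex, argument parsing, and step budget are illustrative rather than this project's exact implementation:

import json
import re

MAX_STEPS = 8  # step limit discussed below; the exact value is a project choice

def run_agent(client, model: str, messages: list[dict], tools: dict) -> str:
    for _ in range(MAX_STEPS):
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=400,
            stop=["Observation:"],   # generation halts right after the Action
        )
        text = response.choices[0].message.content
        messages.append({"role": "assistant", "content": text})

        # Final answer: nothing left to execute.
        answer = re.search(r"Answer:\s*(.*)", text, re.DOTALL)
        if answer:
            return answer.group(1).strip()

        # Extract the Action and run the tool; every failure path returns a string.
        action = re.search(r"Action:\s*(\w+)\((.*)\)", text, re.DOTALL)
        if not action:
            observation = "Could not parse an Action from your last message."
        else:
            name, args = action.group(1), json.loads(action.group(2))
            observation = tools[name](**args)

        # The error or result goes back in as the Observation the model reads next.
        messages.append({"role": "user", "content": f"Observation: {observation}"})

    return "Stopped: reached MAX_STEPS without a final Answer."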
The Sub-Agent Pattern
score_candidate isn't a simple function — it contains its own LLM call. The outer ReAct agent calls a tool that internally runs a separate model invocation at temperature=0 with a specialized scoring prompt.
This is how complex agent systems are built: agents calling agents, each with a narrow specialization. The outer agent coordinates strategy. The inner agents handle specific tasks. The outer agent doesn't need to know how scoring works — it just calls the tool and reasons about the result. The pattern is especially useful when a subtask needs deterministic behavior (temperature=0) inside a broader exploratory workflow.
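A sketch of what that inner call could look like, filling in the run_scoring_llm placeholder from the earlier sketch (SCORING_PROMPT and MODEL are hypothetical constants, not names from this project):

def run_scoring_llm(profile: str, role_requirements: str) -> str:
    # A separate model invocation with its own narrow prompt, run deterministically.
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0,  # deterministic scoring inside an exploratory outer loop
        messages=[
            {"role": "system", "content": SCORING_PROMPT},
            {"role": "user", "content": f"Role requirements:\n{role_requirements}\n\nCandidate profile:\n{profile}"},
        ],
    )
    return response.choices[0].message.content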
When ReAct Agents Fail
The Candidate 101 recovery is the success story. But ReAct has failure modes that the Thought traces make visible without necessarily preventing:
Plausible reasoning, wrong conclusion. The model can write a Thought that reads perfectly — "This candidate has strong Python experience, let me score them" — and then pass the wrong candidate ID. The reasoning looks correct, but the action is wrong. Thought traces make this diagnosable, but they don't prevent it. You still need assertions on tool inputs and output validation.
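One shape that guard can take, as a sketch: check the model's arguments against IDs the agent has actually seen before executing the tool. Here seen_candidate_ids is a hypothetical set the loop would maintain from earlier search observations.

def validate_action(name: str, args: dict, seen_candidate_ids: set[int]) -> str | None:
    # Return an error string to use as the Observation, or None if the action looks sane.
    if name in {"get_candidate", "score_candidate"}:
        cid = args.get("candidate_id")
        if cid not in seen_candidate_ids:
            return (f"candidate_id {cid} does not match any ID returned by search. "
                    f"Known IDs: {sorted(seen_candidate_ids)}")
    return None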
Loops without convergence. Without MAX_STEPS, the agent can enter cycles — searching, finding the same results, reasoning that it needs more data, searching again. I hit this when the candidate pool was too small for the JD: the agent kept calling semantic search with slightly different queries, getting the same 3 candidates each time, and reasoning that it should try harder. The model doesn't know it's stuck — it thinks it's being thorough. A step limit kills the loop, but the agent produces no useful output when it does.
Accumulated context degrades quality. Each step appends Thought + Action + Observation to the message history. By step 5 or 6, the context is long enough that the model starts losing track of earlier observations. I saw the agent re-request a candidate profile it had already retrieved two steps earlier — the information was in the context, but the model didn't reference it.
Cost scales with reasoning depth. Each Thought is a separate LLM call. A 6-step agent makes 6 calls where a fixed pipeline might make 3. The reasoning overhead is real: this run took about 45 seconds across 6 steps, compared to roughly 20 seconds for the equivalent fixed pipeline steps. For high-throughput systems processing hundreds of inputs per hour, that 2x latency multiplier matters.
ReAct works not because the model becomes more accurate, but because it externalizes its reasoning. Once reasoning is visible, you can shape it — through prompts, tool design, and output constraints. That's the actual value: debuggability, not correctness. Mistakes become easier to diagnose, not less likely. The value proposition is strongest when the cost of a wrong decision justifies the debugging investment and the volume is low enough to absorb the latency overhead.
When to Use Each Pattern
| Aspect | Tool Use | ReAct |
|---|---|---|
| Reasoning | Implicit — model reasons internally | Explicit — Thought written before every step |
| API format | Structured tool_calls field | Plain text with Thought/Action/Answer pattern |
| Parsing | Clean — SDK provides function name and args | Regex extraction from text |
| Debugging | Hard — you see calls but not reasoning | Easy — Thought traces explain every decision |
| Multi-step accuracy | Good | Better — explicit reasoning prevents redundant calls |
| Pattern | Use When |
|---|---|
| Fixed pipeline | Steps are always the same and known upfront |
| Reactive agent (tool use) | Steps vary by input but the goal is clear |
| ReAct agent | Complex multi-step reasoning where adaptation and debugging matter |
For single-step tasks — classifying a prompt, extracting structured data, scoring one candidate — ReAct adds overhead without benefit. The model either gets it right or it doesn't, and a Thought step won't change the outcome. ReAct earns its cost when multi-step reasoning and error recovery matter, and when diagnosing why something went wrong is worth the extra latency.
Takeaways
- Never throw exceptions inside an agent loop — return structured error strings the model can reason about. A KeyError crashes the loop; "Candidate 101 not found" gives the model something to work with.
- stop=["Observation:"] is the most important single line in a ReAct implementation. Without it, the model simulates tool results instead of waiting for real execution.
- Never expose raw numeric IDs to the model — prefix them (candidate_10) to avoid digit-merging bugs. Adjacency in JSON output is enough to corrupt ID reconstruction.
- When an agent makes a wrong decision, the Thought tells you exactly what reasoning led there. That's what makes ReAct debuggable — fix the tool description or system prompt that caused it, not the model.