Five Manual Runs Looked Fine
In the first article, I audited three AI systems and found five gaps I'd been working around. The biggest was evaluation — none of my pipelines had tests. I decided to start with the screening pipeline — a chain of LLM-powered agents that parse job descriptions, filter candidates, score them, and audit the results. Zero tests. I'd run it five times manually, eyeballed the results, and called it done.
Then I wrote 36 automated tests. The first run showed the system was inconsistent, biased, and randomly deciding whether humans should review results.
None of this was visible from five manual runs.
Three Layers of LLM Evaluation
The framework I landed on has three layers, each catching different classes of problems:
| Layer | What It Tests | LLM Calls | Speed |
|---|---|---|---|
| Deterministic | JSON parsing, prompt rendering, data contracts, file structure | None | under 1s |
| Golden | Real LLM output compared against known-good expected results | Yes | 3-15 min |
| Calibration | Do LLM scores correlate with human judgment? | Yes + human | Minutes + review |
Most AI systems fail because they skip deterministic tests and jump straight to evaluating model output.
Layer 1 is table stakes — 24 deterministic tests that catch structural bugs without touching the LLM. Does the job description parser return valid JSON? Does the prompt sent to the model render without missing variables? Does the candidate data file have the expected fields?
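Here is a minimal sketch of what that layer can look like as pytest tests. The module names, fixture paths, and field names are assumptions, not the pipeline's actual code:

```python
# Layer 1: deterministic tests. No LLM calls, sub-second.
# `parse_job_description`, `render_screening_prompt`, and the fixture files
# are hypothetical stand-ins for the pipeline's real modules and data.
import json
from pathlib import Path

from pipeline.parsing import parse_job_description    # assumed module
from pipeline.prompts import render_screening_prompt  # assumed module


def test_jd_parser_returns_valid_json():
    raw = Path("fixtures/sample_jd.txt").read_text()
    parsed = parse_job_description(raw)
    # Round-tripping proves the structure is plain JSON-compatible data.
    assert json.loads(json.dumps(parsed))


def test_prompt_renders_without_missing_variables():
    prompt = render_screening_prompt(job_title="Data Engineer", candidate_name="A. Example")
    # An unrendered template leaves "{placeholder}" markers behind.
    assert "{" not in prompt and "}" not in prompt


def test_candidate_file_has_expected_fields():
    candidates = json.loads(Path("fixtures/candidates.json").read_text())
    for candidate in candidates:
        assert {"name", "years_experience", "skills"} <= candidate.keys()
```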
Layer 2 is where you learn what your system actually does. 12 golden tests that run the real pipeline against real LLM calls and compare outputs to expected behavior.
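A golden test in this spirit might look like the sketch below. The `run_screening_pipeline` entry point, fixture files, and expected names are all assumptions:

```python
# Layer 2: golden test. Runs the real pipeline (real LLM calls) against a fixed
# job description and candidate pool, then asserts on behaviour, not exact strings.
import json
from pathlib import Path

from pipeline import run_screening_pipeline  # assumed entry point


def test_shortlist_matches_golden_expectations():
    jd = Path("fixtures/golden/jd_data_engineer.txt").read_text()
    pool = json.loads(Path("fixtures/golden/candidate_pool.json").read_text())
    result = run_screening_pipeline(jd, pool)

    shortlisted = {c["name"] for c in result["shortlist"]}
    expected = {"Candidate A", "Candidate B", "Candidate C"}

    # Behavioural assertions: who made the cut, and whether scores are in range.
    # Byte-for-byte output comparison breaks on free-form LLM text.
    assert shortlisted == expected
    assert all(0.0 <= c["score"] <= 1.0 for c in result["shortlist"])
```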
Layer 3 is the long game — correlating LLM scores with human recruiter rankings. I haven't built this yet, but the golden tests already surfaced enough problems to keep me busy.
Five Findings From Real Runs
Finding 1: Sourcing Is Non-Deterministic
- Run 1: sourced 6 candidates
- Run 2: same JD, same candidate pool — sourced 4
Two of the original six disappeared — only 67% overlap between identical runs, measured by Jaccard similarity.
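The overlap figure is a plain set comparison. A small sketch, with illustrative candidate IDs standing in for the real pool:

```python
# Jaccard similarity between two sourcing runs on identical input.
# Candidate IDs are illustrative, not the article's real data.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0


run_1 = {"c01", "c02", "c03", "c04", "c05", "c06"}  # 6 candidates sourced
run_2 = {"c01", "c02", "c03", "c04"}                # 4 candidates, same input

print(jaccard(run_1, run_2))  # 0.666..., the 67% overlap between identical runs
```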
Root cause: temperature was set to 0.7 on the sourcing agent — the LLM was being creative about a task that should produce the same result every time.
In production, this means qualified candidates silently drop out of the pipeline between runs — not because they were rejected, but because the sourcing step randomly excluded them. At 500 candidates, you'd never notice the inconsistency without a fixed test harness.
The fix is straightforward — lower sourcing temperature to 0. Sourcing is a filter, not a creative task. But the more important lesson is that a single global temperature across agents with fundamentally different jobs (filtering vs. scoring vs. auditing) is a design mistake.
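One way to express that decision in code is a per-agent configuration table instead of a single global setting. The agent names and values below are illustrative:

```python
# Per-agent temperature instead of one global setting. Agent names and values
# are illustrative; the point is that filtering and auditing want determinism,
# while generation-style tasks can tolerate some randomness.
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    name: str
    temperature: float


AGENT_CONFIGS = {
    "sourcing":   AgentConfig("sourcing",   temperature=0.0),  # filter: deterministic
    "screening":  AgentConfig("screening",  temperature=0.0),  # scoring: deterministic
    "validation": AgentConfig("validation", temperature=0.0),  # audit: deterministic
    "outreach":   AgentConfig("outreach",   temperature=0.7),  # drafting: creativity is fine
}
```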
Finding 2: The Most Experienced Candidate Scored Lowest
The candidate I expected to rank first — 9 years as a Principal Engineer, processed 1B+ events/day at scale — scored lowest in every run. The scores were stable. The system was consistently wrong.
In a manual review, each score looks plausible on its own. You'd only catch this by comparing scores across candidates against their profiles — which is exactly what the golden test did. It asserted that the most experienced candidate should not score below less experienced ones. That assertion failed immediately.
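The failing assertion was essentially a relative-ranking check. A sketch of it, where the result structure and all scores except the pre-fix 0.62 are illustrative:

```python
# Relative-ranking check: the most experienced candidate must not rank last.
# The shortlist structure and names are hypothetical stand-ins.
def assert_experience_ranking(shortlist: list[dict]) -> None:
    most_experienced = max(shortlist, key=lambda c: c["years_experience"])
    others = [c["score"] for c in shortlist if c is not most_experienced]
    assert most_experienced["score"] >= min(others), (
        "most experienced candidate scored below every less experienced one"
    )


# Output shaped like the pre-fix runs trips the assertion immediately:
pre_fix = [
    {"name": "A", "years_experience": 9, "score": 0.62},  # Principal Engineer
    {"name": "B", "years_experience": 4, "score": 0.81},
    {"name": "C", "years_experience": 6, "score": 0.77},
]
# assert_experience_ranking(pre_fix)  # raises AssertionError
```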
But the test only told me that the ranking was wrong, not why. For that, the pipeline has a built-in second pass: the validation agent reviews the screening agent's output specifically looking for scoring inconsistencies, and produces explicit warnings:
```json
{
  "bias_warnings": [
    "Salary expectation ($220k) penalized disproportionately for senior candidate despite being within stated range",
    "Culture score assigned without documented rubric — criteria appear invented per candidate",
    "Company prestige framing inconsistent — similar employers rated differently across candidates"
  ]
}
```

The test caught the problem. The validation agent explained it. Neither would have surfaced from five manual runs.
Finding 3: Confidence Score Swings Across the Decision Threshold
The validation agent also produces a confidence score after reviewing the full shortlist — a single number representing how reliable it considers the screening output. If confidence is above 0.80, the system auto-approves and moves to outreach. Below 0.80, a human reviewer is pulled in.
```jsonc
// Run 1
{ "confidence": 0.82, "human_review_required": false }

// Run 2 — same JD, similar outputs
{ "confidence": 0.74, "human_review_required": true }
```

Whether a human spends 30 minutes reviewing a shortlist depends on a number generated at temperature=0.7. In borderline cases, that's a coin flip controlling whether a human is in the loop at all.
Root cause: the validation agent generates confidence as a free-form creative output — a subjective judgment at non-zero temperature. The fix is the same pattern as Finding 1: lower validation temperature to 0 and replace free estimation with a structured deduction rubric.
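A structured deduction rubric can be as simple as starting at 1.0 and subtracting fixed penalties for concrete findings, so the number is reproducible. The penalty names and weights below are illustrative, not the pipeline's actual rubric:

```python
# Confidence as a deterministic deduction rubric instead of a free-form
# LLM estimate. Penalty weights are illustrative assumptions.
REVIEW_THRESHOLD = 0.80

PENALTIES = {
    "bias_warning":    0.10,  # each substantiated bias flag
    "missing_rubric":  0.05,  # a score assigned without documented criteria
    "ranking_anomaly": 0.15,  # experience/score ordering violated
}


def confidence_from_findings(findings: list[str]) -> float:
    score = 1.0
    for finding in findings:
        score -= PENALTIES.get(finding, 0.0)
    return max(score, 0.0)


findings = ["bias_warning", "missing_rubric"]
confidence = confidence_from_findings(findings)        # 0.85
human_review_required = confidence < REVIEW_THRESHOLD  # False, and repeatably so
```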
Finding 4: The Bias Auditor Actually Works
Finding 2 showed the validation agent catching the senior candidate's under-scoring. But I needed to know if those flags were reliable or just noise. I reviewed every flag across all runs against what a careful human reviewer would catch — four flags raised, all real problems, zero false alarms, nothing missed. Perfect precision and recall on a small set. Both will drop at scale, but every flag was substantiated.
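With four flags the arithmetic is trivial, but spelling it out keeps the definitions honest:

```python
# Precision/recall of the validation agent's bias flags against a human review.
true_positives = 4   # flags raised that a careful human reviewer confirmed
false_positives = 0  # flags raised that turned out to be noise
false_negatives = 0  # real problems the agent missed

precision = true_positives / (true_positives + false_positives)  # 1.0
recall = true_positives / (true_positives + false_negatives)     # 1.0
```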
The pattern — a second LLM call that audits the first call's output — is the most underrated tool in this pipeline. The screening agent can't notice its own bias. The validation agent, reviewing the scores specifically looking for bias, can.
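A minimal sketch of that two-pass shape, with the LLM client passed in as a plain callable because the pipeline's real interfaces aren't shown here:

```python
# Two-pass pattern: one call scores, a second call audits the first call's
# output for bias. `llm` is any callable that takes a prompt and returns text.
import json
from typing import Callable


def screen(llm: Callable[[str], str], jd: str, candidates: list[dict]) -> dict:
    prompt = f"Score each candidate against this job description.\nJD:\n{jd}\nCandidates:\n{json.dumps(candidates)}"
    return json.loads(llm(prompt))


def audit(llm: Callable[[str], str], jd: str, scores: dict) -> dict:
    prompt = (
        "Review these candidate scores for inconsistent or biased reasoning. "
        'Return JSON: {"bias_warnings": [...], "confidence": <number 0-1>}.\n'
        f"JD:\n{jd}\nScores:\n{json.dumps(scores)}"
    )
    return json.loads(llm(prompt))
```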
Finding 5: Zero Temperature Doesn't Guarantee Determinism
Setting sourcing temperature to 0 reduced variance significantly but didn't eliminate it.
The reason: many LLM APIs distribute requests across multiple copies of the same model running on different servers. Even with temperature at 0, identical requests routed to different copies can produce slightly different outputs. For borderline candidates near the sourcing threshold, this is enough to make them appear or disappear between runs.
Byte-for-byte identical input — same candidate data, same job description — is a prerequisite for perfect consistency, not a guarantee of it: even then, replica routing can introduce drift. In production, where inputs also vary naturally between runs, some of the inconsistency you observe comes from the infrastructure, not the model.
The practical implication: rule-based filtering (not LLM) is the only way to get perfectly consistent sourcing. If consistency matters more than generalization, move the filter logic out of the LLM call entirely.
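If consistency wins, the filter becomes ordinary code. A sketch of rule-based sourcing, with illustrative field names and thresholds:

```python
# Rule-based sourcing filter: same input, same output, every time.
# Field names and thresholds are illustrative assumptions.
def source_candidates(candidates: list[dict], required_skills: set[str], min_years: int) -> list[dict]:
    return [
        c for c in candidates
        if c["years_experience"] >= min_years
        and required_skills <= set(c["skills"])
    ]


pool = [
    {"name": "A", "years_experience": 9, "skills": ["python", "spark", "kafka"]},
    {"name": "B", "years_experience": 2, "skills": ["python"]},
]
print(source_candidates(pool, required_skills={"python", "spark"}, min_years=3))
# Always candidate A alone: no temperature, no replicas, no drift.
```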
The Fixes
None of these were model problems. They were system design problems — temperature settings, prompt wording, missing rubrics. The model did exactly what the system told it to.
Did the Fixes Work?
Same candidate. Same JD. The most experienced candidate's score moved from 0.62 to 0.95 — lowest to highest. Not because the model changed — because the system did.
Fixes Without Tests Are Promises
The temperature change and prompt rules work now. They'll also silently break the moment someone reverts a config value or edits a prompt without knowing what it affects. Each fix now has a corresponding deterministic test — the temperature overrides are verified by asserting agent init values, the prompt rules are tested by checking prompt content directly. Those tests run in 0.36 seconds with no LLM calls. If either fix is reverted, CI fails before any model is touched.
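Those guard tests might look like the sketch below. The agent classes, attribute names, and prompt phrasing are assumptions about the pipeline's internals:

```python
# Deterministic regression guards for the fixes. No LLM calls: CI fails in
# milliseconds if someone reverts the config or edits the prompt.
# `SourcingAgent`, `ValidationAgent`, and `load_prompt` are hypothetical.
from pipeline.agents import SourcingAgent, ValidationAgent  # assumed modules
from pipeline.prompts import load_prompt                     # assumed loader


def test_sourcing_agent_temperature_is_zero():
    assert SourcingAgent().temperature == 0.0


def test_validation_agent_temperature_is_zero():
    assert ValidationAgent().temperature == 0.0


def test_screening_prompt_contains_scoring_rubric():
    prompt = load_prompt("screening")
    # The prompt rule added in the fix: scores must cite a documented rubric.
    assert "rubric" in prompt.lower()
```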
Takeaways
- Start with deterministic tests. 24 tests, no LLM calls, under a second. They caught 60% of regressions — prompt template errors, schema violations, missing fields — before any model was involved.
- Match each metric to a failure mode. Consistency checks caught sourcing drift. Precision/recall validated the validation agent. A single "accuracy" number would have hidden both.
- temperature=0 reduces variance but doesn't eliminate it. LLM APIs can load-balance across replicas, so identical requests don't always produce identical outputs. Know where your non-determinism comes from before assuming it's the model.
- Every fix needs a corresponding test. The temperature overrides and prompt rules work today. Without deterministic tests asserting agent init values and prompt content, they'll silently break the moment someone edits a config.
If your LLM system "works" without tests, it doesn't work — you just haven't measured where it fails yet.