The Patterns Already Had Names
Three months into building AI-powered applications — a UI generator, an on-device SDK integration, an automated screening pipeline — I sat down to map what I'd actually built. Not the features. The architecture decisions.
Selective context injection instead of RAG. Temperature tuning per phase. LLM-as-judge for bias auditing. Sequential multi-agent pipelines with shared mutable context. I'd shipped all of these by solving one concrete problem at a time. What I hadn't done was name them, evaluate whether they were the right call, or identify where I'd been working around gaps instead of filling them.
This is the first in a series on deepening AI engineering skills by building on real systems.
Three Projects, One Pattern
I catalogued every AI-related decision across the three projects. The same engineering pattern showed up every time: start with the simplest approach, use rules where possible, and call the LLM only where generative power is actually needed.
Project 1: Design System UI Generator
A system where users type plain English and get live HTML mockups matching a specific design system.
| Decision | What I Did | Why |
|---|---|---|
| Context strategy | Selective injection, not RAG | Design system is 18 components — small and stable enough to filter, not search |
| Intent detection | Keyword map, zero LLM calls | 7 screen types, deterministic lookup — saves ~400ms and one API call per request |
| Temperature | 0 for classification, 0.3 for generation | Classification needs consistency, generation needs creativity |
| Token budget | Two-phase generation | Classify first (cheap), then inject only relevant patterns — 44% fewer tokens per request, cutting inference cost and keeping generation under 4 seconds for real-time UI preview (see the sketch after this table) |
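The two-phase flow is easier to see as code than as a table. Here's a minimal sketch, where `KEYWORD_MAP`, `COMPONENT_PATTERNS`, and the prompt wording are illustrative stand-ins for the real design system tables:

```python
KEYWORD_MAP = {
    "login": "auth_screen",
    "sign in": "auth_screen",
    "dashboard": "dashboard",
    "settings": "settings",
    # ...one entry per phrase, covering all 7 screen types
}

COMPONENT_PATTERNS = {
    "auth_screen": ["TextField", "PrimaryButton", "Card"],
    "dashboard": ["NavBar", "StatCard", "Chart"],
    "settings": ["NavBar", "ListRow", "Toggle"],
}

def classify(request: str) -> str | None:
    """Phase 1: deterministic lookup, zero LLM calls."""
    text = request.lower()
    for keyword, screen_type in KEYWORD_MAP.items():
        if keyword in text:
            return screen_type
    return None  # unknown intent: fall back to an LLM call at temperature 0

def build_generation_prompt(request: str, screen_type: str) -> str:
    """Phase 2: inject only the patterns this screen type needs."""
    components = ", ".join(COMPONENT_PATTERNS[screen_type])
    return (
        f"Generate HTML for a {screen_type} using only these components: "
        f"{components}.\nUser request: {request}"
    )
```

Only phase 2 touches the model (at temperature 0.3); everything before it is a dictionary lookup.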
Project 2: Apple FM SDK Integration
On-device inference with Apple's Foundation Models SDK. The positioning decision was the interesting one: Apple FM handles classification and privacy-sensitive tasks, cloud models handle reasoning.
| Decision | What I Did | Why |
|---|---|---|
| Model selection | Apple FM for classification, Claude for reasoning | Binary classification (PII detection) doesn't need Opus-level reasoning |
| Output format | Guided generation with @Generable schemas | Guarantees structured output without prompt gymnastics |
| Session management | Reuse sessions, manage context window budgets | Creating new sessions per request is expensive on-device |
Keeping classification on-device meant PII never left the machine — no network hop, no third-party data exposure, and latency dropped to single-digit milliseconds for tasks that would take 400ms+ through a cloud API. The tradeoff was a 3B parameter model with a small context window: session reuse was mandatory to stay within device memory, and batch operations needed careful chunking to avoid context overflow that would silently reset the session mid-run. I hit that exact failure during bulk tagging — the model stopped returning results without an error, and it took an hour of debugging to realize the session had silently overflowed.
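The chunking discipline looks roughly like this. Apple's Foundation Models API is Swift-only, so this Python sketch uses a hypothetical `make_session`/`session.tag` client and an assumed token budget purely to show the pattern:

```python
CONTEXT_BUDGET = 4096  # assumed session token budget, not Apple's real figure
CHUNK_SIZE = 10        # items per request, sized to stay well under budget

def tag_all(items: list[str], make_session, estimate_tokens) -> list[str]:
    session = make_session()  # reuse one session across requests
    used = 0
    results = []
    for i in range(0, len(items), CHUNK_SIZE):
        chunk = items[i:i + CHUNK_SIZE]
        cost = estimate_tokens(chunk)
        if used + cost > CONTEXT_BUDGET:
            # Reset deliberately instead of letting the session overflow
            # and silently drop results mid-run.
            session = make_session()
            used = 0
        results.extend(session.tag(chunk))
        used += cost
    return results
```

Tracking the budget explicitly is what turns the silent mid-run failure into a non-event: the reset happens on purpose, at a chunk boundary.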
Project 3: Screening Pipeline
I built this as a learning project: a candidate screening pipeline is the kind of multi-step, judgment-heavy workflow where every AI engineering concept shows up naturally, from structured extraction and scoring to bias detection and human-in-the-loop decisions. It's an 8-agent sequential pipeline that parses job descriptions, sources candidates, scores them, and audits its own fairness. Of the three projects, it was the most architecturally interesting, and the most flawed.
| Decision | What I Did | Why |
|---|---|---|
| Architecture | Sequential agents sharing a mutable context dict | Each agent reads previous output and appends its own |
| Bias detection | Dedicated LLM agent auditing the pipeline's own scores | LLM-as-judge pattern — a second model reviewing the first |
| Human review | Confidence threshold at 0.8 triggers manual review | Below 0.8, the system flags uncertainty rather than guessing |
| Scale approach | Full candidate pool in prompt (15 candidates) | Fits in context window. No embeddings needed yet |
The sequential architecture was a deliberate choice: each agent's output becomes the next agent's input, so the pipeline reads like a conversation — JD parsing feeds sourcing, sourcing feeds screening, screening feeds scoring. That made the system easy to build and easy to reason about when stepping through a single run.
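Stripped to its skeleton, the pipeline is a loop over functions sharing one dict. A sketch with illustrative agent names and a stubbed `call_llm`:

```python
def call_llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}>"  # stand-in for the real call

def parse_jd(ctx: dict) -> None:
    ctx["requirements"] = call_llm(f"Extract requirements:\n{ctx['jd']}")

def source_candidates(ctx: dict) -> None:
    ctx["candidates"] = call_llm(f"Find matches for:\n{ctx['requirements']}")

def score_candidates(ctx: dict) -> None:
    ctx["scores"] = call_llm(f"Score each of:\n{ctx['candidates']}")

PIPELINE = [parse_jd, source_candidates, score_candidates]  # ...plus the rest

def run(jd: str) -> dict:
    ctx = {"jd": jd}
    for agent in PIPELINE:
        agent(ctx)  # every agent can read and mutate every key
    return ctx
```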
The cost showed up later. Shared mutable state meant every agent could read and write to the same context dict, which made it impossible to isolate failures. When the scoring agent produced a bad result, I couldn't tell whether the problem was in the scoring prompt, the screening data it inherited, or the sourcing step that selected the candidates in the first place. Debugging required tracing the entire chain. Agents 3 and 4 could theoretically run in parallel — they don't depend on each other — but the shared dict made that unsafe without a refactor I hadn't earned yet.
The reliability issues went deeper than architecture. The sourcing agent produced non-deterministic outputs — same JD, same candidate pool, different results across runs — which made debugging impossible without a fixed test harness. The scoring agent generated confidence values that drifted across identical inputs, swinging above and below the 0.8 threshold that determined whether a human reviewed the shortlist at all. And the entire pipeline had no retry logic: a single malformed LLM response would silently corrupt the shared context dict, and every downstream agent would inherit the bad data without knowing it. These aren't edge cases. They're the default behavior of any LLM pipeline that hasn't been hardened against the inherent non-determinism of model outputs.
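Hardening that doesn't take much code. Here's one way to sketch the validate-then-commit wrapper the pipeline was missing; `call_llm` and `validate` are assumed callables, not the project's actual API:

```python
import json

class StepFailed(Exception):
    pass

def run_step(prompt: str, call_llm, validate, max_retries: int = 3) -> dict:
    """Retry until the model returns output that passes validation."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            result = json.loads(raw)  # malformed responses raise here
            validate(result)          # schema/range checks raise ValueError
            return result             # only validated data reaches the context
        except (json.JSONDecodeError, ValueError):
            continue
    raise StepFailed(f"no valid output after {max_retries} attempts")
```

Failing loudly at the step boundary is the point: a raised exception beats downstream agents reasoning over corrupted state.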
Five Gaps I'd Been Working Around
The audit surfaced five areas where I'd oversimplified, skipped a technique entirely, or simply hadn't needed to go deeper until now. Each one traced back to a real limitation I'd already hit:
| Topic | What I Was Working Around | What It Actually Cost |
|---|---|---|
| Evaluation | The pipeline produced scores (0.75, 0.85) that influenced shortlisting decisions, but I had no way to validate whether those numbers aligned with human judgment | I ran it five times manually and eyeballed results — missed that the sourcing agent returned different candidates on every run |
| Embeddings | Used keyword matching for all lookups — 7 screen types, 15 candidates — and it worked fine at that size | The moment the candidate pool hits a few hundred, keyword matching misses anyone who describes the same skills in different vocabulary. The approach has a hard ceiling (see the sketch after this table) |
| Tool use | Hardcoded which agents run and in what order, regardless of what the data actually needed | A niche JD and a high-volume JD ran the exact same 8-agent sequence, paying for LLM calls that added nothing for the specific input |
| Agents | Built a fixed pipeline that can't adapt mid-run — if sourcing returns 2 candidates, the pipeline still runs all remaining steps identically to when it returns 20 | No ability to loop back for more candidates, skip unnecessary steps, or adjust scoring strategy based on pool size |
| Fine-tuning | Kept reaching for it conceptually ("maybe I should fine-tune for my design system") without a framework for when it's actually justified | Almost talked myself into weeks of training data prep for a system that handles 18 components — selective context injection already solved the problem at zero training cost |
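To make the embeddings ceiling concrete, here's a sketch of what semantic matching would replace the keyword lookup with, assuming `sentence-transformers` as the dependency; the model choice and data shapes are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_matches(requirement: str, summaries: list[str], top_k: int = 20):
    """Rank candidate summaries by meaning, not shared keywords."""
    req_vec = model.encode(requirement, convert_to_tensor=True)
    cand_vecs = model.encode(summaries, convert_to_tensor=True)
    hits = util.semantic_search(req_vec, cand_vecs, top_k=top_k)[0]
    # "Built RESTful services in Django" now matches a requirement for
    # "Python backend APIs" despite sharing no keywords.
    return [(summaries[hit["corpus_id"]], hit["score"]) for hit in hits]
```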
The Screening Pipeline Had the Most Surface Area
All three projects could benefit from going deeper on these topics, but the screening pipeline had the most room to grow:
- Zero tests. No existing test suite to work around — a clean slate for evaluation (see the test sketch after this list).
- Unvalidated scores. The scoring system produced numbers (0.75, 0.85) with no calibration against human judgment.
- 15 candidates fit in context today. At 500 candidates, the full-context approach overflows. Embeddings become necessary.
- Static candidate source. The sourcing agent reads a JSON file. With tool use, it could search multiple sources and decide which to query.
- Fixed pipeline. 8 agents always run in the same order. An adaptive agent could skip steps or loop back when results are thin.
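The first two bullets pair naturally: a fixed harness is what exposes both instability and uncalibrated scores. A sketch of both checks, where `run_sourcing`, the candidate dicts, and the rank format are assumed shapes rather than the project's real interfaces:

```python
def check_sourcing_stability(run_sourcing, jd: str, pool: list[dict]) -> bool:
    """Same JD, same pool, two runs: a stable agent returns the same IDs."""
    first = run_sourcing(jd, pool, temperature=0)
    second = run_sourcing(jd, pool, temperature=0)
    return [c["id"] for c in first] == [c["id"] for c in second]

def rank_agreement(scores: dict[str, float], human_order: list[str]) -> float:
    """Fraction of adjacent human-ranked pairs the model orders the same way.

    `human_order` lists candidate IDs from best to worst, per a reviewer.
    """
    pairs = list(zip(human_order, human_order[1:]))
    agree = sum(scores[a] >= scores[b] for a, b in pairs)
    return agree / len(pairs)
```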
I worked through the five topics in sequence: evaluation first (measure before you optimize), then embeddings, tool use, agents, and fine-tuning last. Each one built on the previous.
The Recurring Architecture
Looking across all three projects, the engineering judgment was the same every time:
Most AI systems I've seen fail not because of model quality, but because engineers reach for LLM calls where a keyword map, a rule, or a config lookup would be faster, cheaper, and more reliable. The recurring architecture across all three projects wasn't "use the best model." It was "use the model last" — after exhausting every deterministic option. A zero-cost keyword classifier that handles 95% of cases is better engineering than a model call that handles 100% at 400ms and variable cost per request.
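In code, that judgment is just control flow. The shape every one of these systems shares, with illustrative names:

```python
def route(request: str, keyword_classify, rules: dict, call_llm):
    """Deterministic paths first; the model is the last resort."""
    intent = keyword_classify(request)  # free and instant, covers most traffic
    if intent is not None:
        return rules[intent]  # config lookup: no latency, no variance
    return call_llm(request)  # only the remainder pays latency and token cost
```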
That judgment — knowing where the LLM boundary should be — showed up consistently across projects, and it's the same judgment that guided how I approached filling each gap.
Takeaways
- Audit your own systems before deciding what to learn next. Cataloguing architecture decisions reveals both the patterns you're already applying well and the gaps you've been working around instead of solving.
- Name the patterns. Recognizing "selective context injection" or "LLM-as-judge" in your own code turns implicit decisions into a foundation you can extend deliberately — and defend in design reviews.
- Sequence matters. Evaluation first — because without it, every improvement is a guess. I thought the pipeline was working fine until 36 tests proved otherwise.
- The system with the most gaps teaches you the most. Not because it's broken, but because it has the most surface area for engineering rigor — and the failures are real enough to make the lessons stick.
- AI engineering is less about choosing the right model and more about constraining where the model is allowed to operate. The best systems I've built use LLMs for the smallest possible surface area — and rules, keyword maps, and deterministic logic for everything else.