The Patterns Already Had Names
Three months into building AI-powered applications — a UI generator, an on-device SDK integration, an automated screening pipeline — I sat down to map what I'd actually built. Not the features. The architecture decisions.
Selective context injection instead of RAG. Temperature tuning per phase. LLM-as-judge for bias auditing. Sequential multi-agent pipelines with shared mutable context. I'd shipped all of these by solving one concrete problem at a time. What I hadn't done was name them, evaluate whether they were the right call, or identify where I'd been working around gaps instead of filling them.
This is the first in a series on deepening AI engineering skills by building on real systems.
Three Projects, One Pattern
I catalogued every AI-related decision across the three projects. The same engineering pattern showed up every time: start with the simplest approach, use rules where possible, and call the LLM only where generative power is actually needed.
Project 1: Design System UI Generator
A system where users type plain English and get live HTML mockups matching a specific design system.
| Decision | What I Did | Why |
|---|---|---|
| Context strategy | Selective injection, not RAG | Design system is 18 components — small and stable enough to filter, not search |
| Intent detection | Keyword map, zero LLM calls | 7 screen types, deterministic lookup — saves ~400ms and one API call per request |
| Temperature | 0 for classification, 0.3 for generation | Classification needs consistency, generation needs creativity |
| Token budget | Two-phase generation | Classify first (cheap), then inject only relevant patterns — 44% fewer tokens per request, cutting inference cost and keeping generation under 4 seconds for real-time UI preview (see the sketch after this table) |
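The two-phase flow is easier to see as code than as a table. Here's a minimal sketch, where `KEYWORD_MAP`, `COMPONENT_PATTERNS`, and the prompt wording are illustrative stand-ins for the real design system tables:

```python
KEYWORD_MAP = {
    "login": "auth_screen",
    "sign in": "auth_screen",
    "dashboard": "dashboard",
    "settings": "settings",
    # ...one entry per phrase, covering all 7 screen types
}

COMPONENT_PATTERNS = {
    "auth_screen": ["TextField", "PrimaryButton", "Card"],
    "dashboard": ["NavBar", "StatCard", "Chart"],
    "settings": ["NavBar", "ListRow", "Toggle"],
}

def classify(request: str) -> str | None:
    """Phase 1: deterministic lookup, zero LLM calls."""
    text = request.lower()
    for keyword, screen_type in KEYWORD_MAP.items():
        if keyword in text:
            return screen_type
    return None  # unknown intent: fall back to an LLM call at temperature 0

def build_generation_prompt(request: str, screen_type: str) -> str:
    """Phase 2: inject only the patterns this screen type needs."""
    components = ", ".join(COMPONENT_PATTERNS[screen_type])
    return (
        f"Generate HTML for a {screen_type} using only these components: "
        f"{components}.\nUser request: {request}"
    )
```

Only phase 2 touches the model (at temperature 0.3); everything before it is a dictionary lookup.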
Project 2: Apple FM SDK Integration
On-device inference with Apple's Foundation Models SDK. The positioning decision was the interesting one: Apple FM handles classification and privacy-sensitive tasks, cloud models handle reasoning.
| Decision | What I Did | Why |
|---|---|---|
| Model selection | Apple FM for classification, Claude for reasoning | Binary classification (PII detection) doesn't need Opus-level reasoning |
| Output format | Guided generation with @Generable schemas | Guarantees structured output without prompt gymnastics |
| Session management | Reuse sessions, manage context window budgets | Creating new sessions per request is expensive on-device |
Keeping classification on-device meant PII never left the machine — no network hop, no third-party data exposure, and latency dropped to single-digit milliseconds for tasks that would take 400ms+ through a cloud API. The tradeoff was a 3B parameter model with a small context window: session reuse was mandatory to stay within device memory, and batch operations needed careful chunking to avoid context overflow that would silently reset the session mid-run. I hit that exact failure during bulk tagging — the model stopped returning results without an error, and it took an hour of debugging to realize the session had silently overflowed.
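The chunking discipline looks roughly like this. Apple's Foundation Models API is Swift-only, so this Python sketch uses a hypothetical `make_session`/`session.tag` client and an assumed token budget purely to show the pattern:

```python
CONTEXT_BUDGET = 4096  # assumed session token budget, not Apple's real figure
CHUNK_SIZE = 10        # items per request, sized to stay well under budget

def tag_all(items: list[str], make_session, estimate_tokens) -> list[str]:
    session = make_session()  # reuse one session across requests
    used = 0
    results = []
    for i in range(0, len(items), CHUNK_SIZE):
        chunk = items[i:i + CHUNK_SIZE]
        cost = estimate_tokens(chunk)
        if used + cost > CONTEXT_BUDGET:
            # Reset deliberately instead of letting the session overflow
            # and silently drop results mid-run.
            session = make_session()
            used = 0
        results.extend(session.tag(chunk))
        used += cost
    return results
```

Tracking the budget explicitly is what turns the silent mid-run failure into a non-event: the reset happens on purpose, at a chunk boundary.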
Project 3: Screening Pipeline
I built this as a learning project: a candidate screening pipeline is the kind of multi-step, judgment-heavy workflow where every AI engineering concept shows up naturally, from structured extraction and scoring to bias detection and human-in-the-loop decisions. It's an 8-agent sequential pipeline that parses job descriptions, sources candidates, scores them, and audits its own fairness. Of the three projects, it was the most architecturally interesting, and the most flawed.
| Decision | What I Did | Why |
|---|---|---|
| Architecture | Sequential agents sharing a mutable context dict | Each agent reads previous output and appends its own |
| Bias detection | Dedicated LLM agent auditing the pipeline's own scores | LLM-as-judge pattern — a second model reviewing the first |
| Human review | Confidence threshold at 0.8 triggers manual review | Below 0.8, the system flags uncertainty rather than guessing |
| Scale approach | Full candidate pool in prompt (15 candidates) | Fits in context window. No embeddings needed yet |
The sequential architecture was a deliberate choice: each agent's output becomes the next agent's input, so the pipeline reads like a conversation — JD parsing feeds sourcing, sourcing feeds screening, screening feeds scoring. That made the system easy to build and easy to reason about when stepping through a single run.
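Stripped to its skeleton, the pipeline is a loop over functions sharing one dict. A sketch with illustrative agent names and a stubbed `call_llm`:

```python
def call_llm(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}>"  # stand-in for the real call

def parse_jd(ctx: dict) -> None:
    ctx["requirements"] = call_llm(f"Extract requirements:\n{ctx['jd']}")

def source_candidates(ctx: dict) -> None:
    ctx["candidates"] = call_llm(f"Find matches for:\n{ctx['requirements']}")

def score_candidates(ctx: dict) -> None:
    ctx["scores"] = call_llm(f"Score each of:\n{ctx['candidates']}")

PIPELINE = [parse_jd, source_candidates, score_candidates]  # ...plus the rest

def run(jd: str) -> dict:
    ctx = {"jd": jd}
    for agent in PIPELINE:
        agent(ctx)  # every agent can read and mutate every key
    return ctx
```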
The cost showed up later. Shared mutable state meant every agent could read and write to the same context dict, which made it impossible to isolate failures. When the scoring agent produced a bad result, I couldn't tell whether the problem was in the scoring prompt, the screening data it inherited, or the sourcing step that selected the candidates in the first place. Debugging required tracing the entire chain. Agents 3 and 4 could theoretically run in parallel — they don't depend on each other — but the shared dict made that unsafe without a refactor I hadn't earned yet.
The reliability issues went deeper than architecture. The sourcing agent produced non-deterministic outputs — same JD, same candidate pool, different results across runs — which made debugging impossible without a fixed test harness. The scoring agent generated confidence values that drifted across identical inputs, swinging above and below the 0.8 threshold that determined whether a human reviewed the shortlist at all. And the entire pipeline had no retry logic: a single malformed LLM response would silently corrupt the shared context dict, and every downstream agent would inherit the bad data without knowing it. These aren't edge cases. They're the default behavior of any LLM pipeline that hasn't been hardened against the inherent non-determinism of model outputs.
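Hardening that doesn't take much code. Here's one way to sketch the validate-then-commit wrapper the pipeline was missing; `call_llm` and `validate` are assumed callables, not the project's actual API:

```python
import json

class StepFailed(Exception):
    pass

def run_step(prompt: str, call_llm, validate, max_retries: int = 3) -> dict:
    """Retry until the model returns output that passes validation."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            result = json.loads(raw)  # malformed responses raise here
            validate(result)          # schema/range checks raise ValueError
            return result             # only validated data reaches the context
        except (json.JSONDecodeError, ValueError):
            continue
    raise StepFailed(f"no valid output after {max_retries} attempts")
```

Failing loudly at the step boundary is the point: a raised exception beats downstream agents reasoning over corrupted state.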
Five Gaps I'd Been Working Around
The audit surfaced five areas where I'd oversimplified, skipped a technique entirely, or simply hadn't needed to go deeper until now. Each one traced back to a real limitation I'd already hit:
| Topic | What I Was Working Around | What It Actually Cost |
|---|---|---|
| Evaluation | The pipeline produced scores (0.75, 0.85) that influenced shortlisting decisions, but I had no way to validate whether those numbers aligned with human judgment | I ran it five times manually and eyeballed results — missed that the sourcing agent returned different candidates on every run |
| Embeddings | Used keyword matching for all lookups — 7 screen types, 15 candidates — and it worked fine at that size | The moment the candidate pool hits a few hundred, keyword matching misses anyone who describes the same skills in different vocabulary. The approach has a hard ceiling (see the sketch after this table) |
| Tool use | Hardcoded which agents run and in what order, regardless of what the data actually needed | A niche JD and a high-volume JD ran the exact same 8-agent sequence, paying for LLM calls that added nothing for the specific input |
| Agents | Built a fixed pipeline that can't adapt mid-run — if sourcing returns 2 candidates, the pipeline still runs all remaining steps identically to when it returns 20 | No ability to loop back for more candidates, skip unnecessary steps, or adjust scoring strategy based on pool size |
| Fine-tuning | Kept reaching for it conceptually ("maybe I should fine-tune for my design system") without a framework for when it's actually justified | Almost talked myself into weeks of training data prep for a system that handles 18 components — selective context injection already solved the problem at zero training cost |
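To make the embeddings ceiling concrete, here's a sketch of what semantic matching would replace the keyword lookup with, assuming `sentence-transformers` as the dependency; the model choice and data shapes are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_matches(requirement: str, summaries: list[str], top_k: int = 20):
    """Rank candidate summaries by meaning, not shared keywords."""
    req_vec = model.encode(requirement, convert_to_tensor=True)
    cand_vecs = model.encode(summaries, convert_to_tensor=True)
    hits = util.semantic_search(req_vec, cand_vecs, top_k=top_k)[0]
    # "Built RESTful services in Django" now matches a requirement for
    # "Python backend APIs" despite sharing no keywords.
    return [(summaries[hit["corpus_id"]], hit["score"]) for hit in hits]
```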
The Screening Pipeline Had the Most Surface Area
All three projects could benefit from going deeper on these topics, but the screening pipeline had the most room to grow:
- Zero tests. No existing test suite to work around — a clean slate for evaluation (see the test sketch after this list).
- Unvalidated scores. The scoring system produced numbers (0.75, 0.85) with no calibration against human judgment.
- 15 candidates fit in context today. At 500 candidates, the full-context approach overflows. Embeddings become necessary.
- Static candidate source. The sourcing agent reads a JSON file. With tool use, it could search multiple sources and decide which to query.
- Fixed pipeline. 8 agents always run in the same order. An adaptive agent could skip steps or loop back when results are thin.
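The first two bullets pair naturally: a fixed harness is what exposes both instability and uncalibrated scores. A sketch of both checks, where `run_sourcing`, the candidate dicts, and the rank format are assumed shapes rather than the project's real interfaces:

```python
def check_sourcing_stability(run_sourcing, jd: str, pool: list[dict]) -> bool:
    """Same JD, same pool, two runs: a stable agent returns the same IDs."""
    first = run_sourcing(jd, pool, temperature=0)
    second = run_sourcing(jd, pool, temperature=0)
    return [c["id"] for c in first] == [c["id"] for c in second]

def rank_agreement(scores: dict[str, float], human_order: list[str]) -> float:
    """Fraction of adjacent human-ranked pairs the model orders the same way.

    `human_order` lists candidate IDs from best to worst, per a reviewer.
    """
    pairs = list(zip(human_order, human_order[1:]))
    agree = sum(scores[a] >= scores[b] for a, b in pairs)
    return agree / len(pairs)
```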
I worked through the five topics in sequence: evaluation first (measure before you optimize), then embeddings, tool use, agents, and fine-tuning last. Each one built on the previous.
The Recurring Architecture
Looking across all three projects, the engineering judgment was the same every time:
Most AI systems I've seen fail not because of model quality, but because engineers reach for LLM calls where a keyword map, a rule, or a config lookup would be faster, cheaper, and more reliable. The recurring architecture across all three projects wasn't "use the best model." It was "use the model last" — after exhausting every deterministic option. A zero-cost keyword classifier that handles 95% of cases is better engineering than a model call that handles 100% at 400ms and variable cost per request.
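In code, that judgment is just control flow. The shape every one of these systems shares, with illustrative names:

```python
def route(request: str, keyword_classify, rules: dict, call_llm):
    """Deterministic paths first; the model is the last resort."""
    intent = keyword_classify(request)  # free and instant, covers most traffic
    if intent is not None:
        return rules[intent]  # config lookup: no latency, no variance
    return call_llm(request)  # only the remainder pays latency and token cost
```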
That judgment — knowing where the LLM boundary should be — showed up consistently across projects, and it's the same judgment that guided how I approached filling each gap.
Takeaways
- Audit your own systems before deciding what to learn next. Cataloguing architecture decisions reveals both the patterns you're already applying well and the gaps you've been working around instead of solving.
- Name the patterns. Recognizing "selective context injection" or "LLM-as-judge" in your own code turns implicit decisions into a foundation you can extend deliberately — and defend in design reviews.
- Sequence matters. Evaluation first — because without it, every improvement is a guess. I thought the pipeline was working fine until 36 tests proved otherwise.
- The system with the most gaps teaches you the most. Not because it's broken, but because it has the most surface area for engineering rigor — and the failures are real enough to make the lessons stick.
- AI engineering is less about choosing the right model and more about constraining where the model is allowed to operate. The best systems I've built use LLMs for the smallest possible surface area — and rules, keyword maps, and deterministic logic for everything else.