None of My Projects Need Fine-Tuning
After building evaluation frameworks, embedding pipelines, tool-use agents, and ReAct loops across three projects, I evaluated each one for fine-tuning. The answer was the same every time: not yet, probably not ever.
This isn't a fine-tuning tutorial. It's about knowing when fine-tuning is the wrong tool — which, in my experience, is almost always.
What Fine-Tuning Actually Does
Every other technique in this series injects context at inference time. Prompt engineering adds instructions per call. RAG retrieves relevant documents per call. Tool use offers capabilities per call. All of them leave the model unchanged.
Fine-tuning changes the model's weights by training on examples. The knowledge gets baked into the model itself. A fine-tuned model needs no (or a much smaller) system prompt because the behavior is learned, not instructed.
```
Prompt engineering: base model + system prompt → output
                    (model unchanged, context injected each call)

Fine-tuning:        base model + 1,000 labeled examples → new model
                    (knowledge baked into weights)
                    fine-tuned model + prompt → output
```
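The difference is also visible in what each approach consumes. Fine-tuning eats labeled input/output pairs, typically serialized as JSONL chat transcripts. A minimal sketch of that data shape, with made-up examples for a hypothetical "summarize this job description" task (the record structure mirrors the OpenAI chat fine-tuning format; check your provider's docs for the exact schema):

```python
import json

# Hypothetical labeled examples -- the raw material a fine-tuning run consumes.
# Contents are illustrative, not from a real dataset.
examples = [
    {"messages": [
        {"role": "user", "content": "Summarize: Senior Backend Engineer, Go, Postgres"},
        {"role": "assistant", "content": "Backend role: Go + Postgres, senior level."},
    ]},
    {"messages": [
        {"role": "user", "content": "Summarize: Staff SRE, Kubernetes, on-call rotation"},
        {"role": "assistant", "content": "SRE role: Kubernetes, staff level, includes on-call."},
    ]},
]

# Training files are usually JSONL: one labeled example per line.
jsonl = "\n".join(json.dumps(e) for e in examples)
print(len(jsonl.splitlines()), "training examples")
```

Multiply this by the 500-5,000 examples a real run needs and the data-collection cost becomes clear before any GPU is involved.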
The Three Reasons Teams Fine-Tune
| Reason | Goal | Example |
|---|---|---|
| Style consistency | Model always outputs a specific format or tone | Always produce HTML matching your design system |
| Domain knowledge | Model learns proprietary terminology and schemas | Understanding internal API naming conventions |
| Cost reduction | Smaller fine-tuned model replaces larger general model | Fine-tuned 7B model matching GPT-4 quality on a narrow task |
These are legitimate reasons. The problem is that teams reach for fine-tuning before exhausting simpler approaches.
The Tradeoff Table
| | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Setup cost | Hours | Days | Weeks |
| Data needed | None | Your documents | 500-5,000 labeled examples |
| Updates | Instant (edit prompt) | Re-embed docs | Re-train (expensive, slow) |
| Knowledge | Fresh every call | Live updates possible | Frozen at training time |
| Cost per call | Base model price | Base model + retrieval | Smaller model = cheaper |
| Best for | Most tasks | Large knowledge bases | High-volume, stable-format tasks |
The critical row is "Updates." Fine-tuned models are frozen at training time. When your design system adds 12 new components, when screening criteria change, when your API schema evolves — prompt engineering adapts instantly, RAG re-indexes in hours, and a fine-tuned model needs retraining.
When Fine-Tuning Is Actually Justified
All three conditions must be true simultaneously:
- Volume is high. 100,000+ calls/month, where even a small per-call cost reduction compounds into thousands of dollars monthly. At 1,000 calls/month, a fine-tuning run costing a few hundred dollars would take more than a year of cheaper per-call inference to pay back.
- Format is rigid. Output must match an exact schema with no variation. The task is well-defined and stable enough that the training data won't go stale within months.
- Prompt engineering has hit a ceiling. Optimized prompts still produce a 10-15%+ failure rate on the target format, with real downstream impact: malformed outputs breaking consumers, or manual correction eating team hours.
If any one of these is false — volume is low, format is evolving, or prompts already work well enough — fine-tuning costs more than it saves.
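The third condition only counts if the ceiling is measured, not felt. A minimal sketch of a format-failure check over captured model outputs (the outputs and the required schema here are hypothetical):

```python
import json

# Hypothetical outputs captured from a prompt-engineered model.
outputs = [
    '{"component": "Button", "tokens": ["color-primary"]}',
    'Sure! Here is the JSON: {"component": "Card"}',   # chatty preamble -> fails
    '{"component": "Modal", "tokens": []}',
    '{"component": "Nav"',                             # truncated -> fails
]

def matches_format(text: str) -> bool:
    """Strict check: output must be bare JSON containing the required keys."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and {"component", "tokens"} <= obj.keys()

failure_rate = sum(not matches_format(o) for o in outputs) / len(outputs)
print(f"format failure rate: {failure_rate:.0%}")  # 2 of 4 fail the strict check
```

Run the same check over a few hundred real calls; if the rate sits below the 10-15% threshold, the ceiling hasn't been hit and the fine-tuning conversation ends there.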
There's also a failure mode that's easy to miss: fine-tuned models tend to overfit to training patterns, improving consistency on known cases while degrading performance on edge cases and novel inputs. A model fine-tuned on 1,000 JD-to-candidate pairings will handle standard backend roles well but may score a DevOps-to-SRE career transition worse than the base model would — because the training data didn't include that pattern, and the fine-tuning narrowed the model's ability to reason about it.
Applied to Three Real Projects
Design System UI Generator — Closest Candidate, Still Not Worth It
The UI generator has the most fine-tuning-friendly profile: well-defined task (HTML matching a design system), rigid output format (specific Tailwind classes and component patterns), and a clear quality metric (does the output use the correct design tokens?).
But the design system changes. Eighteen components today, potentially thirty tomorrow. Every component addition or modification would require new training data and retraining. Selective context injection handles this without retraining — I just update the component patterns in the prompt.
Verdict: stay with selective context injection. At current volume, the inference cost is negligible. Revisit only if generation scales to 50,000+ mockups/month — at that point, the per-call savings of a smaller fine-tuned model would offset the retraining overhead.
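Selective context injection is simple enough to show in a sketch. The component patterns below are hypothetical placeholders, but the mechanism is the real point: updating the design system is a dictionary edit, not a retraining run.

```python
# Hypothetical design-system patterns (names and markup are illustrative).
# Adding component #19 tomorrow means adding one entry here -- no retraining.
COMPONENT_PATTERNS = {
    "button": '<button class="btn btn-primary">...</button>',
    "card": '<div class="card p-4 shadow">...</div>',
    "modal": '<div class="modal fixed inset-0">...</div>',
}

def build_prompt(request: str) -> str:
    """Inject only the component patterns the request actually mentions."""
    relevant = [
        f"{name}: {snippet}"
        for name, snippet in COMPONENT_PATTERNS.items()
        if name in request.lower()
    ]
    # Fall back to the full pattern list if nothing matched.
    context = "\n".join(relevant) or "\n".join(
        f"{n}: {s}" for n, s in COMPONENT_PATTERNS.items()
    )
    return f"Use these design-system patterns:\n{context}\n\nTask: {request}"

prompt = build_prompt("Generate a card with a button")
```

A production version would match on embeddings rather than substrings, but the flexibility argument is identical: the knowledge lives in data the prompt reads at call time, not in frozen weights.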
Screening Pipeline — Almost Certainly Never
The pipeline requires general reasoning: bias detection, nuanced candidate scoring, adaptive sourcing strategies. Fine-tuning teaches patterns, not reasoning. Training on 15 candidates and 5 historical runs would overfit immediately. Requirements evolve constantly — new roles, new scoring dimensions, new bias patterns to detect.
Fine-tuning would actively degrade the general reasoning capability the pipeline depends on. The bias auditor works precisely because it uses a general-purpose model that can identify unfair patterns it's never seen in training data. A fine-tuned model trained on 15 candidates would overfit to those specific profiles and lose the ability to generalize.
Verdict: not a fine-tuning candidate. The cost of losing general reasoning far outweighs any per-call savings. Focus on prompt improvement and evaluation.
Apple FM (On-Device) — The Eventual Candidate
Apple's 3B parameter on-device model is the strongest case for fine-tuning. Binary classification tasks (PII detection, intent classification) are exactly what fine-tuning excels at: narrow scope, rigid output, high volume. Fine-tuning a small model for on-device inference keeps everything private, with no cloud exposure needed.
The blocker: Apple's Foundation Models SDK doesn't expose fine-tuning in the Python API yet. When it does, PII classification is the first candidate — the evaluation framework from the second article in this series already provides the golden test set to validate whether a fine-tuned model actually outperforms the base model with few-shot prompting.
On-device inference also eliminates network latency entirely, on top of the privacy guarantee of data never leaving the device. A small, specialized model with zero network overhead and full data privacy, serving a high-volume classification task, is the scenario where fine-tuning delivers the most value relative to the alternatives.
Verdict: watch for when Apple exposes fine-tuning. PII classification and commit message generation are strong first candidates.
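Until fine-tuning is exposed, the baseline it would have to beat is cheap to stand up. A minimal sketch of a few-shot PII-classification prompt, model-agnostic and hypothetical in its examples:

```python
# A hypothetical few-shot prompt for binary PII classification -- the
# baseline any future on-device fine-tune must outperform on the golden set.
FEW_SHOT_TEMPLATE = """Classify whether the text contains PII. Answer YES or NO.

Text: "Call me at 555-0142 after lunch."
Answer: YES

Text: "The deploy finished in 4 minutes."
Answer: NO

Text: "{text}"
Answer:"""

def build_classifier_prompt(text: str) -> str:
    """Fill the few-shot template with the text to classify."""
    return FEW_SHOT_TEMPLATE.format(text=text)

prompt = build_classifier_prompt("Email jane.doe@example.com the draft")
```

If a fine-tuned 3B model can't meaningfully beat this prompt on the golden test set, the training run wasn't justified, no matter how architecturally appealing it looks.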
LoRA: What You'd Actually Use
Full fine-tuning updates all model weights: expensive, slow, and demanding of serious GPU hardware. LoRA (Low-Rank Adaptation) freezes the base model and trains tiny adapter layers on top:
```
Base model:   7B parameters (frozen)
LoRA adapter: ~10M parameters (trained)
Result:       fine-tuned behavior at ~1% of the cost
```
HuggingFace's peft library implements LoRA. Unsloth makes training 2-5x faster on consumer GPUs. If you ever fine-tune, this is the approach — not full weight updates.
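The ~10M figure can be sanity-checked with back-of-envelope arithmetic. LoRA replaces each adapted weight update with two low-rank factors, so an adapted d×d linear layer gains only r·(d+d) trainable parameters. The architecture numbers below are assumptions chosen to resemble a 7B Llama-style model:

```python
# Back-of-envelope LoRA adapter size (architecture numbers are assumptions
# for illustration, roughly Llama-7B-shaped).
n_layers, hidden = 32, 4096
rank = 8                  # LoRA rank r, a common default
targets_per_layer = 4     # q/k/v/o attention projections

# Each adapted (hidden x hidden) linear gains A (hidden x r) + B (r x hidden).
params_per_target = rank * (hidden + hidden)
adapter_params = n_layers * targets_per_layer * params_per_target

base_params = 7_000_000_000
print(f"adapter: {adapter_params / 1e6:.1f}M trainable params "
      f"({adapter_params / base_params:.2%} of base)")
```

With these assumptions the adapter lands around 8M parameters, well under 1% of the base model, which is why LoRA runs fit on consumer GPUs while full fine-tuning doesn't.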
The Decision Tree
Before evaluating fine-tuning, run through this:
| Question | If No |
|---|---|
| Do you have stable, well-defined requirements? | Fix requirements first |
| Is prompt engineering + RAG producing unacceptable results? | You don't need fine-tuning |
| Do you have 500+ high-quality labeled input/output pairs? | Collect more data first |
| Is the task high-volume (100k+ calls/month)? | Prompt cost is fine, revisit later |
Only if every answer is yes is fine-tuning worth evaluating. And even then, the validation step matters: hold out 10-20% of examples before training, then compare the fine-tuned model against few-shot prompting on the same test set. If the fine-tuned model doesn't beat few-shot by more than 10 percentage points, it wasn't worth the effort.
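The validation step reduces to a few lines once you have pass/fail results on the held-out set. A sketch with fabricated results (the numbers are illustrative, not from a real run):

```python
def pass_rate(results: list) -> float:
    """Fraction of held-out examples that passed the format/quality check."""
    return sum(results) / len(results)

# Hypothetical pass/fail results on the same 20-example held-out set.
few_shot_results  = [True] * 17 + [False] * 3   # 85% with few-shot prompting
fine_tune_results = [True] * 18 + [False] * 2   # 90% with the fine-tuned model

lift = pass_rate(fine_tune_results) - pass_rate(few_shot_results)
# Decision rule from the text: the fine-tune must clear few-shot
# by more than 10 percentage points on the identical held-out set.
worth_it = lift > 0.10
print(f"lift: {lift:+.0%}, fine-tune justified: {worth_it}")
```

Here a 5-point lift fails the bar: a real improvement, but not enough to justify the training cost and the frozen-knowledge tradeoff.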
The Hierarchy
This is the order I've applied across three projects, and it maps directly to the progression of this series:
| Priority | Technique | Setup Time | When It Helps |
|---|---|---|---|
| 1 | Prompt engineering | Hours | Always start here |
| 2 | Selective context / RAG | Days | When knowledge is the gap |
| 3 | Tool use / agents | Days | When steps need to be dynamic |
| 4 | Fine-tuning | Weeks | Only when all above are exhausted at scale |
Each level is more powerful but more expensive and less flexible. The right strategy is to move down the list only when the level above genuinely isn't enough — not because the technique sounds more sophisticated.
Takeaways
- Fine-tuning is the last resort, not the first instinct. Prompt engineering, RAG, and tool use solve most problems with less cost, faster iteration, and no retraining when requirements change.
- Three conditions must all be true to justify fine-tuning: high volume (100k+ calls/month), rigid format, and prompt engineering hitting a measurable ceiling. If any one is false, the simpler approach wins.
- Fine-tuning degrades general reasoning. Tasks that require bias detection, adaptive strategies, or nuanced judgment get worse with fine-tuning, not better. The model learns patterns at the expense of flexibility.
- If you do fine-tune, use LoRA. Full weight updates are expensive and rarely necessary. LoRA trains tiny adapter layers at roughly 1% of the cost while preserving the base model's capabilities.
- Most AI systems don't fail because they lack fine-tuning — they fail because simpler approaches weren't pushed far enough. Better prompts, tighter context injection, and proper evaluation will get you further than any training run.