The Best Match Had the Worst Score
After adding evaluation tests to the screening pipeline in the previous article, I needed to improve the sourcing step. Instead of sending all 15 candidates to the LLM, I wanted to narrow down to the most relevant ones first. I added vector search — a way to find candidates whose profiles are most similar to the job description. I tested it with a single JD for a Senior Backend Engineer (Python, PostgreSQL, AWS, Docker, 5-10 years). The system ranked candidates by how closely their profile matched the JD, measured by cosine similarity:
| Rank | Candidate | Similarity | Notes |
|---|---|---|---|
| 1 | 3yr frontend developer | 0.622 | Junior, wrong specialty |
| 2 | 6yr Python/FastAPI engineer | 0.601 | Correct match |
| 3 | 6yr Python/Go engineer | 0.593 | Correct match |
| 9 | 9yr Principal Engineer | 0.458 | Most qualified — missed |
The junior developer's profile mentions "Python", "AWS", "backend services" — words that appear verbatim in the JD. High vocabulary overlap, high cosine similarity, wrong answer.
The Principal Engineer's profile has "Kafka", "Cassandra", "Microservices", "1B+ events/day" — equivalent or superior capabilities described in different vocabulary. The embedding model didn't know that processing a billion events a day is more impressive than listing the same keywords as the JD.
Embeddings matched vocabulary, not capability. At scale, this silently filters out strong candidates who describe their experience differently from the JD.
How Embeddings Work
The idea behind vector search is simple: convert text into a list of numbers that represents its meaning. Two pieces of text with similar meaning produce similar numbers. The system compares those numbers instead of matching keywords.
| Profile | Meaning (simplified) | Match to "Python backend engineer"? |
|---|---|---|
| "Python backend engineer with FastAPI" | [0.12, 0.84, ...] | 1.00 — exact match |
| "Senior developer working on payment APIs" | [0.15, 0.79, ...] | 0.89 — very similar |
| "WordPress freelance developer" | [0.91, 0.03, ...] | 0.12 — not similar |
The first two profiles describe related roles — their numbers are close. The third is a different domain entirely. No rules needed — the model learns what "similar" means from the text alone.
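To make the comparison concrete, here is a minimal sketch using sentence-transformers and its cosine similarity helper. The profile strings are the illustrative examples from the table; actual scores will differ from the simplified numbers shown above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Python backend engineer"
profiles = [
    "Python backend engineer with FastAPI",
    "Senior developer working on payment APIs",
    "WordPress freelance developer",
]

# Encode the query and each profile into fixed-length vectors
query_vec = model.encode(query)
profile_vecs = model.encode(profiles)

# Cosine similarity: closer to 1.0 means more similar meaning
scores = util.cos_sim(query_vec, profile_vecs)[0]
for profile, score in zip(profiles, scores):
    print(f"{float(score):.2f}  {profile}")
```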
The Pipeline I Built
The implementation uses sentence-transformers[^1] to convert text to numbers and Chroma[^2] as the database. Converting all 15 candidates takes 0.14 seconds. Finding the closest matches takes about 1ms — the entire pre-filtering step is nearly instant compared to the LLM call it feeds into.
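For reference, a rough sketch of how those timings can be reproduced (the candidate texts are placeholders and the numbers will vary by machine; the library details are in the footnotes):

```python
import time

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
candidates = [f"Candidate {i}: Python backend engineer, {i + 1} years" for i in range(15)]

# Step 1: convert all candidate profiles to vectors
start = time.perf_counter()
embeddings = model.encode(candidates)
print(f"encode 15 candidates: {time.perf_counter() - start:.2f}s")

# Step 2: load them into an in-memory Chroma collection
collection = chromadb.Client().create_collection("candidates")
collection.add(
    ids=[str(i) for i in range(15)],
    documents=candidates,
    embeddings=embeddings.tolist(),
)

# Step 3: find the closest matches (query embedding computed outside the timer)
query_vec = model.encode(["Senior backend engineer, Python, PostgreSQL"]).tolist()
start = time.perf_counter()
collection.query(query_embeddings=query_vec, n_results=6)
print(f"query: {(time.perf_counter() - start) * 1000:.1f}ms")
```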
Prose Over JSON
One design decision worth noting: I convert each candidate's data into a natural language paragraph rather than feeding the raw structured data. The model was trained on sentences, not data formats.
JSON:

```json
{"years_experience": 6, "skills": ["Python", "Go"], "role": "Senior Engineer"}
```

Prose: "Senior Software Engineer with 6 years of experience at a fintech company. Skills: Python, Go, Kubernetes, PostgreSQL. Led payment processing redesign serving 10M+ transactions/day."
I tested this — prose produced clearer differences between candidates and caught the Principal Engineer that structured data missed.
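A minimal sketch of that conversion step, assuming hypothetical field names (the real pipeline's schema may differ):

```python
def profile_to_prose(candidate: dict) -> str:
    # Hypothetical helper: render a structured record as a natural
    # language paragraph before embedding it
    skills = ", ".join(candidate["skills"])
    parts = [
        f"{candidate['role']} with {candidate['years_experience']} years of experience.",
        f"Skills: {skills}.",
    ]
    if candidate.get("highlight"):
        parts.append(candidate["highlight"])
    return " ".join(parts)

candidate = {
    "years_experience": 6,
    "skills": ["Python", "Go"],
    "role": "Senior Engineer",
    "highlight": "Led payment processing redesign serving 10M+ transactions/day.",
}
print(profile_to_prose(candidate))
# Senior Engineer with 6 years of experience. Skills: Python, Go. Led payment ...
```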
Pre-filter, Not Judge
After seeing the junior developer at #1 and the Principal Engineer at #9, the temptation was to tune the embedding — try a bigger model, add more context, re-weight the dimensions. But the deeper lesson is about the role of embeddings in the pipeline.
The right mental model:
Wrong: 15 candidates → vector search top-6 → shortlist
(search makes the hiring decision)
Right: 500 candidates → vector search top-20 → LLM → shortlist
(search narrows the field, LLM makes the decision)
Vector search doesn't need to be perfect. It needs to be:
- Better than random. It is — the strongest candidates appear consistently in the top results.
- Fast. About 1ms per query, regardless of pool size.
- Scalable. Same speed at 15 or 15,000 candidates.
What I Tested
The initial results used all-MiniLM-L6-v2 — a small, fast model optimized for speed over accuracy. Before accepting the ranking as a fundamental limitation, I ran three experiments to see what could improve it.
Does a Better Model Fix It?
I swapped to all-mpnet-base-v2 — a larger model trained on more text with higher dimensional output (768 vs 384):
| Rank | all-MiniLM-L6-v2 | all-mpnet-base-v2 |
|---|---|---|
| 1 | 3yr frontend dev | 6yr Python/FastAPI engineer |
| 2 | 6yr Python/FastAPI engineer | Staff Engineer, 8yr |
| 3 | 6yr Python/Go engineer | 6yr Python/Go engineer |
| 6 | 4yr data engineer | 9yr Principal Engineer |
The junior developer who ranked #1 dropped out entirely. The Principal Engineer who was missed at #9 surfaced at #6.
Result: Model choice had a bigger impact on correctness than any other variable I tested.
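Swapping models is a one-line change in sentence-transformers. A minimal sketch of the comparison, with abbreviated stand-ins for the candidate profiles (scores and orderings will vary):

```python
from sentence_transformers import SentenceTransformer, util

jd = "Senior Backend Engineer: Python, PostgreSQL, AWS, Docker, 5-10 years"
profiles = [
    "Frontend developer, 3 years, React, some Python and AWS exposure",
    "Senior Python engineer, 6 years, FastAPI, PostgreSQL, AWS",
    "Principal Engineer, 9 years, Kafka, Cassandra, microservices, 1B+ events/day",
]

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(model_name)
    # Rank all profiles against the JD by cosine similarity
    scores = util.cos_sim(model.encode(jd), model.encode(profiles))[0]
    ranked = sorted(zip(profiles, scores), key=lambda pair: -float(pair[1]))
    print(model_name)
    for profile, score in ranked:
        print(f"  {float(score):.3f}  {profile}")
```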
Does Query Phrasing Matter?
I compared the raw JD (specific tool names like "FastAPI", "PostgreSQL", "Kafka") against a hand-written summary using generic phrases like "cloud infrastructure" and "senior level."
| | Specific JD | Vague summary |
|---|---|---|
| Top 6 candidates | Correct matches surfaced | Same 6, different order |
| #1 ranked | 6yr Python/FastAPI engineer | Management-focused candidate |
| Similarity scores | Higher across the board | All scores dropped |
A management-focused candidate rose to #1 under the vague summary because "senior level" matched her vocabulary. Less specific vocabulary means less signal.
Result: Model choice changed which candidates appeared. Query phrasing only changed the ordering. Model matters more.
Does Adding More Layers Fix It?
I added two more steps on top of vector search: keyword matching (to catch exact term overlaps that vector search might miss) and a re-ranking model (to re-score the combined results more carefully).
| Layer | Effect | Verdict |
|---|---|---|
| Vector search alone | Missed one relevant candidate | Baseline |
| + Keyword matching | Surfaced the missed candidate | Helped |
| + Re-ranking model | Dropped Principal Engineer, brought back 3yr frontend dev | Made it worse |
The re-ranking model was trained on web search questions, not hiring. It scores "does this text answer this query?" not "is this person qualified?"
Result: A more sophisticated model doesn't mean better results when it's trained on the wrong problem.
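For reference, the re-ranking layer looked roughly like this. The checkpoint name is an assumption on my part: the standard MS MARCO cross-encoder from sentence-transformers, which is exactly the web-search training data described above.

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint: trained on MS MARCO web search relevance, not hiring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

jd = "Senior Backend Engineer: Python, PostgreSQL, AWS, Docker, 5-10 years"
candidates = [
    "Frontend developer, 3 years, React, some Python and AWS exposure",
    "Principal Engineer, 9 years, Kafka, Cassandra, 1B+ events/day",
]

# The cross-encoder reads query and document together and scores
# "does this text answer this query?" for each pair
scores = reranker.predict([(jd, c) for c in candidates])
for candidate, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {candidate}")
```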
Why This Matters at Scale
LLMs have a limit on how much text they can process at once (the context window). Instead of sending all candidates, RAG[^3] sends only the most relevant ones:
| Approach | Candidates Sent to LLM | Outcome |
|---|---|---|
| Send all (15 candidates) | 15 | fits in context window |
| Search first, send top 6 | 6 | 61% less text to process |
| Send all (500 candidates) | 500 | context window overflow — fails |
| Search first, send top 10 | 10 | 98% less text — works |
At 15 candidates, it barely matters. At 500, vector search becomes the only option — without it, the pipeline can't run at all.
What I'd Change Structurally
- Combine keyword matching with vector search. Keyword matching catches exact term overlaps that vector search misses. Together they surface more relevant candidates — but keyword results need filtering to avoid false positives on common terms.
- Skip the re-ranking model unless it's trained on the right problem. A re-ranking model trained on web search made the hiring pipeline worse. Let the LLM make the final judgment instead.
- Save the converted data. Avoid re-processing all candidates on every run — convert once, store, and reuse; see the sketch after this list.
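A minimal sketch of that last point, using Chroma's on-disk client (the path, collection, and helper names are placeholders):

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# PersistentClient writes the collection to disk, so it survives restarts
client = chromadb.PersistentClient(path="./candidate_db")
collection = client.get_or_create_collection("candidates")

def index_candidate(candidate_id: str, prose: str) -> None:
    # Skip candidates that are already stored; only embed new ones
    if collection.get(ids=[candidate_id])["ids"]:
        return
    collection.add(
        ids=[candidate_id],
        documents=[prose],
        embeddings=model.encode([prose]).tolist(),
    )
```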
Takeaways
- Model choice matters most. A larger model fixed the false #1 and surfaced the missed Principal Engineer. No amount of query tuning or text formatting compensated for a weaker model.
- Vector search narrows the field. The LLM makes the decision. Don't use search results as the final answer — use them to decide what the LLM should evaluate.
- Feed the model prose, not structured data. Natural language produced clearer differences between candidates and caught the ones that structured data missed.
- A more sophisticated model doesn't mean better results. A re-ranking model trained on web search made the hiring pipeline worse. Test on your actual problem before adding complexity.
- At small scale, the technique is optional. At large scale, it's the only option — without it, the LLM can't process all the candidates at once.
Footnotes
[^1]: sentence-transformers is an open-source Python library from HuggingFace that wraps popular embedding models (like `all-MiniLM-L6-v2` or `all-mpnet-base-v2`) and runs them locally — no API key, no data leaving your machine. Basic usage:

    ```python
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [
        "Senior Python engineer with FastAPI and PostgreSQL",
        "Frontend developer with React and TypeScript",
    ]
    embeddings = model.encode(texts)  # returns numpy array shape (2, 384)
    print(embeddings.shape)  # (2, 384)
    ```

    The returned array has one row per input text. Each row is a fixed-length vector — same length regardless of how long the original text was. You can then compute cosine similarity between any two rows to measure how related those texts are.
[^2]: Chroma is a lightweight, in-process vector database. It runs embedded inside your Python process — no separate server, no Docker, no config. You create a collection, add documents with their embeddings, and query by similarity:

    ```python
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.Client()
    collection = client.create_collection("candidates")

    texts = [
        "Senior Python engineer, FastAPI, PostgreSQL, 6yr",
        "Frontend developer, React, TypeScript, 3yr",
    ]
    ids = ["candidate-1", "candidate-2"]
    collection.add(
        documents=texts,
        embeddings=model.encode(texts).tolist(),
        ids=ids,
    )

    results = collection.query(
        query_embeddings=model.encode(["Senior backend engineer"]).tolist(),
        n_results=1,
    )
    print(results["ids"])  # [['candidate-1']]
    ```

    For production use, `chromadb.PersistentClient(path="./db")` saves the collection to disk so you don't re-embed on every restart.
[^3]: RAG (Retrieval-Augmented Generation) is a pattern for working around LLM context window limits. Instead of passing everything to the model, you retrieve only the relevant subset first, then inject that subset into the prompt as context:

    ```python
    def screen_candidates(job_description: str, all_candidates: list[dict]) -> str:
        # Step 1 — retrieve: find the most relevant candidates via vector search
        top_candidates = vector_search(job_description, all_candidates, top_k=6)

        # Step 2 — augment: build a prompt with only those candidates as context
        context = "\n\n".join(format_candidate(c) for c in top_candidates)
        prompt = f"""You are a technical recruiter. Given this job description:

    {job_description}

    Evaluate these candidates and rank them by fit:

    {context}"""

        # Step 3 — generate: call the LLM with the reduced context
        return llm.complete(prompt)
    ```

    Without the retrieval step, sending 500 candidate profiles to the LLM in a single request would exceed the context window and fail. The retrieval step converts an impossible request into a feasible one — the LLM only ever sees a manageable slice of the full dataset.