The Best Match Had the Worst Score
After adding evaluation tests to the screening pipeline in the previous article, I needed to improve the sourcing step. Instead of sending all 15 candidates to the LLM, I wanted to narrow down to the most relevant ones first. I added vector search — a way to find candidates whose profiles are most similar to the job description. I tested it with a single JD for a Senior Backend Engineer (Python, PostgreSQL, AWS, Docker, 5-10 years). The system ranked candidates by how closely their profile matched the JD, measured by cosine similarity:
| Rank | Candidate | Similarity | Notes |
|---|---|---|---|
| 1 | 3yr frontend developer | 0.622 | Junior, wrong specialty |
| 2 | 6yr Python/FastAPI engineer | 0.601 | Correct match |
| 3 | 6yr Python/Go engineer | 0.593 | Correct match |
| 9 | 9yr Principal Engineer | 0.458 | Most qualified — missed |
The junior developer's profile mentions "Python", "AWS", "backend services" — words that appear verbatim in the JD. High vocabulary overlap, high cosine similarity, wrong answer.
The Principal Engineer's profile has "Kafka", "Cassandra", "Microservices", "1B+ events/day" — equivalent or superior capabilities described in different vocabulary. The embedding model didn't know that processing a billion events a day is more impressive than listing the same keywords as the JD.
Embeddings matched vocabulary, not capability. At scale, this silently filters out strong candidates who describe their experience differently from the JD.
How Embeddings Work
The idea behind vector search is simple: convert text into a list of numbers that represents its meaning. Two pieces of text with similar meaning produce similar numbers. The system compares those numbers instead of matching keywords.
| Profile | Meaning (simplified) | Match to "Python backend engineer"? |
|---|---|---|
| "Python backend engineer with FastAPI" | [0.12, 0.84, ...] | 1.00 — exact match |
| "Senior developer working on payment APIs" | [0.15, 0.79, ...] | 0.89 — very similar |
| "WordPress freelance developer" | [0.91, 0.03, ...] | 0.12 — not similar |
The first two profiles describe related roles — their numbers are close. The third is a different domain entirely. No rules needed — the model learns what "similar" means from the text alone.
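To make the comparison concrete, here is a minimal sketch using sentence-transformers and its cosine similarity helper. The profile strings are the illustrative examples from the table; actual scores will differ from the simplified numbers shown above.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "Python backend engineer"
profiles = [
    "Python backend engineer with FastAPI",
    "Senior developer working on payment APIs",
    "WordPress freelance developer",
]

# Encode the query and each profile into fixed-length vectors
query_vec = model.encode(query)
profile_vecs = model.encode(profiles)

# Cosine similarity: closer to 1.0 means more similar meaning
scores = util.cos_sim(query_vec, profile_vecs)[0]
for profile, score in zip(profiles, scores):
    print(f"{float(score):.2f}  {profile}")
```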
The Pipeline I Built
The implementation uses sentence-transformers[^1] to convert text to numbers and Chroma[^2] as the database. Converting all 15 candidates takes 0.14 seconds. Finding the closest matches takes about 1ms — the entire pre-filtering step is nearly instant compared to the LLM call it feeds into.
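For reference, a rough sketch of how those timings can be reproduced (the candidate texts are placeholders and the numbers will vary by machine; the library details are in the footnotes):

```python
import time

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
candidates = [f"Candidate {i}: Python backend engineer, {i + 1} years" for i in range(15)]

# Step 1: convert all candidate profiles to vectors
start = time.perf_counter()
embeddings = model.encode(candidates)
print(f"encode 15 candidates: {time.perf_counter() - start:.2f}s")

# Step 2: load them into an in-memory Chroma collection
collection = chromadb.Client().create_collection("candidates")
collection.add(
    ids=[str(i) for i in range(15)],
    documents=candidates,
    embeddings=embeddings.tolist(),
)

# Step 3: find the closest matches (query embedding computed outside the timer)
query_vec = model.encode(["Senior backend engineer, Python, PostgreSQL"]).tolist()
start = time.perf_counter()
collection.query(query_embeddings=query_vec, n_results=6)
print(f"query: {(time.perf_counter() - start) * 1000:.1f}ms")
```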
Prose Over JSON
One design decision worth noting: I convert each candidate's data into a natural language paragraph rather than feeding the raw structured data. The model was trained on sentences, not data formats.
JSON:

```json
{"years_experience": 6, "skills": ["Python", "Go"], "role": "Senior Engineer"}
```

Prose: "Senior Software Engineer with 6 years of experience at a fintech company. Skills: Python, Go, Kubernetes, PostgreSQL. Led payment processing redesign serving 10M+ transactions/day."
I tested this — prose produced clearer differences between candidates and caught the Principal Engineer that structured data missed.
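A minimal sketch of that conversion step, assuming hypothetical field names (the real pipeline's schema may differ):

```python
def profile_to_prose(candidate: dict) -> str:
    # Hypothetical helper: render a structured record as a natural
    # language paragraph before embedding it
    skills = ", ".join(candidate["skills"])
    parts = [
        f"{candidate['role']} with {candidate['years_experience']} years of experience.",
        f"Skills: {skills}.",
    ]
    if candidate.get("highlight"):
        parts.append(candidate["highlight"])
    return " ".join(parts)

candidate = {
    "years_experience": 6,
    "skills": ["Python", "Go"],
    "role": "Senior Engineer",
    "highlight": "Led payment processing redesign serving 10M+ transactions/day.",
}
print(profile_to_prose(candidate))
# Senior Engineer with 6 years of experience. Skills: Python, Go. Led payment ...
```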
Pre-filter, Not Judge
After seeing the junior developer at #1 and the Principal Engineer at #9, the temptation was to tune the embedding — try a bigger model, add more context, re-weight the dimensions. But the deeper lesson is about the role of embeddings in the pipeline.
The right mental model:
Wrong: 15 candidates → vector search top-6 → shortlist
(search makes the hiring decision)
Right: 500 candidates → vector search top-20 → LLM → shortlist
(search narrows the field, LLM makes the decision)
Vector search doesn't need to be perfect. It needs to be:
- Better than random. It is — the strongest candidates appear consistently in the top results.
- Fast. About 1ms per query, regardless of pool size.
- Scalable. Same speed at 15 or 15,000 candidates.
What I Tested
The initial results used all-MiniLM-L6-v2 — a small, fast model optimized for speed over accuracy. Before accepting the ranking as a fundamental limitation, I ran three experiments to see what could improve it.
Does a Better Model Fix It?
I swapped to all-mpnet-base-v2 — a larger model trained on more text with higher dimensional output (768 vs 384):
| Rank | all-MiniLM-L6-v2 | all-mpnet-base-v2 |
|---|---|---|
| 1 | 3yr frontend dev | 6yr Python/FastAPI engineer |
| 2 | 6yr Python/FastAPI engineer | Staff Engineer, 8yr |
| 3 | 6yr Python/Go engineer | 6yr Python/Go engineer |
| 6 | 4yr data engineer | 9yr Principal Engineer |
The junior developer who ranked #1 dropped out entirely. The Principal Engineer who was missed at #9 surfaced at #6.
Result: Model choice had a bigger impact on correctness than any other variable I tested.
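Swapping models is a one-line change in sentence-transformers. A minimal sketch of the comparison, with abbreviated stand-ins for the candidate profiles (scores and orderings will vary):

```python
from sentence_transformers import SentenceTransformer, util

jd = "Senior Backend Engineer: Python, PostgreSQL, AWS, Docker, 5-10 years"
profiles = [
    "Frontend developer, 3 years, React, some Python and AWS exposure",
    "Senior Python engineer, 6 years, FastAPI, PostgreSQL, AWS",
    "Principal Engineer, 9 years, Kafka, Cassandra, microservices, 1B+ events/day",
]

for model_name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]:
    model = SentenceTransformer(model_name)
    # Rank all profiles against the JD by cosine similarity
    scores = util.cos_sim(model.encode(jd), model.encode(profiles))[0]
    ranked = sorted(zip(profiles, scores), key=lambda pair: -float(pair[1]))
    print(model_name)
    for profile, score in ranked:
        print(f"  {float(score):.3f}  {profile}")
```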
Does Query Phrasing Matter?
I compared the raw JD (specific tool names like "FastAPI", "PostgreSQL", "Kafka") against a hand-written summary using generic phrases like "cloud infrastructure" and "senior level."
| | Specific JD | Vague summary |
|---|---|---|
| Top 6 candidates | Correct matches surfaced | Same 6, different order |
| #1 ranked | 6yr Python/FastAPI engineer | Management-focused candidate |
| Similarity scores | Higher across the board | All scores dropped |
A management-focused candidate rose to #1 under the vague summary because "senior level" matched her vocabulary. Less specific vocabulary means less signal.
Result: Model choice changed which candidates appeared. Query phrasing only changed the ordering. Model matters more.
Does Adding More Layers Fix It?
I added two more steps on top of vector search: keyword matching (to catch exact term overlaps that vector search might miss) and a re-ranking model (to re-score the combined results more carefully).
| Layer | Effect | Verdict |
|---|---|---|
| Vector search alone | Missed one relevant candidate | Baseline |
| + Keyword matching | Surfaced the missed candidate | Helped |
| + Re-ranking model | Dropped Principal Engineer, brought back 3yr frontend dev | Made it worse |
The re-ranking model was trained on web search questions, not hiring. It scores "does this text answer this query?" not "is this person qualified?"
Result: A more sophisticated model doesn't mean better results when it's trained on the wrong problem.
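For reference, the re-ranking layer looked roughly like this. The checkpoint name is an assumption on my part: the standard MS MARCO cross-encoder from sentence-transformers, which is exactly the web-search training data described above.

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint: trained on MS MARCO web search relevance, not hiring
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

jd = "Senior Backend Engineer: Python, PostgreSQL, AWS, Docker, 5-10 years"
candidates = [
    "Frontend developer, 3 years, React, some Python and AWS exposure",
    "Principal Engineer, 9 years, Kafka, Cassandra, 1B+ events/day",
]

# The cross-encoder reads query and document together and scores
# "does this text answer this query?" for each pair
scores = reranker.predict([(jd, c) for c in candidates])
for candidate, score in sorted(zip(candidates, scores), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {candidate}")
```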
Why This Matters at Scale
LLMs have a limit on how much text they can process at once (the context window). Instead of sending all candidates, RAG[^3] sends only the most relevant ones:
| Approach | Candidates Sent to LLM | Outcome |
|---|---|---|
| Send all (15 candidates) | 15 | fits in context window |
| Search first, send top 6 | 6 | 61% less text to process |
| Send all (500 candidates) | 500 | context window overflow — fails |
| Search first, send top 10 | 10 | 98% less text — works |
At 15 candidates, it barely matters. At 500, vector search becomes the only option — without it, the pipeline can't run at all.
What I'd Change Structurally
- Combine keyword matching with vector search. Keyword matching catches exact term overlaps that vector search misses. Together they surface more relevant candidates — but keyword results need filtering to avoid false positives on common terms.
- Skip the re-ranking model unless it's trained on the right problem. A re-ranking model trained on web search made the hiring pipeline worse. Let the LLM make the final judgment instead.
- Save the converted data. Avoid re-processing all candidates on every run — convert once, store, and reuse; see the sketch after this list.
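A minimal sketch of that last point, using Chroma's on-disk client (the path, collection, and helper names are placeholders):

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# PersistentClient writes the collection to disk, so it survives restarts
client = chromadb.PersistentClient(path="./candidate_db")
collection = client.get_or_create_collection("candidates")

def index_candidate(candidate_id: str, prose: str) -> None:
    # Skip candidates that are already stored; only embed new ones
    if collection.get(ids=[candidate_id])["ids"]:
        return
    collection.add(
        ids=[candidate_id],
        documents=[prose],
        embeddings=model.encode([prose]).tolist(),
    )
```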
Takeaways
- Model choice matters most. A larger model fixed the false #1 and surfaced the missed Principal Engineer. No amount of query tuning or text formatting compensated for a weaker model.
- Vector search narrows the field. The LLM makes the decision. Don't use search results as the final answer — use them to decide what the LLM should evaluate.
- Feed the model prose, not structured data. Natural language produced clearer differences between candidates and caught the ones that structured data missed.
- A more sophisticated model doesn't mean better results. A re-ranking model trained on web search made the hiring pipeline worse. Test on your actual problem before adding complexity.
- At small scale, the technique is optional. At large scale, it's the only option — without it, the LLM can't process all the candidates at once.
Footnotes
[^1]: sentence-transformers is an open-source Python library from HuggingFace that wraps popular embedding models (like `all-MiniLM-L6-v2` or `all-mpnet-base-v2`) and runs them locally — no API key, no data leaving your machine. Basic usage:

    ```python
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [
        "Senior Python engineer with FastAPI and PostgreSQL",
        "Frontend developer with React and TypeScript",
    ]
    embeddings = model.encode(texts)  # returns numpy array shape (2, 384)
    print(embeddings.shape)  # (2, 384)
    ```

    The returned array has one row per input text. Each row is a fixed-length vector — same length regardless of how long the original text was. You can then compute cosine similarity between any two rows to measure how related those texts are.
[^2]: Chroma is a lightweight, in-process vector database. It runs embedded inside your Python process — no separate server, no Docker, no config. You create a collection, add documents with their embeddings, and query by similarity:

    ```python
    import chromadb
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    client = chromadb.Client()
    collection = client.create_collection("candidates")

    texts = [
        "Senior Python engineer, FastAPI, PostgreSQL, 6yr",
        "Frontend developer, React, TypeScript, 3yr",
    ]
    ids = ["candidate-1", "candidate-2"]
    collection.add(
        documents=texts,
        embeddings=model.encode(texts).tolist(),
        ids=ids,
    )

    results = collection.query(
        query_embeddings=model.encode(["Senior backend engineer"]).tolist(),
        n_results=1,
    )
    print(results["ids"])  # [['candidate-1']]
    ```

    For production use, `chromadb.PersistentClient(path="./db")` saves the collection to disk so you don't re-embed on every restart.
[^3]: RAG (Retrieval-Augmented Generation) is a pattern for working around LLM context window limits. Instead of passing everything to the model, you retrieve only the relevant subset first, then inject that subset into the prompt as context:

    ```python
    def screen_candidates(job_description: str, all_candidates: list[dict]) -> str:
        # Step 1 — retrieve: find the most relevant candidates via vector search
        top_candidates = vector_search(job_description, all_candidates, top_k=6)

        # Step 2 — augment: build a prompt with only those candidates as context
        context = "\n\n".join(format_candidate(c) for c in top_candidates)
        prompt = f"""You are a technical recruiter. Given this job description:

    {job_description}

    Evaluate these candidates and rank them by fit:

    {context}"""

        # Step 3 — generate: call the LLM with the reduced context
        return llm.complete(prompt)
    ```

    Without the retrieval step, sending 500 candidate profiles to the LLM in a single request would exceed the context window and fail. The retrieval step converts an impossible request into a feasible one — the LLM only ever sees a manageable slice of the full dataset.