#testing
My LLM Pipeline Passed Every Manual Check — Then 36 Tests Proved Otherwise
Five manual runs looked fine. Then 36 automated tests exposed non-deterministic sourcing, biased scoring, and a confidence threshold that fired randomly.