If You Can't Test It, Don't Deploy It: The New Rule of AI Development
Generative AI breaks the old software playbook. You can't step through a call stack, there's no single "correct" answer, and models behave like black boxes. That's why Magdalena Picariello argues for a simple rule: if you can't test it, don't ship it.
She reframes the debate from model accuracy to business outcomes. The point isn't perfection. It's a feedback system that proves value, reduces risk, and gets you to impact faster.
Why traditional debugging fails with LLMs
- No ground truth: Many tasks don't have a single right answer. "Good enough" depends on context.
- Black box behavior: You see input and output. The "why" is opaque. Prompt tweaks fix one case and break another.
- Non-determinism and drift: Same prompt, slightly different results across time, data, and model versions.
The fix: Treat GenAI like a testable system
LLM testing ("evals") is your visibility layer. It lets you try ideas fast, compare models, and gate releases with evidence. In one project Magdalena shared, hundreds of prompt experiments led to a simple three-word change that unlocked major gains: ~900k CHF saved per year, 10k hours saved, and a 34% productivity boost. That result wasn't luck-it was testing velocity.
Start with users, not models: build a coverage matrix
- Collect real queries: Pull them from logs, support tickets, or user interviews.
- Segment the space: Slice by dimensions such as user type (new vs. returning), intent (billing, product, technical), and channel (site, app).
- Score by business value: Multiply frequency by impact (revenue, retention, cost-to-serve, risk).
- Prioritize test cases: Build evals for the highest "frequency x value" segments first, then cover critical outliers.
Outliers matter. If 1 in 10,000 queries is worth enterprise-level revenue (e.g., a bulk buyer at a wine fair), you build a test for it. Volume isn't the only signal; value is.
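As a rough illustration, here is a minimal Python sketch of that frequency-times-value prioritization. The segment names, volumes, and per-query values are made up.

```python
from dataclasses import dataclass

@dataclass
class QuerySegment:
    name: str                # e.g. "returning user / billing / app"
    frequency: int           # queries per month, pulled from logs
    value_per_query: float   # estimated impact per query, in CHF

    @property
    def priority(self) -> float:
        # The coverage-matrix heuristic: frequency x value
        return self.frequency * self.value_per_query

# Hypothetical segments from logs and user interviews
segments = [
    QuerySegment("new user / product question / site", 12_000, 0.5),
    QuerySegment("returning user / billing dispute / app", 800, 15.0),
    QuerySegment("bulk buyer / enterprise quote / email", 3, 5_000.0),  # rare, high-value outlier
]

# Build evals for the top segments first, then cover the critical outliers anyway
for seg in sorted(segments, key=lambda s: s.priority, reverse=True):
    print(f"{seg.name:45} priority={seg.priority:>10,.0f}")
```

Score it however you like; the point is that prioritization is explicit and reviewable, not vibes.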
Measure success with KPIs, not just accuracy
- Revenue influenced, conversion rate, lead quality
- Time saved, ticket deflection, resolution rate
- Latency SLOs, cost per call, containment
- Policy and safety pass rates
Technical metrics (BLEU, ROUGE, cosine similarity) are useful proxies. Business KPIs decide what ships.
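To make that split concrete, here is a small sketch where a crude proxy (bag-of-words cosine similarity standing in for an embedding-based score) sits next to the business signals that actually decide the ship call. The answers, KPI fields, and thresholds are made up.

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Crude bag-of-words cosine similarity: a cheap technical proxy, not a KPI."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

# Hypothetical eval record combining a proxy score with business signals
record = {
    "proxy_similarity": cosine_similarity(
        "You can cancel your plan from the billing page.",
        "Go to the billing page and click cancel to end your plan.",
    ),
    "ticket_deflected": True,   # business KPI from production logs
    "latency_ms": 820,          # checked against the latency SLO
}

# The proxy guides iteration; the KPIs decide whether this version ships.
ships = record["ticket_deflected"] and record["latency_ms"] < 1500
print(round(record["proxy_similarity"], 2), ships)
```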
A practical workflow you can copy
- Define "good": Write a rubric per use case. Convert subjective expectations into numeric scores.
- Assemble a test set: Real queries + edge cases + high-value scenarios. Include negative tests and policy checks.
- Automate scoring: Mix methods such as exact match, regex, semantic similarity, classifier-as-judge, and guardrails (see the sketch after this list).
- Run experiments fast: System prompts, retrieval variants, model versions, context windows, temperature.
- Gate releases: Ship only if evals meet KPI thresholds. Track regression over time.
- Monitor and learn: Feed production logs back into the eval set. Iterate weekly, not quarterly.
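Here is a minimal sketch of the "automate scoring" and "gate releases" steps, assuming hypothetical prompts, scorers, and thresholds. The keyword and regex checks are deliberately simple; swap them for semantic similarity or a classifier-as-judge where the rubric demands it.

```python
import re
from typing import Callable

# Each scorer returns a value in [0, 1]; the cases and thresholds are hypothetical.
def exact_match(expected: str) -> Callable[[str], float]:
    return lambda output: 1.0 if output.strip() == expected else 0.0

def matches_pattern(pattern: str) -> Callable[[str], float]:
    return lambda output: 1.0 if re.search(pattern, output) else 0.0

def covers_keywords(keywords: list[str]) -> Callable[[str], float]:
    return lambda output: sum(k.lower() in output.lower() for k in keywords) / len(keywords)

test_cases = [
    # (prompt, scorer, minimum passing score)
    ("How many days is the refund window? Reply with the number only.", exact_match("30"), 1.0),
    ("What is your refund policy?", matches_pattern(r"\b30 days\b"), 1.0),
    ("Summarize our shipping options.", covers_keywords(["free", "express", "tracking"]), 0.67),
]

def run_evals(generate: Callable[[str], str]) -> float:
    """Run every test case against a model-calling function and return the pass rate."""
    passed = sum(scorer(generate(prompt)) >= threshold for prompt, scorer, threshold in test_cases)
    return passed / len(test_cases)

# Release gate: ship only if the pass rate clears the agreed KPI-backed threshold.
# pass_rate = run_evals(call_my_model); assert pass_rate >= 0.90
```

The structure is what matters: versioned cases, explicit thresholds, and a single pass-rate number you can gate on.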
Keep a human in the loop
Generate candidates with AI if you want. But a human must define the rubric, label gold samples, and validate that "7/10 helpful" actually reflects business value. That human judgment becomes code.
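One way that judgment "becomes code" is a rubric plus human-labeled gold samples checked into the repo. The schema below is illustrative, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    name: str
    description: str    # written and owned by a human reviewer
    weight: float       # how much this criterion matters to business value

@dataclass
class GoldSample:
    query: str
    reference_answer: str
    human_score: float  # e.g. 7.0 out of 10, assigned by a domain expert

# Hypothetical rubric for a support assistant
rubric = [
    RubricCriterion("correctness", "Facts match the current policy docs.", weight=0.5),
    RubricCriterion("actionability", "The user knows the next step to take.", weight=0.3),
    RubricCriterion("tone", "Polite, concise, on-brand.", weight=0.2),
]

gold_samples = [
    GoldSample(
        "Can I change my delivery address?",
        "Yes, go to Orders, open the order, and edit the address before dispatch.",
        human_score=9.0,
    ),
]

# Any automated judge (classifier or LLM-as-judge) gets validated against these
# human labels before its scores are trusted inside a release gate.
```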
Model choice is an implementation detail
Stop chasing hype. Abstract the model behind an interface and let the tests decide. If latency is the problem, try a smaller model. If quality is lacking on high-value cases, upgrade. Swap, run evals, compare. Decision made.
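Here is a sketch of what "abstract the model behind an interface" can look like in Python. The provider classes are placeholders for whatever SDKs you actually call, and the eval function is a stub for the suite described above.

```python
from typing import Protocol

class TextModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class SmallFastModel:
    """Placeholder for a smaller, cheaper model client."""
    def generate(self, prompt: str) -> str:
        return "stub answer"  # call the provider SDK here

class LargeQualityModel:
    """Placeholder for a larger, higher-quality model client."""
    def generate(self, prompt: str) -> str:
        return "stub answer"  # call the provider SDK here

def eval_pass_rate(model: TextModel) -> float:
    """Run the same eval suite against any model behind the interface."""
    # e.g. return run_evals(model.generate) from the workflow sketch above
    return 0.0

candidates = {"small-fast": SmallFastModel(), "large-quality": LargeQualityModel()}
results = {name: eval_pass_rate(model) for name, model in candidates.items()}

# Swap, run evals, compare: the tests make the decision, not the hype cycle.
best = max(results, key=results.get)
print(best, results[best])
```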
Observe users, not just tokens
- Conversation depth and re-ask rate (did they rephrase the same question 5 times?)
- Abandonment points and escalation to human
- Time to resolution and containment
LLMs are black boxes, but user behavior is loud. Your logs tell you where value is created or lost.
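For example, a re-ask rate and an escalation rate can be computed straight from conversation logs. The log schema and the similarity threshold below are assumptions, not a standard.

```python
from difflib import SequenceMatcher

# Hypothetical conversation log: one dict per user turn
turns = [
    {"session": "a1", "text": "How do I reset my password?", "escalated": False},
    {"session": "a1", "text": "how can i reset the password", "escalated": False},
    {"session": "a1", "text": "password reset still not working", "escalated": True},
    {"session": "b2", "text": "Where is order 1042?", "escalated": False},
]

def is_reask(prev: str, curr: str, threshold: float = 0.6) -> bool:
    """Treat a turn as a re-ask if it closely rephrases the previous question."""
    return SequenceMatcher(None, prev.lower(), curr.lower()).ratio() >= threshold

# Only compare consecutive turns within the same session
pairs = [(a, b) for a, b in zip(turns, turns[1:]) if a["session"] == b["session"]]
reask_rate = sum(is_reask(a["text"], b["text"]) for a, b in pairs) / max(len(pairs), 1)
escalation_rate = sum(t["escalated"] for t in turns) / len(turns)

print(f"re-ask rate: {reask_rate:.0%}, escalation rate: {escalation_rate:.0%}")
```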
Tools that help (pick one and start)
- DeepEval for LLM evals and custom metrics
- MLflow for experiment tracking
- LangFuse, Evidently AI, and Opik for tracing, dashboards, and evaluation workflows
Tooling moves fast. Your competitive edge is the eval strategy, not the brand name.
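As one possible starting point, here is a minimal MLflow sketch for tracking prompt experiments, assuming `pip install mlflow` and a local tracking store. The experiment name, prompt variants, and scores are placeholders; the pass rates would come from your own eval run.

```python
import mlflow

mlflow.set_experiment("support-assistant-evals")

# Hypothetical prompt variants under test
prompt_variants = {
    "baseline": "Answer the customer's question.",
    "concise":  "Answer in two sentences and cite the relevant policy.",
}

for name, system_prompt in prompt_variants.items():
    with mlflow.start_run(run_name=name):
        mlflow.log_param("system_prompt", system_prompt)
        mlflow.log_param("model_version", "model-v1")   # placeholder identifier
        mlflow.log_metric("eval_pass_rate", 0.87 if name == "concise" else 0.74)
        mlflow.log_metric("p95_latency_ms", 1100)
```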
Common pitfalls to avoid
- Shipping without automated evals and release gates
- Optimizing for accuracy while ignoring revenue or risk
- Manual spot checks instead of repeatable tests
- Ignoring rare but high-value scenarios
- No versioning for prompts, data, or models
Quick-start checklist
- Write 10-30 high-value test cases from logs and interviews
- Define pass/fail rules and scoring rubrics per case
- Automate evals in CI and block deploys on regressions (a pytest-style gate is sketched after this checklist)
- Run 50-200 prompt/model experiments and let the data pick winners
- Log production behavior and promote new edge cases into tests
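If your evals run in CI, a pytest-style gate like the sketch below can block deploys on regressions. The threshold, baseline file, and pass-rate stub are hypothetical.

```python
# test_release_gate.py - run by CI; a failing test blocks the deploy
import json
import pathlib

BASELINE_FILE = pathlib.Path("eval_baseline.json")  # versioned alongside prompts

def current_pass_rate() -> float:
    # In practice this calls your eval suite against the candidate build.
    return 0.91  # stub value for illustration

def test_meets_kpi_threshold():
    assert current_pass_rate() >= 0.90, "pass rate below the agreed release threshold"

def test_no_regression_against_baseline():
    baseline = 0.0
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())["pass_rate"]
    assert current_pass_rate() >= baseline - 0.02, "regression against the last shipped baseline"
```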
The mindset shift is simple: your AI doesn't have to be perfect. It has to be testable. If you can measure it against the outcomes your business cares about, you can ship with confidence and improve it every week.
Want structured, practical training?
If you're rolling out LLMs across products and teams, a shared playbook beats ad-hoc hacks. Explore job-focused AI training and certifications here: Complete AI Training - Courses by Job.