The Unspoken Truth of Building AI Products That Actually Work
“Please no more evals.” This simple but pointed request from Ben Hylak, CTO of Raindrop, cut straight to a key issue at the AI Engineer World’s Fair in San Francisco. Alongside Sid Bendre, co-founder of Oleve, he revealed a hard truth: many AI demos look impressive, but real-world AI products often fail to perform as expected.
Ben Hylak, whose company Raindrop offers “Sentry for AI products” to detect and fix issues, teamed up with Sid Bendre from Oleve, a company known for scaling viral consumer AI apps. Their discussion focused on the challenge of moving beyond proofs-of-concept to building AI products that scale and sustain performance. The secret? Continuous iteration using real-world data rather than relying solely on theoretical evaluations.
Why Traditional Evaluations Fall Short
The AI space is exciting but unpredictable. Even leaders like OpenAI release products with flaws. Hylak shared examples where OpenAI’s Codex generated poor tests and where Grok produced bizarre hallucinations on sensitive topics.
This highlights a key point: “More capable = more undefined behavior.” As AI models grow smarter, they also become less predictable. Increasing intelligence doesn’t mean fewer errors; it often means new, unexpected failure modes.
Relying on traditional “evals” to measure AI product quality is misleading. As Hylak put it, “They tell you how good your product is. They’re not.” This aligns with Goodhart’s Law, which warns that when a metric becomes a target, it loses its value as a true measure.
OpenAI itself admits that their evaluation methods can’t catch every problem. For example, they note: “Our evals won’t catch everything... for more subtle or emerging issues, like changes in tone or style, real-world use helps us spot problems and understand what matters most to users.”
Focusing on Real-World Signals Instead
The key to building AI products that work lies in capturing continuous, authentic signals from users. These signals go beyond simple metrics and include:
- Explicit feedback: User actions like thumbs up/down, content copying, or sharing.
- Implicit cues: Behavioral patterns such as signs of frustration, task failures, or AI “laziness.”
By analyzing these signals, teams can detect issues that traditional evaluations miss. This approach helps product teams identify pain points and prioritize fixes based on real user behavior, not just lab tests.
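To make the signal taxonomy concrete, here is a minimal sketch of how a team might log and aggregate these signals. All names (`UserSignal`, `frustration_rate`, the specific signal labels) are hypothetical illustrations, not anything Raindrop or Oleve actually ships:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical signal record; field names are illustrative only.
@dataclass
class UserSignal:
    session_id: str
    kind: str          # "explicit" (thumbs up/down, copy, share) or "implicit" (retry, abandon)
    name: str          # e.g. "thumbs_down", "copied_output", "rapid_retry"
    value: float = 1.0
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def frustration_rate(signals: list[UserSignal]) -> float:
    """Share of implicit signals that suggest frustration (retries, abandoned tasks)."""
    implicit = [s for s in signals if s.kind == "implicit"]
    if not implicit:
        return 0.0
    frustrated = [s for s in implicit if s.name in {"rapid_retry", "abandoned_task"}]
    return len(frustrated) / len(implicit)
```

A metric like this, tracked continuously in production, surfaces the "frustration" and "laziness" patterns described above that a pre-launch eval suite would never see.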
How Oleve Scales with a Signal-Driven Approach
Oleve’s lean four-person team has grown to $6 million in annual recurring revenue and half a billion social media views by embracing this iterative method. Sid Bendre emphasized that AI is inherently chaotic and non-deterministic. Their solution? A framework called Trellis, which doesn’t try to control AI’s chaos but guides it.
Trellis works by breaking down AI outputs into manageable “buckets” and prioritizing workflows based on their impact on business goals. This impact is calculated using factors like volume, negative sentiment, achievable improvements, and strategic importance. The workflows are then refined recursively, making AI behavior more predictable and manageable.
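The prioritization step above can be sketched in a few lines. The bucket names, numbers, and the multiplicative scoring form are assumptions for illustration; the talk names the factors (volume, negative sentiment, achievable improvement, strategic importance) but not the exact formula:

```python
# Hypothetical workflow buckets; values are invented for illustration.
buckets = {
    "homework_help":  {"volume": 12000, "neg_sentiment": 0.18, "achievable_gain": 0.5, "strategic": 1.0},
    "essay_feedback": {"volume": 3000,  "neg_sentiment": 0.35, "achievable_gain": 0.7, "strategic": 0.8},
}

def impact_score(b: dict) -> float:
    """One plausible scoring: traffic x pain x fixability x business weight."""
    return b["volume"] * b["neg_sentiment"] * b["achievable_gain"] * b["strategic"]

# Work the highest-impact bucket first, then recurse into its sub-workflows.
ranked = sorted(buckets, key=lambda k: impact_score(buckets[k]), reverse=True)
```

The point is less the arithmetic than the discipline: every bucket gets a comparable number, so the team always knows which slice of chaotic AI behavior to refine next.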
This approach treats AI features as engineered artifacts: repeatable, testable, and attributable rather than accidental. Success isn't about a perfect static model but about maintaining a dynamic feedback loop that constantly learns from real-world use.
What Product Developers Should Take Away
- Stop relying solely on pre-launch evaluations. They often miss critical issues that only show up in real-world use.
- Collect and analyze both explicit and implicit user signals continuously.
- Use frameworks that structure AI outputs and workflows to make iteration manageable and impact-driven.
- Focus on building feedback loops that allow your AI product to improve over time, adapting to unexpected edge cases.
For product teams aiming to build functional AI applications, success depends on embracing uncertainty and continuously learning from how users interact with the product.
For more practical insights on AI product development and training, visit Complete AI Training’s latest AI courses.