AI Video Evals Reimagined: Braintrust’s Engineering Approach to AI Development
At the recent AI Engineer World’s Fair, Braintrust’s CEO shared five essential lessons for developing AI applications that perform reliably in production. The key takeaway: successful AI requires a solid engineering mindset, especially in how models are evaluated and improved.
Evaluations aren't just checkboxes; they must be built to mirror real-world performance. As Braintrust points out, “The most important property of a good dataset is that you can reconcile it with reality.” This means moving beyond synthetic data and regularly integrating real user feedback. Complaints and issues become data points that help shape meaningful evaluation metrics.
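As a concrete illustration, here is a minimal sketch of turning flagged user interactions into evaluation cases. The feedback records and field names are hypothetical; adapt them to however your application actually logs complaints and corrections.

```python
# Sketch: convert real user feedback into eval dataset rows (hypothetical schema).
import json

feedback_log = [
    {"input": "Summarize this contract", "model_output": "(model output text)", "user_flag": "missed the termination clause"},
    {"input": "Translate to French", "model_output": "(model output text)", "user_flag": None},
]

def to_eval_cases(records):
    """Keep only interactions users flagged, so the dataset reflects real failures."""
    cases = []
    for r in records:
        if r["user_flag"]:
            cases.append({
                "input": r["input"],
                "expected": None,  # filled in later by a human reviewer
                "metadata": {"complaint": r["user_flag"], "bad_output": r["model_output"]},
            })
    return cases

with open("eval_dataset.jsonl", "w") as f:
    for case in to_eval_cases(feedback_log):
        f.write(json.dumps(case) + "\n")
```

Each flagged interaction becomes a dataset row that can later be labeled and reused, which is what keeps the evaluation set reconcilable with reality rather than purely synthetic.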
Evaluation should be proactive—used to discover new use cases and anticipate model behavior, not just to verify past performance. A mature evaluation system enables teams to deploy updates with new models within a day, keeping products agile and up to date.
From Prompt Engineering to Context Engineering
The focus is shifting from simple prompt tweaks to optimizing the entire context fed into large language models (LLMs). This "context engineering" includes clearly defined tools and their outputs. Braintrust’s data shows that 67.6% of tokens in a typical prompt come from tool responses rather than system prompts or tool definitions.
This means how tools are structured and how their outputs are formatted can greatly affect LLM understanding. Even small changes—like switching from JSON to YAML—can have significant effects on model performance.
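To see why the serialization format matters, here is a small sketch that renders the same invented tool response as JSON and as YAML before it would be placed in the model’s context. The tool output is made up; the point is that the format of tool responses is itself a context-engineering decision.

```python
# Sketch: compare JSON vs. YAML renderings of a tool response for the model's context.
import json
import yaml  # pip install pyyaml

tool_response = {
    "query": "order status",
    "results": [
        {"order_id": "A-1042", "status": "shipped", "eta_days": 2},
        {"order_id": "A-1043", "status": "processing", "eta_days": 5},
    ],
}

as_json = json.dumps(tool_response, indent=2)
as_yaml = yaml.safe_dump(tool_response, sort_keys=False)

# Compare how much of the context budget each representation consumes.
print(f"JSON: {len(as_json)} chars\n{as_json}\n")
print(f"YAML: {len(as_yaml)} chars\n{as_yaml}")
```

Because tool responses dominate the token budget, even a formatting change like this can shift both cost and how well the model parses the information.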
Building Agility with Model-Agnostic Systems
AI models evolve fast. A feature that performed at 10% with GPT-4o jumped to 58% with Claude 4 Sonnet. Such leaps require systems that aren’t tied to a single model. Developers need the ability to swap and test new models quickly without large-scale code rewrites.
This agility ensures teams can leverage advances in AI promptly, keeping products competitive as new models emerge.
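One common way to get this agility is a thin model-agnostic layer: application code calls a single interface and the concrete provider is chosen by configuration. The sketch below is illustrative only; the client classes are stubs standing in for real provider SDK calls, and the registry names are assumptions.

```python
# Sketch: a model-agnostic layer so swapping models is a config change, not a rewrite.
from dataclasses import dataclass
from typing import Protocol

class ModelClient(Protocol):
    def complete(self, prompt: str) -> str: ...

@dataclass
class OpenAIClient:
    model: str
    def complete(self, prompt: str) -> str:
        # In a real system this would call the OpenAI API with self.model.
        return f"[{self.model}] response to: {prompt}"

@dataclass
class AnthropicClient:
    model: str
    def complete(self, prompt: str) -> str:
        # In a real system this would call the Anthropic API with self.model.
        return f"[{self.model}] response to: {prompt}"

REGISTRY = {
    "gpt-4o": lambda: OpenAIClient("gpt-4o"),
    "claude-sonnet-4": lambda: AnthropicClient("claude-sonnet-4"),
}

def get_client(name: str) -> ModelClient:
    return REGISTRY[name]()

# Re-running an evaluation against a newly released model is now a one-line change.
client = get_client("claude-sonnet-4")
print(client.complete("Summarize the meeting notes."))
```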
Introducing Braintrust’s Loop: Holistic Evaluation Optimization
Braintrust’s new Loop feature addresses the need for end-to-end evaluation improvement. Instead of optimizing only prompts, Loop allows simultaneous auto-optimization of prompts, datasets, and scoring methods. This holistic approach delivers far better results.
For example, a benchmark showed improvements from 8.9% when only prompts were optimized to 39.14% when all components were optimized together. Loop enables fast, deliberate iteration to keep AI aligned with model updates and user expectations.
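For context, the three components Loop works across map onto the familiar structure of a Braintrust eval: a dataset, a task (the prompt plus model call), and scorers. The sketch below is based on Braintrust’s public Python SDK quickstart pattern; the project name, data, and task are invented, and exact signatures should be checked against the current docs.

```python
# Sketch: the dataset / task / scorer components that holistic optimization tunes together.
from braintrust import Eval
from autoevals import Levenshtein

def task(input: str) -> str:
    # Stand-in for the real prompt + model call under evaluation.
    return "Hi " + input

Eval(
    "greeting-quality",  # hypothetical project name
    data=lambda: [
        {"input": "Foo", "expected": "Hi Foo"},
        {"input": "Bar", "expected": "Hi Bar"},
    ],
    task=task,
    scores=[Levenshtein],  # string-similarity scorer from autoevals
)
```

Optimizing only the task (the prompt) leaves the dataset and scorers fixed; Loop’s premise is that tuning all three together is what produced the larger benchmark gains cited above.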
- Build evaluation datasets that reflect actual user interactions.
- Use evaluation systems to explore new use cases and predict outcomes.
- Focus on optimizing the entire context, including tool outputs.
- Adopt model-agnostic frameworks for faster integration of new models.
- Apply holistic optimization with tools like Braintrust’s Loop for continuous improvement.
For those in AI development, this approach offers practical guidance on creating AI applications that are both effective and adaptable. To deepen your skills in prompt and context engineering, check out Complete AI Training’s prompt engineering courses.