Cline’s Ara Khan says agent evals matter most when demos lie
Evals as the control panel for improving AI agents. (AI-generated image / TECH&SPACE)
- Ara Khan describes Cline’s shift from skepticism about evals to using them inside an agent-improvement loop.
- The main point is not that evals are objective truth, but that they give a steadier signal than informal vibe-based testing.
- The value of evals depends on heuristics: how they are run, interpreted, maintained and converted into concrete agent changes.
In a new DeepLearning.AI video, recorded as part of AI Dev 26 x SF, Cline’s Ara Khan takes on one of the least comfortable topics in AI-agent development: evaluations that nobody fully trusts, but serious teams can no longer ignore. The talk’s title, “Evals Are Broken Use Them Anyway,” captures the point cleanly. This is not a defense of evals as sacred metrics. It is a defense of discipline in a field where “it feels better” can quickly become an expensive lie.
Khan starts from a practical reversal: moving from the view that evals are useless to using them as a core part of the agent improvement loop. That matters because the argument is not framed around a lab slide with one heroic benchmark number. It comes from work on a tool that has to survive messy development tasks, user edge cases and model churn. Cline is exactly that kind of tool: a coding agent where small regressions can be the difference between useful assistance and a system that confidently wastes time.
A DeepLearning.AI talk from SF turns frustration with AI-agent measurement into a practical frame: less faith in vibes, more repeatable checks.
The gap between vibes and repeatable signal shows up in task traces. (AI-generated image / TECH&SPACE)
The central claim is straightforward: evals are broken because they never measure the full reality. They can reward the wrong response style, age quickly, miss real user intent or create a false sense of progress. But the alternative is often not better science. It is vibe-driven development. If an agent “feels smarter” after a prompt change, model swap or tool tweak, that does not prove it is more stable, more useful or less likely to regress.
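To make the contrast with vibe-driven development concrete, here is a minimal sketch of a repeatable check. It is not Cline's harness or anything shown in the talk. The `Agent` type, the task names and the `tolerance` threshold are all hypothetical. The idea is only that the same tasks are run several times before and after a change, and per-task pass rates are compared rather than impressions.

```python
from typing import Callable, Dict, List

# Hypothetical: an "agent" here is any callable that takes a task name
# and returns True if that task's check passed on this run.
Agent = Callable[[str], bool]


def pass_rate(agent: Agent, task: str, trials: int) -> float:
    """Fraction of trials in which the agent passed the task."""
    passes = sum(1 for _ in range(trials) if agent(task))
    return passes / trials


def compare(
    baseline: Agent,
    candidate: Agent,
    tasks: List[str],
    trials: int = 5,
    tolerance: float = 0.2,  # assumed noise band; tune per suite
) -> Dict[str, str]:
    """Label each task as improved, regressed or unchanged."""
    report: Dict[str, str] = {}
    for task in tasks:
        before = pass_rate(baseline, task, trials)
        after = pass_rate(candidate, task, trials)
        if after < before - tolerance:
            report[task] = "regressed"
        elif after > before + tolerance:
            report[task] = "improved"
        else:
            report[task] = "unchanged"
    return report
```

A change that "feels smarter" on one task while quietly regressing another shows up in this report as a mixed result, which is exactly the case informal testing tends to miss.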
That is why Khan focuses on heuristics: how to interpret eval results, when to run them, how to create them and why to keep using them anyway. This is closer to editorial judgment than pure automation. An eval is not a verdict; it is a signal that has to be read in context. If one benchmark improves while real workflows break, the number is decoration. If an eval repeatedly catches the same class of failure, even an imperfect metric becomes operationally valuable.
The broader AI ecosystem has been moving toward treating agents as systems to be tested, not just models to be scored. Projects such as OpenAI Evals helped popularize repeatable checks, but agents add extra layers: tools, memory, file systems, chains of decisions and user goals that can shift during the task. A single right-or-wrong answer is not enough of a measurement surface.
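One way to picture a wider measurement surface is to score the whole trace of a run, not only its final answer. The sketch below is illustrative and assumes a hypothetical `Trace` record with made-up fields and checks. It does not reflect the OpenAI Evals API or Cline's internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Trace:
    """Hypothetical record of one agent run."""
    tool_calls: List[str] = field(default_factory=list)
    files_changed: List[str] = field(default_factory=list)
    tests_passed: bool = False


def score_trace(
    trace: Trace, allowed_files: List[str], max_tool_calls: int
) -> Dict[str, bool]:
    """Return one boolean per check instead of a single pass/fail."""
    return {
        # Did the end result satisfy the task's tests?
        "tests_passed": trace.tests_passed,
        # Did the agent only touch files it was expected to touch?
        "stayed_in_scope": all(
            path in allowed_files for path in trace.files_changed
        ),
        # Did it finish without exceeding the tool-call budget?
        "within_tool_budget": len(trace.tool_calls) <= max_tool_calls,
    }
```

A run can pass its tests and still fail `stayed_in_scope`, which is the kind of result a single right-or-wrong score would hide.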
The most useful part of Khan’s message is its lack of hype. Evals are not a magic dashboard for truth. They are working instruments for teams trying to know whether they actually improved something or merely produced a better demo. In 2026, as agents move into everyday development workflows, that difference is no longer academic. It is engineering hygiene.

