InfoQ’s AI platform lesson: agents need tests before they need more autonomy
A reliable AI platform shown as a control room for agents, tools, and evaluations. 📷 AI-generated image / TECH&SPACE
- ★Erickson separates deterministic tools for certainty from agents used for open-ended discovery.
- ★Reliability requires an evaluation pyramid, not manual inspection of a few convincing outputs.
- ★Multi-agent hierarchies and time-series models only make sense with clear metrics, boundaries, and production monitoring.
The distinction between deterministic tools and agents matters. Tools are the right shape when a system must reliably call an API, enforce a rule, format an output, or keep a decision inside known boundaries. Agents make sense when the problem is not fully specified: when the system needs to explore paths, compare hypotheses, plan steps, or pull signal from messy context. A weak platform calls everything an agent. A serious platform knows where the agent stops and ordinary, testable software begins.
In practice, that means an AI workflow is not a chain of magical prompts. It is a system with control points. Deterministic guardrails can include schema validation, tool permissions, output constraints, error tracking, and explicit escalation rules. That naturally connects AI reliability to observability practices such as those described in the OpenTelemetry documentation, because a production AI failure is not only about model accuracy. It is also about latency, cost, state, regressions, and behavior over time.
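To make the idea of control points concrete, here is a minimal sketch of a deterministic guardrail that checks an agent-proposed tool call before anything executes. It is not taken from the presentation: the tool names, the `ALLOWED_TOOLS` table, the `Verdict` type, and the refund threshold are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

# Hypothetical allow-list: tool name -> required argument names and types.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id": str},
    "issue_refund": {"order_id": str, "amount": float},
}
REFUND_ESCALATION_LIMIT = 100.0  # assumed threshold, illustration only


@dataclass
class Verdict:
    action: str  # "allow", "reject", or "escalate"
    reason: str


def check_tool_call(call: dict) -> Verdict:
    """Validate an agent-proposed tool call before it is executed."""
    tool = call.get("tool")
    args = call.get("args")

    # Tool permission: only allow-listed tools may run.
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        return Verdict("reject", f"tool not permitted: {tool!r}")

    # Schema validation: exact argument names and expected types.
    if not isinstance(args, dict):
        return Verdict("reject", "args must be a mapping")
    schema = ALLOWED_TOOLS[tool]
    if set(args) != set(schema):
        return Verdict("reject", "unexpected or missing arguments")
    for name, expected in schema.items():
        if not isinstance(args[name], expected):
            return Verdict("reject", f"{name} must be {expected.__name__}")

    # Explicit escalation rule: large refunds go to a human reviewer.
    if tool == "issue_refund" and args["amount"] > REFUND_ESCALATION_LIMIT:
        return Verdict("escalate", "refund exceeds limit")

    return Verdict("allow", "ok")
```

Every branch here is ordinary, testable code: the agent may propose anything, but only calls that pass these checks would reach a real API. The type check is deliberately strict (an integer `amount` is rejected), which is a design choice of this sketch rather than a requirement.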
Aaron Erickson’s InfoQ presentation frames reliable AI systems as a mix of deterministic guardrails, agentic discovery, and a rigorous evaluation pyramid.
Deterministic guardrails catch agent output before production. 📷 AI-generated image / TECH&SPACE
Erickson also focuses on multi-agent hierarchies. They are not useful because they sound sophisticated; they are useful when they divide a problem into concrete roles: planner, executor, critic, evaluator, or a specialized agent for a domain signal. But the risk is just as real. Every extra agent adds another surface for failure, cost, and unpredictability. If the hierarchy lacks clear inputs, outputs, metrics, and stopping conditions, it only produces a more expensive version of the same disorder.
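A small sketch can show what clear inputs, outputs, and stopping conditions look like for a planner, executor, and critic. This is an illustrative assumption, not Erickson's design: `run_hierarchy`, its parameters, and the stub agents are hypothetical, and the "agents" are plain functions standing in for model calls.

```python
from typing import Callable, Tuple

Planner = Callable[[str, str], str]              # (task, feedback) -> plan
Executor = Callable[[str], Tuple[str, float]]    # plan -> (result, cost)
Critic = Callable[[str, str], Tuple[bool, str]]  # (task, result) -> (accepted, feedback)


def run_hierarchy(task: str, planner: Planner, executor: Executor,
                  critic: Critic, max_rounds: int = 3,
                  max_cost: float = 10.0) -> dict:
    """Run plan -> execute -> critique until accepted or a limit is hit."""
    cost = 0.0
    feedback = ""
    for round_no in range(1, max_rounds + 1):
        plan = planner(task, feedback)
        result, step_cost = executor(plan)
        cost += step_cost
        # Stopping condition 1: a hard budget, checked every round.
        if cost > max_cost:
            return {"status": "budget_exceeded", "rounds": round_no,
                    "cost": cost, "result": None}
        accepted, feedback = critic(task, result)
        # Stopping condition 2: the critic accepts the result.
        if accepted:
            return {"status": "accepted", "rounds": round_no,
                    "cost": cost, "result": result}
    # Stopping condition 3: the round limit is reached without acceptance.
    return {"status": "max_rounds", "rounds": max_rounds,
            "cost": cost, "result": None}


# Stub agents for illustration; real ones would call models or tools.
def stub_planner(task: str, feedback: str) -> str:
    return f"{task} | {feedback}"


def stub_executor(plan: str) -> Tuple[str, float]:
    return plan.upper(), 1.0


def stub_critic(task: str, result: str) -> Tuple[bool, str]:
    return ("FIX" in result, "fix")
```

With these stubs, the first round is rejected and the second is accepted once the critic's feedback reaches the planner. The returned `status`, `rounds`, and `cost` fields are the kind of metrics that make the hierarchy observable instead of opaque.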
Another important thread is the use of time-series foundation models. In that setting, the AI platform is not only processing a user’s text request. It is also reading patterns over time: operational signals, anomalies, historical trends, and shifts in behavior. That opens useful scenarios for forecasting and monitoring, but it also raises the evaluation bar because a model can look convincing while missing the rare events that matter most in production.
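The point about rare events can be illustrated with a few lines of arithmetic. The sketch below uses synthetic labels invented purely for illustration (100 time steps, 2 anomalies) and hypothetical helper names; it does not reproduce any result from the presentation.

```python
def accuracy(y_true, y_pred):
    """Fraction of time steps where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def anomaly_recall(y_true, y_pred):
    """Fraction of true anomalies (label 1) that were flagged."""
    positives = sum(y_true)
    if positives == 0:
        return None  # recall is undefined when there are no anomalies
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return caught / positives


# Synthetic data, illustration only: 100 steps with 2 rare anomalies.
labels = [0] * 100
labels[40] = 1
labels[90] = 1

# A "model" that never flags anything still looks convincing on accuracy.
always_normal = [0] * 100

print(accuracy(labels, always_normal))        # 0.98
print(anomaly_recall(labels, always_normal))  # 0.0
```

A headline metric of 98% hides the fact that every anomaly was missed, which is why an evaluation for this kind of model would need a metric aimed at the rare class and not only at the overall fit.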
That is why the evaluation pyramid is the center of Erickson’s argument. At the base are small, frequent, cheap checks: parsing, format, rules, and tool calls. Above them are scenarios, regression sets, simulations, and behavior comparisons across versions. At the top are more expensive end-to-end evaluations tied to actual product outcomes. The same basic logic appears in systematic evaluation efforts such as the OpenAI Evals repository and in documented agent-graph patterns like LangGraph’s multi-agent concepts.
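The lower layers of such a pyramid can be sketched as ordered tiers of checks, where cheap checks gate the more expensive ones. This is a minimal sketch under stated assumptions: the tier names, the check functions, the `run_pyramid` helper, and the expected output shape (JSON with `answer` and `sources`) are all hypothetical. The top tier, end-to-end evaluation against product outcomes, is left out because it cannot be reduced to a local function.

```python
import json


def parses_as_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def has_required_fields(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and {"answer", "sources"} <= data.keys()


def cites_a_source(output: str) -> bool:
    # Runs only after the base tier passed, so the shape is already known.
    sources = json.loads(output)["sources"]
    return isinstance(sources, list) and len(sources) > 0


# Ordered from cheap and frequent to more expensive.
TIERS = [
    ("base", [parses_as_json, has_required_fields]),
    ("scenarios", [cites_a_source]),
]


def run_pyramid(output: str, tiers) -> dict:
    """Run tiers in order and stop after the first tier with failures."""
    report = {}
    for name, checks in tiers:
        failed = [check.__name__ for check in checks if not check(output)]
        report[name] = failed
        if failed:
            break  # do not spend on higher tiers if a cheaper one fails
    return report
```

For example, `run_pyramid('not json', TIERS)` reports failures only in the `base` tier and never reaches `scenarios`. Stopping early is one possible policy; a team could just as well run every tier and aggregate the results.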
The useful point in this presentation is not a promise that one more framework will solve production AI. It is the sharper engineering boundary: an AI platform must preserve room for discovery without outsourcing reliability to improvisation. Use agents where exploration is the job, and use deterministic tools where certainty is required. If that boundary is blurry, scaling only makes the failure mode louder.

