Shipping an AI agent is the easy part. Trusting it in production is where most teams struggle. In this session, Lead AI Architect Kwasi Ankomah breaks down the evaluation discipline that separates demos from production-ready AI agents – built live on a real system and run on traces and telemetry.
Demos are deceptive. An AI agent that performs flawlessly in a controlled environment can quietly break in production the moment a tool errors, a prompt shifts, or a subagent delegates incorrectly. Most teams still ship on intuition – and pay the price.
This session is the finale of our multi-session agentic AI series. In Session 5, we scaled agents using a supervisor-subagent architecture. In this final session, Kwasi Ankomah – Lead AI Architect at SambaNova Systems with 15 years of cross-industry experience – walks you through the end-to-end evaluation discipline that leading teams use to build production-ready AI agents, grounded in traces and telemetry.
Multi-step, non-deterministic systems cannot be unit-tested like a function, and the evaluation mindset has to shift accordingly. This session covers why traditional software tests fail for agents, and what to use instead.
You will walk through the four evaluator types every production-ready AI agent needs – rule-based, LLM-as-a-judge, trajectory, and recovery-from-failure – and learn how to combine them into a scorecard and regression gate that runs automatically in CI on every prompt, model, tool, or architecture change.
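As a rough illustration of how the four evaluator types might feed a scorecard and a regression gate, here is a minimal sketch. Everything in it is an assumption made for the example rather than the method shown in the session: the `Trace` record, the "must cite a source" rule, the tool names, and the 0.02 tolerance are all hypothetical, and the LLM-as-a-judge step is a stub that a real setup would replace with a model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trace:
    """Hypothetical, simplified record of one agent run."""
    final_answer: str
    tool_calls: list[str]    # tool names in call order
    tool_errors: list[bool]  # True where the matching call failed


def rule_based(trace: Trace) -> float:
    # Deterministic check (illustrative rule): the answer must cite a source.
    return 1.0 if "source:" in trace.final_answer.lower() else 0.0


def trajectory(trace: Trace, expected: list[str]) -> float:
    # Fraction of expected tools found, in order, within the actual calls.
    calls = iter(trace.tool_calls)
    hits = sum(1 for tool in expected if tool in calls)
    return hits / len(expected) if expected else 1.0


def recovery(trace: Trace) -> float:
    # 1.0 only if every failed call is later retried successfully.
    steps = list(zip(trace.tool_calls, trace.tool_errors))
    for i, (tool, failed) in enumerate(steps):
        if failed and not any(t == tool and not f for t, f in steps[i + 1:]):
            return 0.0
    return 1.0


def scorecard(traces, expected, judge: Callable[[Trace], float]) -> dict[str, float]:
    # Mean score per evaluator; `judge` stands in for an LLM-as-a-judge call.
    rows = [
        {"rule_based": rule_based(t), "llm_judge": judge(t),
         "trajectory": trajectory(t, expected), "recovery": recovery(t)}
        for t in traces
    ]
    return {name: sum(r[name] for r in rows) / len(rows) for name in rows[0]}


def regression_gate(baseline, candidate, tolerance=0.02) -> list[str]:
    # Names of metrics that dropped by more than `tolerance`; empty means pass.
    return [m for m, base in baseline.items()
            if candidate.get(m, 0.0) < base - tolerance]


# Two made-up runs: one recovers from a tool error, one does not.
good = Trace("Refund approved. Source: policy 4.2",
             ["lookup_order", "lookup_order", "check_policy"],
             [True, False, False])
bad = Trace("Refund approved.", ["check_policy"], [True])
stub_judge = lambda t: 1.0  # placeholder; a real judge would call a model
expected_tools = ["lookup_order", "check_policy"]

baseline = scorecard([good], expected_tools, stub_judge)
candidate = scorecard([good, bad], expected_tools, stub_judge)
failed = regression_gate(baseline, candidate)
print(failed)
```

In a CI job, a script along these lines could exit with a non-zero status whenever the returned list is non-empty, so that a prompt, model, tool, or architecture change that lowers any score blocks the merge.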
The session also covers the state-of-the-art workflow on LangSmith, including datasets, experiments, and trace-based evaluation that surfaces failure-and-retry sequences inside subagents. For teams on the open-source path, Kwasi walks through Langfuse and OpenTelemetry as a full observability alternative. The session closes with online evaluation and pass^k reliability – how to score live traffic and measure how often your agent succeeds on every one of k repeated attempts at the same task, rather than on a single lucky run.
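To make pass^k concrete: it is usually read as the probability that an agent succeeds on all k independent attempts at the same task. The sketch below uses one common estimator, C(c, k) / C(n, k) for c successes out of n recorded trials, averaged across tasks. The function names and the sample numbers are hypothetical, and the session may compute the metric differently.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k attempts at one task all succeed.

    Uses comb(successes, k) / comb(trials, k): the fraction of size-k
    subsets of the recorded trials that contain only successes.
    """
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[list[bool]], k: int) -> float:
    # Average the per-task estimate over all tasks in the dataset.
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)


# An agent that passes 8 of 10 trials looks strong at k=1 and weaker as k grows.
for k in (1, 2, 4):
    print(k, round(pass_hat_k(8, 10, k), 3))
```

The point of the metric shows up in the loop: a task passed 8 times out of 10 scores 0.8 at k=1 but about 0.62 at k=2 and about 0.33 at k=4, which is why a single successful run says little about reliability.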
Multi-agent systems are non-deterministic by nature. The same input can produce different intermediate steps, different tool calls, and different outputs across runs. That variability is what makes agents powerful – and what makes traditional QA completely inadequate for them.
The gap between a working demo and a reliable system in production is wider than most teams expect. Without a structured evaluation framework, there is no way to know whether a change to a prompt, a model swap, or a new tool integration has quietly degraded performance somewhere in the pipeline.
The teams shipping production-ready AI agents are not guessing. They run structured evaluations across datasets, catch regressions before they reach users, and monitor live traffic with observability tooling. Without this discipline, every deployment is a risk. You can explore related reading on the Data Science Dojo blog for primers on LangGraph, LangSmith, and building robust AI pipelines, and the LangChain blog for further context on multi-agent orchestration patterns.
This webinar is built for practitioners actively building or preparing to deploy production-ready AI agents – AI and ML engineers working with LangGraph, CrewAI, or similar frameworks, data scientists and architects responsible for production LLM systems, and technical leads evaluating agent observability and CI tooling. If you have ever pushed an agent to production and wondered whether it would hold up under real conditions, this session is for you. Prior exposure to supervisor-subagent patterns is helpful but not required.
Kwasi Ankomah is the Lead AI Architect at SambaNova Systems, specializing in deep agent architectures, multi-agent orchestration, and context engineering, with 15 years of experience building production AI systems across financial services, consulting, government, and tech.
