
Part 2: How Do You Evaluate Agents? | Evaluating AI Agents with Arize AI

Agenda

Learn Core Techniques to Evaluate AI Agents with LLM Evaluation

Evaluating large language models (LLMs) can be a daunting task, and with agentic systems the complexity grows further: an agent must be judged not only on its final answer but also on the path it took, the tools it called, and the decisions it made along the way. In this second part of our community series with Arize AI, we will explore why traditional LLM evaluation metrics fall short when applied to agents and introduce modern LLM evaluation techniques built for this new paradigm.

From code-based evaluations to LLM-driven assessments, human feedback, and benchmarking your metrics, this session will equip you with the necessary tools and practices to assess agent behavior effectively. You will also get hands-on experience with Arize Phoenix and learn how to run your own LLM evaluations using both ground truth data and LLMs.
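
To give a flavor of what that looks like in code, here is a minimal, hypothetical sketch (not taken from the session materials): it wraps an LLM-as-judge call with the OpenAI Python client and measures how often the judge agrees with a tiny hand-labeled ground truth set. The judge prompt, model name, and example data are assumptions made for illustration.

```python
# Hypothetical sketch (not session material): benchmark an LLM judge against
# a small hand-labeled ground truth set and report how often they agree.
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def judge_answer(question: str, answer: str) -> str:
    """LLM-as-judge: returns 'correct' or 'incorrect' for a question/answer pair."""
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: correct or incorrect."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Tiny hand-labeled ground truth set (illustrative only).
ground_truth = [
    {"question": "What is 2 + 2?", "answer": "4", "label": "correct"},
    {"question": "What is the capital of France?", "answer": "Berlin", "label": "incorrect"},
]

# Benchmark: agreement between the LLM judge and the human labels.
agreement = sum(
    judge_answer(ex["question"], ex["answer"]) == ex["label"] for ex in ground_truth
) / len(ground_truth)
print(f"Judge agrees with ground truth on {agreement:.0%} of examples")
```

In practice you would run this over a larger labeled set and iterate on the judge prompt until its agreement with human labels is high enough to trust it on unlabeled agent runs.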

What We Will Cover:

  • Why standard metrics like BLEU, ROUGE, or even hallucination detection aren’t sufficient for evaluating agents.
  • Core evaluation methods for agents: code-based evaluations, LLM-driven assessments, human feedback and labeling, and ground truth comparisons.
  • How to write high-quality LLM evaluations that align with real-world tasks and expected outcomes.
  • Building and benchmarking LLM evaluations using ground truth data to validate their effectiveness.
  • Best practices for capturing telemetry and instrumenting evaluations at scale.
  • How the OpenInference standard (where applicable) can improve interoperability and consistency across systems.
  • Hands-on Exercise: Judge a sample agent run using both code-based and LLM-based evaluations with Arize Phoenix (a minimal sketch follows this list).
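
To preview the hands-on exercise, here is a hedged sketch of what judging a single agent run can look like: a deterministic, code-based check on which tool the agent called, followed by an LLM-based judgment run through Phoenix's llm_classify evaluator. The toy agent run, column names, judge prompt, and model choice are illustrative assumptions rather than the actual exercise, and exact phoenix.evals parameter names may vary by version.

```python
# Illustrative sketch only: the agent run below is made up, and the session
# exercise may differ. Assumes pandas, arize-phoenix (with its evals package),
# and an OpenAI API key are available.
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify

# A toy "agent run": the user's question, the tool the agent chose, and its answer.
run = {
    "question": "What is the weather in Paris today?",
    "tool_called": "get_weather",
    "answer": "It is 18°C and partly cloudy in Paris right now.",
}

# 1) Code-based evaluation: deterministic checks that need no LLM at all.
def called_expected_tool(run: dict, expected_tool: str = "get_weather") -> bool:
    return run["tool_called"] == expected_tool

print("code-based check passed:", called_expected_tool(run))

# 2) LLM-based evaluation: ask a judge model whether the answer addresses the question.
JUDGE_TEMPLATE = """You are evaluating an AI agent's response.
Question: {question}
Response: {answer}
Does the response directly address the question? Answer with exactly one word:
"relevant" or "irrelevant"."""

results = llm_classify(
    dataframe=pd.DataFrame([run]),           # template placeholders map to column names
    template=JUDGE_TEMPLATE,
    model=OpenAIModel(model="gpt-4o-mini"),   # placeholder judge model
    rails=["relevant", "irrelevant"],         # constrain the judge's output labels
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```

In a real workflow you would pull agent spans from Phoenix's tracing rather than hand-writing the run dictionary, but the shape of the evaluation stays the same.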


Ready for Part 3 of the series? Find it here!

John Gilhuly

Head of Developer Relations at Arize AI

John is the Head of Developer Relations at Arize AI, focused on open-source LLM observability and evaluation tooling. He holds an MBA from Stanford, where he specialized in the ethical, social, and business implications of AI development, and a B.S. in Computer Science from Duke. Prior to joining Arize, John led go-to-market (GTM) activities at Slingshot AI and served as a venture fellow at Omega Venture Partners. In his pre-AI life, he built out and ran technical go-to-market teams at Branch Metrics.

RSVP