Evaluating large language models (LLMs) can be a daunting task, and when it comes to agentic systems, the complexity increases exponentially. In this second part of our community series with Arize AI, we will explore why traditional LLM evaluation metrics fall short when applied to agents and introduce modern LLM evaluation techniques that are built for this new paradigm.
From code-based evaluations to LLM-driven assessments, human feedback, and benchmarking your metrics, this session will equip you with the tools and practices you need to assess agent behavior effectively. You will also get hands-on experience with Arize Phoenix and learn how to run your own LLM evaluations using both ground truth data and LLMs.
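To give a flavor of the kind of LLM-as-judge evaluation the session covers, here is a minimal sketch using the open-source Phoenix evals library. It is not the session's exact code: the judge model, sample data, and ground-truth labels below are illustrative assumptions, and it presumes the `arize-phoenix-evals` and `pandas` packages plus an OpenAI API key are available.

```python
# A minimal sketch of an LLM-as-judge evaluation with Phoenix evals
# (pip install arize-phoenix-evals openai pandas). The sample data,
# judge model, and ground-truth labels are illustrative; OPENAI_API_KEY
# is assumed to be set in the environment.
import pandas as pd
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Each row pairs a user query, the retrieved reference text, and the
# agent's answer -- the columns the hallucination template expects.
df = pd.DataFrame(
    {
        "input": ["What year was Arize AI founded?"],
        "reference": ["Arize AI was founded in 2020."],
        "output": ["Arize AI was founded in 2020."],
    }
)

# Use an LLM as the judge, constrained to the template's allowed labels.
judge = OpenAIModel(model="gpt-4o-mini", temperature=0.0)
rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

results = llm_classify(
    dataframe=df,
    model=judge,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=rails,
    provide_explanation=True,  # ask the judge to justify each label
)
print(results[["label", "explanation"]])

# With ground-truth labels available, the judge's outputs can be
# benchmarked directly, e.g. simple agreement rate.
ground_truth = ["factual"]
agreement = (results["label"] == pd.Series(ground_truth)).mean()
print(f"agreement with ground truth: {agreement:.0%}")
```

The last two lines show the ground-truth side of the workflow: once the judge's labels land in a DataFrame, benchmarking them against labeled data is ordinary code-based evaluation.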
What We Will Cover:
Ready for Part 3 of the series? Find it here!