For a hands-on learning experience in developing agentic AI applications, join our Agentic AI Bootcamp today and take advantage of the early bird discount.

Artificial intelligence is evolving at breakneck speed and nowhere is this transformation more evident than at the Agentic AI Conference 2025. This global event is more than just a gathering; it’s a vibrant hub where visionaries, practitioners, and enthusiasts unite to shape the future of intelligent agents. Whether you’re a seasoned AI professional or just beginning your journey, the Agentic AI Conference offers a front-row seat to groundbreaking ideas, hands-on learning, and unparalleled networking. With every session, you’ll discover new strategies, connect with industry leaders, and leave inspired to push the boundaries of what’s possible in agentic AI. 

May 2025 Conference Recap 

Held virtually from May 27–28, 2025, the Agentic AI Conference brought together a diverse audience of researchers, practitioners, and industry leaders to explore the rapidly growing field of agentic AI. The event provided a platform for both cutting-edge research and hands-on learning, making advanced concepts accessible to participants worldwide.

A Global Gathering of Innovators 

The May 2025 Agentic AI Conference drew over 51,000 participants from more than 120 countries, reflecting the surging global interest in agentic AI. Joined by top companies such as LlamaIndex, AWS, Microsoft, Weaviate, Neo4j, and Arize, the event featured 20+ expert speakers and 10+ interactive sessions, all delivered virtually for maximum accessibility.

Session Highlights 

Sessions in the May 2025 Agentic AI Conference balanced technical depth with practical application. Attendees explored frameworks, planning strategies, and memory architectures powering today’s advanced AI agents. Tutorials provided hands-on experience, from optimizing agents with Amazon Bedrock to automating workflows with Gemini. 

Achievements and Feedback 

The May 2025 Agentic AI Conference received enthusiastic praise from attendees, who highlighted its strong organization and practical value. As one participant shared,

“Attending the Future of Data and AI Conference was an eye-opening experience that truly exceeded my expectations. The sessions were a perfect blend of visionary thinking and practical insights, covering everything from responsible AI and model governance to cutting-edge advancements in generative AI and autonomous systems.”

while another remarked,

“The Future of Data and AI Conference was an incredibly insightful experience. The sessions were packed with valuable information, covering everything from cutting-edge AI technologies to their real-world applications. I especially enjoyed the interactive workshops and networking opportunities, which allowed me to connect with experts and peers. Overall, it was a great opportunity to expand my knowledge and gain fresh perspectives on AI and data science.”

The hands-on tutorials were especially appreciated, with feedback such as,

“What stood out most was the conference’s commitment to practical, real-world applications—bridging the gap between strategy and execution. From cutting-edge demos of generative AI to thought-provoking panels on data ethics and governance, every session was packed with actionable insights.”

September 2025 Conference Preview 

What’s New for September? 

The Agentic AI Conference 2025 returns September 15-19, 2025, with an expanded agenda and even more opportunities for learning and networking. The event remains virtual, ensuring accessibility for participants worldwide. Registration is now open—secure your spot here. 

Detailed Session Descriptions 

Future of Data and AI: Agentic AI Conference 2025 Schedule

Panels (September 15) 

The Agentic AI Conference 2025 panels kick off the event, bringing together leading experts to discuss key challenges and opportunities in agentic AI.

Designing Intelligent Agents

Go beyond surface-level discussions of AI by diving into the cognitive building blocks of intelligent agents. This panel explores memory, planning, reasoning, and adaptability—key aspects of how agents operate in dynamic environments. Expert speakers will share insights into how these foundations translate into real-world systems, offering strategies for creating agents that are not only context-aware but also capable of evolving over time.

Architecting Scalable Multi-Agent Workflows

As organizations move from single agents to interconnected systems, scalability becomes a defining challenge. This panel addresses methods for orchestrating multi-agent workflows across enterprise environments. From communication protocols to coordination strategies, you’ll learn how multiple agents can collaborate seamlessly, enabling large-scale deployments that support complex business processes and mission-critical applications.

Managing Security and Governance in MCP Deployment

Deploying Model Context Protocol (MCP) introduces powerful capabilities but also new governance responsibilities. This panel brings together thought leaders to discuss compliance, trust, and security in the era of agentic AI. Topics include implementing guardrails, building observability into agent workflows, and conducting responsible evaluation. Attendees will leave with a roadmap for deploying MCP systems that balance innovation with accountability.

Tutorials (September 16) 

Tutorials offer practical, step-by-step guidance on building and deploying intelligent agents. In the Agentic AI Conference 2025, these sessions are ideal for deepening technical skills and applying new concepts in real-world scenarios. 

From Data to Agents: Building GraphRAG Systems with Neo4j

This hands-on tutorial walks you through building Retrieval-Augmented Generation (RAG) systems that combine graph databases with unstructured data. Using Neo4j, you’ll learn to model relationships, connect data sources, and power agents that reason more intelligently about context. The session is ideal for anyone looking to build data-driven agents with richer reasoning capabilities.

Vision-Enabled Agents with Haystack

Push the boundaries of what agents can do by giving them sight. In this tutorial, you’ll learn how to build multimodal agents that can process and interpret images alongside text. Using Haystack, you’ll implement pipelines for visual search, recognition, and analysis, opening the door to applications in fields like healthcare, manufacturing, and content moderation.

Agentic Research Assistants with Reka

Research workflows can be time-consuming, but agentic AI can change that. This tutorial provides a blueprint for creating intelligent research assistants that automate literature reviews, summarize findings, and synthesize insights. Using Reka, you’ll explore how to design agents that support academics, analysts, and enterprises with faster, more efficient knowledge discovery.

Event-Driven Agents with GitHub Webhooks

Learn to build agents that don’t just respond to queries but act on real-world triggers. This tutorial demonstrates how to connect GitHub webhooks to create event-driven agents that respond to commits, pull requests, and issue tracking. The result: AI-enhanced workflows that boost developer productivity and streamline collaboration.

Additional Tutorials (AWS, Ejento AI, Landing AI)

Beyond the core sessions, the Agentic AI Conference 2025 offers specialized tutorials led by AWS, Ejento AI, and Landing AI. These deep dives cover advanced techniques and real-world case studies, ensuring that both beginners and seasoned practitioners can expand their skillsets with the latest agentic AI practices.

Workshops (September 17-19) 

The Agentic AI Conference 2025 Workshops provide in-depth, instructor-led training on advanced agentic AI topics. These immersive sessions blend theory and practice, allowing participants to work on real-world projects and engage directly with industry experts. 

Visualizing Transformer Models with Luis Serrano

Go beyond the black-box perception of transformers by learning how to visualize and interpret their inner workings. This workshop, led by AI educator Luis Serrano, breaks down attention mechanisms, embeddings, and hidden states into intuitive visuals. You’ll not only understand how transformers process sequences but also gain hands-on skills to create your own visualizations, helping you explain model behavior to both technical and non-technical audiences.

Building AI Agents with Vector Databases (Weaviate)

Modern AI agents rely on efficient knowledge retrieval to act intelligently. In this workshop, you’ll explore how vector databases like Weaviate can store and query high-dimensional embeddings for real-time reasoning. Learn how to connect agents with memory systems, implement semantic search, and design recommendation workflows. By the end, you’ll have a working agent that leverages vector databases for smarter and more contextual decision-making.

Agentic AI for Semantic Search (Pinecone)

Search is evolving from keyword matching to semantic understanding, and agents are leading that shift. This workshop with Pinecone focuses on deploying AI-powered agents that perform semantic search across unstructured text, images, and more. Through guided exercises, you’ll learn how to set up Pinecone indexes, integrate them into agent pipelines, and optimize for latency and accuracy. Walk away ready to build intelligent, search-driven applications that feel responsive and context-aware.

Smarter Agents, Faster (Arize AX)

When agents move from prototypes to production, speed and reliability become critical. This workshop introduces best practices for scaling agent performance using Arize AX. Learn how to instrument your agents with monitoring tools, debug common issues in real-world deployments, and apply optimization techniques that make them more responsive under load. By the end, you’ll have the tools to confidently deploy robust, high-performing agents in enterprise settings.

Workshop Value:

Each workshop in the Agentic AI Conference 2025 is interactive and hands-on, featuring live sessions, personalized Q&A, and direct feedback from instructors. Participants receive downloadable materials, access to recordings, and a certificate of completion, making these sessions an invaluable investment in professional development.

Why Attend the Agentic AI Conference 2025? 

Attending the Agentic AI Conference 2025 is more than just a learning opportunity; it’s a chance to join a thriving, international community of AI innovators. The event’s blend of expert-led sessions, hands-on tutorials, and immersive workshops ensures that every participant leaves with new skills and valuable connections.

  • Learn from leading AI experts and practitioners 
  • Gain practical skills through interactive sessions 
  • Network with peers from around the world 
  • Access exclusive giveaways and professional development resources 

Registration Details & Important Dates 

Future of Data and AI: Agentic AI Conference 2025 - Important Dates

Getting started is easy. Visit the Agentic AI Conference page to explore ticket options and secure your spot. Free tickets provide access to panels and tutorials, while paid upgrades unlock premium workshops and additional benefits. 

Panels: September 15 

Tutorials: September 16 

Paid Workshops: September 17-19 

Frequently Asked Questions (FAQ) 

Q1. What is Agentic AI?

Agentic AI refers to artificial intelligence systems designed to act autonomously, make decisions, and interact intelligently with their environment. These agents are capable of learning, adapting, and responding to complex scenarios, making them invaluable in a wide range of applications. 

Q2. How do I register for the conference?

Registration is straightforward. Simply visit the conference registration page and follow the instructions to select your ticket type and complete your registration. You’ll receive updates and access details via email. 

Q3. Are workshops included in the free ticket?

Panels and tutorials are free for all attendees, providing access to a wealth of knowledge and networking opportunities. Workshops, however, require a paid upgrade, which unlocks additional benefits such as live instructor-led sessions, downloadable materials, and certificates of completion. 

Q4. Who should attend the Agentic AI Conference 2025?

The conference is ideal for AI professionals, data scientists, developers, researchers, and anyone interested in the future of intelligent agents. Whether you’re a seasoned expert or just starting out, you’ll find sessions tailored to your interests and experience level. 

Conclusion

The Agentic AI Conference 2025 stands at the forefront of innovation in intelligent agents and artificial intelligence. Whether you’re looking to deepen your expertise, expand your network, or gain hands-on experience, this event offers something for everyone. Don’t miss your chance to be part of the next wave of AI advancement: register today and join a global community shaping the future of agentic AI.


If you’ve spent any time building or even casually experimenting with agentic AI systems, tools are probably the first thing that come to mind. Over the past year, tools have gone from being a nice-to-have to the default abstraction for extending large language models beyond text. They are the reason agents can browse the web, query databases, run code, trigger workflows, and interact with real-world systems.

This shift didn’t happen quietly. It fundamentally changed how we think about language models. A model that can call tools is no longer just predicting the next token. It is orchestrating actions. It is deciding when it lacks information, when it needs to delegate work to an external system, and how to integrate the response back into its reasoning. Standards like Model Context Protocol (MCP) accelerated this shift by making tool definitions portable and structured, so agents could reliably talk to external capabilities without brittle prompt hacks.

Get a deeper look at MCP—an increasingly important standard for structured interaction between agents and tools.

But as tools matured, something interesting started happening in the background. People kept running into the same friction points, even with powerful tools at their disposal. Agents could do things, but they still struggled with how to think about doing them well. That gap is where agent skills enter the picture.

Rather than replacing tools, agent skills address a different layer of the problem entirely. They focus on reasoning patterns, reusable cognitive workflows, and behavioral structure—things that tools were never designed to handle.

From Tools to Thinking Patterns

To see why agent skills were even needed, it helps to look at how most agents were being built before the concept existed. A typical setup looked something like this: a system prompt describing the agent’s role, a list of available tools, and a large blob of instructions explaining how the agent should approach problems.

Over time, those instruction blocks grew longer and more complex. Developers added planning steps, verification loops, fallback strategies, and safety checks. Entire mini-algorithms were embedded directly into prompts. If you’ve ever copied a carefully tuned “reasoning scaffold” from one project to another, you’ve already felt this pain.

The problem was not that this approach didn’t work. It did. The problem was that it didn’t scale.

Every new agent reimplemented the same patterns. Every update required editing massive prompts. Small inconsistencies crept in, and behavior diverged across agents that were supposed to be doing the same thing. Tools solved external capability reuse, but there was no equivalent abstraction for internal reasoning reuse.

This is exactly the class of problems agent skills were designed to solve.

The Introduction of Agent Skills by Anthropic

Anthropic introduced Claude Agent Skills
source: Anthropic

Anthropic formally introduced agent skills on October 16, 2025, as part of their broader work on making Claude more modular, composable, and agent-friendly. The timing was not accidental. By then, it was clear that serious agent builders were no longer asking, “Can my model call tools?” They were asking, “How do I make my agent reliable, consistent, and reusable across contexts?”

Agent skills reframed agent development around reusable cognitive components. Instead of embedding reasoning logic directly into every prompt, you could define a skill once and attach it to any agent that needed that capability. This marked a shift in how agents were written, tested, and evolved over time.

Importantly, agent skills were not positioned as a replacement for tools. They were introduced as a complementary abstraction—one that sits between raw prompting and external tool execution.

Explore how recursive language models help maintain context over long or complex chains of reasoning—central to advanced agent behavior.

Why Tools and Agent Skills Are Fundamentally Different

At a conceptual level, the difference between tools and agent skills comes down to where they operate.

Tools operate outside the model. They are external functions or services that the model can invoke. Their inputs and outputs are structured, and their behavior is deterministic from the model’s perspective. When a tool is called, the model pauses, waits for the result, and then continues reasoning.

Agent skills, on the other hand, operate inside the model’s reasoning loop. They shape how the agent plans, evaluates, and makes decisions. They do not fetch new information from the world. Instead, they constrain and guide the model’s internal process.

You can think of the distinction like this:

  • Tools extend capability
  • Agent skills extend competence

A tool lets an agent access a database. An agent skill teaches the agent how to decide when to query, what to query for, and how to validate the result.

This difference is subtle, but once you see it, you can’t unsee it.

The Core Problem Agent Skills Solve

At its core, the problem agent skills solve is not about capability, but about structure. Modern agents are already powerful. They can reason, call tools, and generate complex outputs. What they lack is a consistent, reusable way to apply that reasoning across different contexts, agents, and products.

Without agent skills, every agent becomes a bespoke construction. Two agents designed to do “research” might both work, but each will interpret planning, verification, and decision-making slightly differently. These differences are not always obvious, but they accumulate. Over time, systems become harder to reason about, harder to maintain, and harder to trust.

Most teams try to solve this by writing longer and longer prompts. Planning logic, fallback strategies, validation steps, and domain-specific heuristics all get embedded directly into system instructions. This works in the short term, but it creates a fragile setup where reasoning patterns are duplicated, inconsistently updated, and difficult to audit.

To make this more concrete, consider a research agent tasked with answering technical questions. Ideally, you want the agent to:

  • Decompose the question into smaller, answerable sub-questions

  • Decide which sub-questions require external data

  • Use tools selectively rather than reflexively

  • Cross-check information before synthesizing a final response

You can describe all of this in a prompt, and the agent will likely follow it. But now imagine you need ten such agents: one for infrastructure research, one for ML papers, one for internal documentation, one for customer questions, and so on. You are faced with an uncomfortable choice. Either you duplicate this logic across ten prompts, or you allow each agent to drift into its own interpretation of what “good research” means.

Agent skills exist to eliminate this tradeoff.

They allow reasoning patterns like this to be encoded once and reused everywhere. Instead of being informal prompt conventions, these patterns become explicit, named capabilities that can be attached to any agent that needs them. The result is not just less duplication, but more consistency across the entire agent system.

More broadly, agent skills address several systemic issues that tools alone cannot solve.

Reasoning Needs Context, Not Just Actions

Tools give agents the ability to execute actions, but they don’t explain how those actions should fit into a broader workflow. Agent skills provide the missing context that tells an agent when to act, when to wait, and when not to act at all. This includes organizational conventions, domain norms, and user-specific expectations that are difficult to encode as APIs but essential for reliable behavior.

Loading Only What the Agent Actually Needs

One of the quiet failure modes of agent systems is context overload. When every instruction is always present, agents waste attention on information that may not be relevant to the current task. Agent skills allow reasoning guidance to be introduced incrementally—high-level intent first, detailed procedures only when necessary—keeping the model focused and efficient.

Build Once, Use Everywhere

Without agent skills, reasoning logic tends to be rewritten for every new agent. With skills, that logic becomes portable. A planning or evaluation strategy can be defined once and reused across agents, products, and domains. This mirrors how software engineering moved from copy-pasted code to shared libraries, applied here to reasoning instead of execution.

Turning Expertise into a First-Class Artifact

As agents move into specialized domains, raw intelligence is no longer enough. They need structured domain knowledge and conventions. Agent skills provide a way to encode this expertise—whether legal reasoning, data workflows, or operational playbooks—into versioned, reviewable artifacts that teams can share and improve over time.

Reasoning You Can Actually Read and Review

A subtle advantage of agent skills is that they are designed to be human-readable. Defined in clear Markdown, they double as documentation and behavior specification. This makes them easier to audit, discuss, and refine, especially in contrast to tools whose behavior is often buried deep in code.

What Is a Skill in Claude, Exactly?

In the Claude ecosystem, a skill is a structured definition of a reusable reasoning capability. It tells the model how to behave in certain situations, what constraints to respect, and how to structure its internal thinking.

A skill is not executable code in the traditional sense. It does not run outside the model. Instead, it is consumed by the model as part of its context, much like system instructions, but with clearer boundaries and intent.

Agent skills are designed to be:

  • Reusable across agents
  • Explicitly named and scoped
  • Easier to version and update

This alone dramatically improves maintainability in complex agent systems.

Files Required to Define a Claude Skill

Claude skills are defined as small, self-contained packages that describe a reusable reasoning capability. While the exact structure may evolve over time, the underlying idea is intentionally simple: a skill should clearly explain what it does, when it applies, and how the agent should reason while using it.

At minimum, a Claude skill is centered around a skill.md file. This file acts as both documentation and instruction. It is written in natural language but structured carefully enough that the model can reliably internalize and apply it.

In practice, a skill package may include:

  • skill.md — the core definition of the skill

  • Optional supporting files (examples, references, constraints)

  • Optional metadata used by the agent runtime to register the skill

Folder Structure for Agent Skills
source: Akshay Kokane

The design mirrors how humans already document best practices. Instead of encoding reasoning implicitly inside prompts or code, the logic is surfaced explicitly as a reusable artifact.

Example of a skill.md File

Imagine a skill designed to help an agent perform careful, multi-step analysis. A simplified version of a skill.md file might describe:

  • The goal of producing structured, verifiable reasoning
  • An expectation that assumptions are explicitly stated
  • A requirement to validate conclusions before responding
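
For illustration, a hypothetical skill.md along these lines might look like the following. The layout and field names are assumptions made for this example, not Anthropic’s official schema.

```markdown
---
name: careful-analysis
description: Structured, verifiable multi-step analysis with explicit assumptions.
---

# Careful Analysis

When a task calls for analysis rather than a quick answer:

1. Restate the question and list every assumption you are making.
2. Break the problem into smaller sub-questions and address them one at a time.
3. Before responding, check each conclusion against the stated assumptions and
   flag anything that could not be verified.

Never present an unverified conclusion as fact; label it as an open question.
```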

The power here is not in the syntax, but in the consistency. Every agent using this skill will approach problems in roughly the same way, even across very different tasks.

This is where agent skills start to feel less like prompts and more like architectural components.

How Claude Calls and Uses Agent Skills

From the agent’s perspective, using agent skills is straightforward. Skills are attached to the agent at configuration time, much like tools. Once attached, the model can implicitly apply the skill whenever relevant.

There is no explicit “call” in the same sense as a tool invocation. Instead, the skill shapes the agent’s reasoning continuously. This is an important distinction. Tools are discrete actions. Agent skills are persistent influences.

Because of this, multiple agent skills can coexist within a single agent. One skill might govern planning behavior, another might enforce safety constraints, and a third might specialize the agent for a particular domain.
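
As a rough sketch of that layering, the configuration below is purely hypothetical (it is not Anthropic’s actual API), but it shows how tools and several skills might sit side by side on one agent.

```python
# Hypothetical agent configuration; names and fields are illustrative only.
agent_config = {
    "model": "claude-latest",              # placeholder model identifier
    "tools": ["web_search", "sql_query"],  # discrete, callable actions
    "skills": [                            # persistent reasoning influences
        "careful-analysis",                # planning / verification behavior
        "safety-guardrails",               # constraints applied to every step
        "insurance-domain-conventions",    # domain specialization
    ],
}
```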

Why Agent Skills and Tools Are Not Interchangeable

It can be tempting to ask whether agent skills could simply be implemented as tools. In practice, this approach quickly breaks down.

Tools are reactive. They wait to be called. Agent skills are proactive. They influence how decisions are made before any tool is invoked.

If you tried to implement a planning skill as a tool, the agent would still need to know when to call it and how to apply its output. That logic would live elsewhere, defeating the purpose.

This is why agent skills and tools are not interchangeable abstractions. They live at different layers of the agent stack and solve different problems.

Understand the evolution of agentic LLMs and how autonomous reasoning and tool integration are shaping the future of AI systems

Using Agent Skills and Tools Together

The real power emerges when agent skills and tools are used together. A well-designed agent might rely on:

  • Agent skills to structure reasoning and decision-making
  • Tools to perform external actions and data retrieval

For example, a skill might enforce a rule that all external information must be cross-checked. The tools then provide the mechanisms to fetch that information. Each does what it is best at.

This layered approach leads to agents that are more reliable, more interpretable, and easier to evolve over time.

Why Agent Skills Matter Going Forward

As agentic systems continue to grow in complexity, the need for modular reasoning abstractions will only increase. Tools solved the problem of external capability reuse. Agent skills address the equally important problem of internal behavior reuse.

If tools were the moment agents learned to act, agent skills are the moment they started to think consistently.

And that shift, subtle as it may seem, is likely to define the next phase of agent design.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

If you’ve been paying attention to where language models are heading, there’s a clear trend: context windows are getting bigger, not smaller. Agents talk to tools, tools talk back, agents talk to other agents, and suddenly a single task isn’t a neat 2k-token prompt anymore. It’s a long-running conversation with memory, state, code, and side effects. This is the world of agentic systems and deep agents, and it’s exactly where things start to break in subtle ways.

The promise sounds simple. If we just give models more context, they should perform better. More history, more instructions, more documents, more traces of what already happened. But in practice, something strange shows up as context grows. Performance doesn’t just plateau; it often degrades. Important details get ignored. Earlier decisions get contradicted. The model starts to feel fuzzy and inconsistent. This phenomenon is often called context rot, and it’s one of the most practical limitations we’re running into right now.

This blog is a deep dive into that problem and into a new idea that takes it seriously: recursive language models. The goal here is not hype. It’s to understand why long-context systems fail, why many existing fixes only partially work, and how recursion changes the mental model of what a language model even is.

Context is growing, but reliability isn’t

Agentic workflows almost force longer contexts. A planning agent might reason for several steps, call a tool, inspect the result, revise the plan, call another tool, and so on. A coding agent might ingest an entire repository, write code, run tests, read error logs, and iterate. Each step adds tokens. Each iteration pushes earlier information further away in the sequence.

In theory, attention lets a transformer look anywhere it wants. In practice, that promise is conditional. Models are trained on distributions of sequence lengths, positional relationships, and attention patterns. When we stretch those assumptions far enough, we see degradation. The model still produces fluent text, but correctness, coherence, and goal alignment start to slip.

That slippage is what people loosely describe as context rot. It’s not a single bug. It’s a collection of interacting failures that only show up when you scale context aggressively.

Understand why memory bottlenecks matter more than raw context size in modern LLMs.

What Context Rot actually is

Context rot is the gradual loss of effective information as a prompt grows longer. The tokens are still there. The model can technically attend to them. But their influence on the output weakens in ways that matter.

One way to think about it is signal-to-noise ratio. Early in a prompt, almost everything is signal. As the context grows, the model has to decide which parts still matter. That decision becomes harder when many tokens are only weakly relevant, or relevant only conditionally.

There are several root causes behind this effect, and they compound rather than act independently.

1. Attention dilution

Self-attention is powerful, but it’s not free. Each token competes with every other token for influence. When you have a few hundred tokens, that competition is manageable. When you have tens or hundreds of thousands, attention mass gets spread thin.

Important instructions don’t disappear, but they lose sharpness. The model’s internal representation becomes more averaged. This is especially problematic for agents, where a single constraint violated early can cascade into many wrong steps later.

2. Positional encoding degradation

Most transformer models rely on positional encodings that were trained on specific sequence length distributions. Even techniques designed for extrapolation, like RoPE or ALiBi, still face a form of distribution shift when pushed far beyond their training regime.

The model has seen far more examples of relationships between tokens at positions 1 and 500 than between positions 50,000 and 50,500. When you ask it to reason across those distances, you’re operating in a sparse part of the training distribution. The result is softer, less reliable attention.

3. Compounding reasoning errors

Long contexts often imply multi-step reasoning. Each step is probabilistic. A small mistake early on doesn’t just stay small; it conditions future steps. By the time you’re dozens of turns in, the model may be reasoning confidently from a flawed internal state.

This is a subtle but crucial point. Context rot isn’t just about forgetting. It’s also about believing the wrong things more strongly as time goes on.

4. Instruction interference

Another underappreciated factor is instruction collision. Long contexts often contain multiple goals, constraints, examples, and partial solutions. These can interfere with each other, especially when their relevance depends on latent state the model has to infer.

The longer the context, the harder it becomes for the model to maintain a clean hierarchy of what matters most right now.

Discover how action-oriented models extend LLM abilities for real-world task execution.

How People have tried to fix Context Rot

The industry didn’t ignore this problem. Many clever workarounds emerged, especially from teams building real agentic systems under pressure. But most of these solutions treat the symptoms rather than the root cause.

File-system-based memory

One approach popularized in systems like Claude is to move memory out of the prompt and into a file system. The agent writes notes, plans, or intermediate results to files and reads them back when needed.

This helps with token limits and makes state explicit. But it doesn’t actually solve context rot. The model still has to decide what to read, when to read it, and how to integrate it. Poor reads or partial reads reintroduce the same problems, just one level removed.

Periodic summarization

Another common technique is context compression. The agent periodically summarizes its own conversation, keeping only a condensed version of the past.

This reduces token count, but it introduces lossy compression. Summaries are interpretations, not ground truth. Once something is summarized incorrectly or omitted, it’s gone. Over many cycles, small distortions accumulate.

Context folding

Context folding tries to be more clever by hierarchically compressing context: recent details stay explicit, older details get abstracted.

This works better than naive summarization, but it still relies on the model’s ability to decide what is safe to abstract. That decision itself is subject to the same attention and reasoning limits.

Enter Recursive Language Models

In October 2025, Alex Zhang introduced a different way of thinking about the problem in a blog post that later became a full paper. The core idea behind recursive language models is deceptively simple: stop pretending that a single forward pass over an ever-growing context is the right abstraction.

Instead of one giant sequence, the recursive language model operates recursively over smaller, well-defined chunks of state. Each step produces not just text, but structured state that can be fed back into the model in a controlled way.

This reframes the recursive language model less as a static text predictor and more as a stateful program.

How Recursive Language Models Address Context Rot

The key insight of recursive language models is that context does not have to be flat. Information can be composed.

Rather than asking the model to attend across an entire history every time, the system maintains intermediate representations that summarize and formalize what has already happened. These representations are not free-form natural language. They are constrained, typed, and often executable.

By doing this, the model avoids attention dilution. It doesn’t need to rediscover what matters in a sea of tokens. The recursion boundary enforces relevance.

Step-by-step: How Recursive Language Models work

A recursive language model is not a new neural architecture. It is a thin wrapper around a standard language model that changes how context is accessed, while preserving the familiar abstraction of a single model call. From the user’s perspective, nothing looks different. You still call it as rlm.completion(messages), just as you would a normal language model API. The illusion is that the model can reason over near-infinite context.

Internally, everything hinges on a clear separation between the model and the context.

Recursive Language Models from the User's perspective
source: Alex Zhang

Each call to a recursive language model begins with what Alex Zhang calls the root language model, or the language model at depth zero. The root LM is only given the user’s query. The large body of associated context—documents, logs, codebases, transcripts—is not placed into the prompt at all. Instead, it is stored externally in an environment.

That environment is implemented as a persistent Python REPL loop, similar to a Jupyter Notebook. The full context is preloaded into memory as Python variables. For example:
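
A minimal sketch of that setup, with an assumed file name and variable name rather than the original post’s exact code:

```python
# Environment setup (illustrative): the full context is loaded into ordinary
# Python variables inside the REPL, never into the root LM's prompt.
with open("huge_document.txt") as f:   # hypothetical file
    context = f.read()                 # could be millions of characters

# The root LM will later inspect this variable through code it writes itself.
```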

Crucially, the root LM never sees context as tokens. It cannot attend over it. The context exists purely as data inside the environment.

The root LM interacts with this environment by emitting Python code. Each time the model outputs a code block, that code is executed in the REPL, and the result of the execution is fed back into the model’s context for the next step. Only the output is returned—typically truncated—so the root LM’s prompt stays small.

Explore the evolution from large language models to small language models

Suppose the user query is to find a specific fact buried somewhere in a massive document. The root LM might start by deciding how to look, not by trying to read everything. It expresses that decision as code:
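
The snippet below is an illustrative guess at what that code could look like; the search term and variable names are assumptions, not taken from the original post.

```python
# Scope the problem first: how many lines even mention the topic of interest?
matches = [line for line in context.splitlines()
           if "quarterly revenue" in line.lower()]   # hypothetical search term
print(len(matches))
```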

The environment executes this code. Maybe the output is:

128

That output—just the number 128—is what the root LM sees next. The model has learned something important: there are many relevant lines, so it should narrow further.

It might then refine the query:
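
Continuing the same illustrative sketch, the follow-up code might peek at only a handful of those matches:

```python
# Inspect a small sample of the matching lines instead of reading everything.
for line in matches[:5]:
    print(line[:120])   # truncate long lines so the output stays small
```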

Again, the environment executes the code and returns a small slice of the result, such as a handful of example lines. At no point has the model attended over the full document. It has only seen evidence produced by deliberate queries.

Recursive Language Models in a REPL environment
source: Alex Zhang

This loop continues: the root LM writes code, the environment executes it, and the model observes the outputs. Through this process, the model can inspect subsets of the context, perform transformations, run regex searches, and store intermediate results as variables, all without bloating its context window.

At some point, the root LM may decide that a subset of the context is still too large or too semantically rich to handle procedurally. This is where recursion enters. From inside the REPL, the root LM can launch a recursive language model call, treating it like a function:
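
In the sketch below, llm_query stands in for whatever helper the runtime actually exposes for spawning a depth-one call; the name and prompt are assumptions.

```python
# Hand a bounded chunk to a fresh depth-1 LM call and keep only its result.
filtered_chunk = "\n".join(matches[:200])   # still small enough to read in full
sub_answer = llm_query(
    "From the excerpt below, extract the exact figure the user asked about.\n\n"
    + filtered_chunk
)
```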

Here, filtered_chunk is a specific string or subset of the context stored in a variable. A new language model call is launched at depth one. That recursive LM sees only this small chunk, reasons over it, and returns a compact result—perhaps a summary, an extracted fact, or an answer to a sub-question.

The root LM receives the result, not the chunk itself. If needed, it can repeat this process, launching additional recursive calls over other subsets. Each call operates in a well-bounded regime, avoiding the attention degradation that comes with long contexts.

Throughout all of this, the root LM’s own context remains small and stable. It contains the user query, recent code it has written, and short execution outputs. The bulk of the context lives in Python memory, not in tokens.

Once the root LM decides it has enough information to answer the original query, it terminates the loop. It can either return an answer directly:
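
For example, using a hypothetical termination helper (the real convention may differ):

```python
# End the loop by emitting the final answer as plain text.
final_answer("A short, direct answer to the user's original question.")
```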

Or, if the answer was constructed inside the environment using variables and intermediate results, it can return:
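
Continuing the same hypothetical sketch:

```python
# Return a result that was assembled inside the REPL environment.
final_answer(sub_answer)
```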

From the outside, the user sees a single model call and a single response. Internally, the system has performed a structured exploration of arbitrarily large context using code, recursion, and explicit control flow.

This is why recursive language models avoid context rot at a fundamental level. The model does not try to read long context with attention. Instead, it queries it. Long context becomes data to be manipulated, not text to be attended over—and that shift in abstraction makes all the difference.

An example of a recursive language model (RLM) call
source: Alex Zhang

Read about the rise of autonomous language models that can plan and act.

Why This Matters

At first glance, recursive language models might seem like an implementation detail. After all, the user still makes a single model call and gets a single answer. But the shift they introduce is much deeper than an API trick. They change what we expect a language model to do when faced with long-horizon reasoning and massive context.

For the past few years, progress has largely come from scaling context windows. More tokens felt like the obvious solution to harder problems. If a model struggles to reason over a codebase, give it the whole repo. If it forgets earlier steps in an agent loop, just keep everything in the prompt. But context rot is a signal that this approach has diminishing returns. Attention is not a free lunch, and long contexts quietly push models into regimes they were never trained to handle reliably.

Recursive language models address this at the right level of abstraction. Instead of asking a model to absorb all context at once, they let the model interact with context. The difference is subtle but profound. Context becomes something the model can query, filter, and decompose, rather than something it must constantly attend to.

Conclusion

Context rot is not a minor inconvenience. It’s a fundamental symptom of pushing language models beyond the limits of flat, attention-based reasoning. As we ask models to operate over longer horizons and richer environments, the cracks become impossible to ignore.

Recursive language models offer a compelling alternative, and what’s striking about this approach is how modest it is. There’s no new architecture, no exotic training scheme. Just a careful rethinking of how a language model should interact with information that doesn’t fit neatly into a single forward pass. In that sense, recursive language models feel less like a breakthrough and more like a course correction.

As agentic systems become more common and more ambitious, ideas like this will matter more. The future likely won’t belong to models that can attend to everything all the time, but to systems that know how to look, where to look, and when to delegate. Recursive language models are an early, concrete step in that direction—and a strong signal that the next gains in reliability will come from better structure, not just more tokens.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

As we stand on the brink of the next wave of AI evolution, large action models (LAMs) are emerging as a foundational paradigm to move beyond mere text generation and toward intelligent agents that can act, not just speak. In this post, we’ll explain why LLMs often aren’t enough for truly agentic workflows, how Large Action Models offer a compelling next step, what their core characteristics are, how they’re trained and integrated, and what real-world uses might look like.

Why LLMs aren’t enough for agentic workflows (the need for LAM)

Over the past few years, large language models (LLMs) — models trained to understand and generate human-like text — have made remarkable progress. They can draft emails, write code, summarize documents, answer questions, and even hold conversations. Their strengths lie in language understanding and generation, multimodal inputs, and zero- or few-shot generalization across tasks.

Yet, while LLMs shine in producing coherent and contextually relevant text, they hit a fundamental limitation: they are passive. They output text; they don’t execute actions in the world. That means when a user asks “book me a flight,” or “update my CRM and send follow-up email,” an LLM can produce a plan or instructions but cannot interact with the airline’s booking system, a CRM database, or an email client.

In short: LLMs lack agency. They cannot directly manipulate environments (digital or physical), cannot execute multi-step sequences on behalf of users, and cannot interact with external tools or systems in an autonomous, reliable way.

But many real-world applications demand action, not just advice. Users expect AI agents that can carry out tasks end-to-end: take intent, plan steps, and execute them in real environments. This gap between what LLMs can do and what real-world workflows require is precisely why we need Large Action Models.

Explore how LLMs evolve into agentic systems — great background to contrast with LAMs.

From LLMs to LAMs

The shift from LLMs to LAMs is more than a simple rebranding — it’s a conceptual transition in how we think about AI’s role. While an LLM remains a “language generator,” a Large Action Model becomes a “doer”.

In the seminal paper Large Action Models: From Inception to Implementation, the authors argue that to build truly autonomous, interactive agents, we need models that go beyond text: models that can interpret commands, plan action sequences, and execute them in a dynamic environment.

One helpful way to visualize the difference: an LLM might respond to “Create a slide deck from draft.docx” by outputting a plan (e.g., “open the draft, create slides, copy content, format, save”), but stops there. A Large Action Model would go further — generating a sequence of actionable commands (e.g., open file, click “New Slide,” copy content, format, save), which an agent can execute in a real GUI environment.
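
As a rough illustration of that contrast, the snippet below imagines both outputs for the slide-deck request; the action schema is hypothetical, not the paper’s format.

```python
# An LLM stops at a textual plan.
llm_output = (
    "1. Open draft.docx  2. Create a new slide deck  "
    "3. Copy the content onto slides  4. Format and save"
)

# A LAM emits executable steps that an agent runtime can ground in a real GUI.
lam_output = [
    {"action": "open_file",   "target": "draft.docx"},
    {"action": "click",       "target": "New Slide"},
    {"action": "paste_text",  "target": "slide_1_body", "source": "draft_section_1"},
    {"action": "apply_style", "target": "slide_1", "style": "title_and_content"},
    {"action": "save_as",     "target": "draft_slides.pptx"},
]
```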

Thus, the transition from LLM to LAM involves not only a shift in output type (text → action) but in role: from assistant or advisor to operative agent.

From LLMs to LAM - Large Action Models
source: https://arxiv.org/pdf/2412.10047

Characteristics of Large Action Models

What distinguishes LAMs from LLMs? What features enable them to act rather than just talk? Based on the foundational paper and complementary sources, we can identify several defining characteristics:

Interpretation of user intent

Large Action Models must begin by understanding what a user wants, not just as a text prompt, but as a goal or intention to be realized. This involves parsing natural language (or other input modalities), inferring the user’s objectives, constraints, and context.

Learn the core steps to build autonomous agents — a practical primer before implementing LAMs.

Action generation

Once the intent is clear, LAMs don’t output more language — they output actions (or sequences of actions). These actions might correspond to clicking UI elements, typing into forms, executing commands, using APIs, or other interactions with software or systems.

Dynamic planning and adaptation

Real-world tasks often require multi-step workflows, branching logic, error handling, and adaptation to changing environments. Large Action Models must therefore plan sequences of subtasks, decompose high-level goals into actionable steps, and react dynamically if something changes mid-process.

Specialization and efficiency

Because Large Action Models are optimized for action, often in specific environments, they can afford to be more specialized (focused on particular domains, such as desktop GUI automation, web UI interaction, SaaS workflows, etc.) rather than the general-purpose scope of LLMs. This specialization can make them more efficient, both computationally and in terms of reliability, for their target tasks.

Additionally, an important technical dimension: many Large Action Models rely on neuro-symbolic AI — combining the pattern recognition power of neural networks with symbolic reasoning and planning. This hybrid enables them to reason about abstract goals, plan logically structured action sequences, and handle decision-making in a way that pure language models (or pure symbolic systems) struggle with.

Large Action Models Behind the Scenes
source: Salesforce

How Large Action Models are trained

Building a functional LAM is more involved than training a vanilla LLM. The pipeline proposed in the Large Action Models paper outlines a multi-phase workflow.

What kind of data is needed

To train Large Action Models, you need action data: not just text, but records of actual interactions, including sequences of actions, environment states before and after each action, and the goal or intent that motivated them. This dataset should reflect realistic workflows, with all their branching logic, mistakes, corrections, variations, and context shifts.

This kind of data can come from “path data”: logs of human users performing tasks, including every click, keystroke, UI state change, timing, and context.

Because such data is more scarce and expensive than plain text corpora (used for LLMs), collecting and curating it properly is more challenging.

Data to Action - Large Action Models
source: Datacamp

Why evaluation is so important while training LAMs

Because Large Action Models don’t just generate text — they execute actions — the cost of error is higher. A misgenerated sentence is inconvenient; a misgenerated action could wreak havoc: submitting the wrong form, deleting data, triggering unintended side effects, or even causing security issues.

Therefore, rigorous evaluation (both offline and in real or simulated environments) is critical before deployment. The original paper uses a workflow starting with offline evaluation (on pre-collected data), followed by integration into an agent system, environment grounding, and live testing in a Windows OS GUI environment.

Evaluation must assess task success rate, robustness to environment changes, error-handling, fallback mechanisms, safety, and generalization beyond the training data.

Discover retrieval-augmented agent techniques — useful when designing LAMs that rely on external knowledge.

Integration into agentic frameworks: memory, tools, environment, feedback

Once trained, a Large Action Model must be embedded into a broader agent system. This includes:

  • Tool integration: the ability to invoke APIs, UI automation frameworks, command-line tools, or other interfaces.
  • Memory/state tracking: agents need to remember prior steps, environment states, user context, and long-term information, especially for complex workflows.
  • Environment grounding & feedback loops: the agent must observe the environment, execute actions, check results, detect errors, and adapt accordingly.
  • Governance, safety & oversight: because actions can have consequences, oversight mechanisms (logging, human-in-the-loop, auditing, fallback) are often needed.

Part of the power in Large Action Models comes from neuro-symbolic AI, combining neural networks’ flexibility with symbolic reasoning and planning, to handle both nuanced language understanding and structured, logical decision making.

Large Action Model Training Pipeline
source: https://arxiv.org/pdf/2412.10047

Example Use Case: How LAMs Transform an Insurance Workflow (A Before-and-After Comparison)

To understand the impact of large action models in a practical setting, let’s examine how they change a typical workflow inside an insurance company. Instead of describing the tasks themselves, we’ll focus on how a Large Action Model executes them compared to a traditional LLM or a human-assisted workflow.

Before Large Action Models: LLM + Human Agent

In a conventional setup, even with an LLM assistant, the agent still performs most of the operational steps manually.

  1. During a customer call, the LLM may assist with note-taking or drafting summaries, but it cannot interpret multi-turn conversation flow or convert insights into structured actions.
  2. After the call, the human agent must read the transcript, extract key fields, update CRM entries, prepare policy quotes, generate documents, and schedule follow-up tasks.
  3. The LLM can suggest what to do, but the human agent is responsible for interpreting the suggestions, translating them into real actions, navigating UI systems, and correcting mistakes if anything goes wrong.

This creates inefficiency. The LLM outputs plans in text form, but the human remains the executor, switching between tools, verifying fields, and bridging the gap between language and action.

After LAMs: A Fully Action-Aware Workflow

Large Action Models fundamentally change the workflow because they are trained to understand the environment, map intent to actions, and execute sequences reliably.

Here’s how the same workflow looks through the lens of a Large Action Model:

1. Understanding user intent at a deeper resolution

Instead of merely summarizing the conversation, a Large Action Model:

  • Interprets the customer’s intent as structured goals: request for a quote, change of coverage, renewal discussion, additional rider interest, etc.
  • Breaks down these goals into actionable subgoals: update CRM field X, calculate premium Y, prepare document Z.

This is different from LLMs, which can restate what happened but cannot convert it into environment-grounded actions.
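
A sketch of what that structured interpretation might look like for the insurance call; every field name and value here is illustrative, not drawn from any specific LAM implementation.

```python
# Hypothetical structured-intent output derived from a customer call transcript.
structured_intent = {
    "goals": [
        {"type": "quote_request",   "product": "term_life", "coverage": 500_000},
        {"type": "coverage_change", "policy_id": "POL-1182", "add_rider": "accidental_death"},
    ],
    "subgoals": [
        {"action": "update_crm_field",  "field": "interest_level", "value": "high"},
        {"action": "calculate_premium", "inputs": ["age", "coverage", "rider"]},
        {"action": "prepare_document",  "template": "quote_letter"},
    ],
}
```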

2. Environment-aware reasoning rather than static suggestions

Instead of saying “You should update the CRM with this information,” a Large Action Model:

  • Identifies which CRM interface it is currently interacting with.
  • Parses UI layout or API schema.
  • Determines the correct sequence of clicks, field entries, or API calls.
  • Tracks state changes across the interface and adapts if the UI looks different from expected.
Large Action Models don’t assume a perfect environment—they react to UI changes and errors dynamically, something LLMs cannot do reliably.

3. Planning multi-step actions with symbolic reasoning

LAMs incorporate neuro-symbolic reasoning, enabling them to go beyond raw pattern prediction.

For example, if the premium calculation requires conditional logic (e.g., age > 50 triggers additional fields), a Large Action Model:

  • Builds a symbolic plan with branching logic.
  • Executes only the relevant branch depending on environment states.
  • Revises the plan if unexpected conditions occur (missing fields, mismatched data, incomplete customer history).

This is closer to how a trained insurance agent reasons—evaluating rules, exceptions, and dependencies—than how an LLM “guesses” the next token.

4. Error handling based on real-time environment feedback

LLMs cannot recover when their suggestions fail in execution.

Large Action Models, in contrast:

  • Detect that a field didn’t update, a form didn’t submit, or an API call returned an error.

  • Backtrack to the previous step.

  • Re-evaluate the environment.

  • Attempt an alternative reasoning path.

This closed-loop action-feedback cycle is precisely what allows Large Action Models to operate autonomously.
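
A minimal sketch of that loop, with hypothetical plan, environment, and helper methods standing in for a real agent runtime:

```python
# Illustrative observe-act-check loop; all objects and methods are stand-ins.
def run_workflow(plan, environment, max_retries=2):
    for step in plan:
        for _ in range(max_retries + 1):
            environment.execute(step)              # act on the environment
            state = environment.observe()          # read back the new UI/API state
            if step.succeeded(state):              # verify the effect, not just the call
                break
            step = step.revise(state)              # backtrack and try an alternative path
        else:
            environment.escalate_to_human(step, state)  # fallback when recovery fails
```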

5. End-to-end optimization

At a workflow level, this results in:

  • Less context switching for human agents.
  • Higher consistency and fewer manual data-entry errors.
  • Faster processing time because the LAM runs deterministic action paths.
  • More predictable outcomes—because every step is logged, reasoned, and validated by the model’s action policies.

The transformation isn’t simply about automation—it’s about upgrading the cognitive and operational layer that connects user intent to real-world execution.

Why LAMs Matter — And What’s Next

The emergence of Large Action Models represents more than incremental progress; it signals a paradigm shift: from AI as text-based assistants to AI as autonomous agents capable of real-world action. As argued in the paper, this shift is a critical step toward more general, capable, and useful AI — and toward building systems that can operate in real environments, bridging language and action.

That said, Large Action Models remain in early stages. There are real challenges: collecting high-quality action data, building robust evaluation frameworks, ensuring safety and governance, preventing unintended consequences, ensuring generalization beyond training environments, and dealing with privacy and security concerns.

The path forward will likely involve hybrid approaches (neuro-symbolic reasoning, modular tool integrations), rigorous benchmarking, human-in-the-loop oversight, and careful design of agent architectures.

Conclusion

Large action models chart a compelling path forward. They build on the strengths of LLMs, such as natural language understanding and context-aware reasoning, while bridging a key gap: the ability to act. For anyone building real-world AI agents, from enterprise automation to productivity tools to customer-facing systems, Large Action Models offer a blueprint for transforming AI from passive suggestions into autonomous action.

If you want to get deeper into how memory plays a role in agentic AI systems, a critical component when LAMs need to handle long-term tasks, check out this related post on Data Science Dojo: What is the Role of Memory in Agentic AI Systems? Unlocking Smarter, Human-Like Intelligence.

Or, if you are curious how LLM-based tools optimize inference performance and cost, useful context when building agentic systems, this post might interest you: Unlocking the Power of KV Cache: How to Speed Up LLM Inference and Cut Costs (Part 1).

LAMs are not “magic” — they are a powerful framework under active research, offering a rigorous way forward for action-oriented AI. As data scientists and engineers, staying informed and understanding both their potential and limitations will be key to designing the next generation of autonomous agents.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

In the first part of this series, we dug into why the KV cache exists, why it matters, and why it dominates the runtime characteristics of LLM inference. In Part 2, we’re going deeper into the systems-level issues that the KV cache introduces, particularly memory fragmentation and how this motivated the design of a new memory architecture for attention: paged attention, the foundational idea behind the vLLM inference engine.

Before diving into paged attention, you may want to revisit Part 1 of this series, where we unpack the fundamentals of the KV cache and why it dominates LLM memory behavior: KV Cache — How to Speed Up LLM Inference.

This post has one objective: make the vLLM paper’s ideas feel intuitive. The original work is dense (and excellent), but with the right framing, paged attention is actually a very natural idea — almost inevitable in hindsight. It’s essentially applying well-established operating systems concepts (paging, copy-on-write, block tables) to the KV cache problem that LLMs face. Once you see it, you can’t unsee it.

Let’s start with the root cause.

The Real Problem: KV Cache Is Huge, Dynamic, and Unpredictable

The KV cache holds one Key vector and one Value vector for every token in a sequence, across every layer and every attention head. For a typical 7B–13B model, this quickly grows into hundreds of megabytes or more per request. But the real challenge isn’t just size; it’s unpredictability.
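
To get a feel for the numbers, here is a back-of-the-envelope estimate. The layer and head dimensions below are typical of a 13B-class decoder (exact values vary by architecture), and FP16 storage is assumed.

```python
# Rough KV cache size estimate for a 13B-class decoder (illustrative dimensions).
num_layers = 40
num_heads  = 40
head_dim   = 128
bytes_fp16 = 2

# 2x because we store both a Key and a Value vector per token, per layer, per head.
bytes_per_token = 2 * num_layers * num_heads * head_dim * bytes_fp16
print(f"{bytes_per_token / 1024:.0f} KB per token")            # ~800 KB

for seq_len in (512, 2048):
    total = bytes_per_token * seq_len
    print(f"{seq_len} tokens -> {total / 1024**3:.2f} GB")      # ~0.39 GB, ~1.56 GB
```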

Different requests vary wildly in:

  • prompt length,
  • generation length,
  • decoding strategy (sampling, beam search),
  • number of branch paths,
  • when they finish.

An LLM serving system cannot know in advance how long a request will run or how many tokens it will eventually accumulate. Yet GPUs require strict, contiguous, pre-allocated tensor layouts for optimal kernel execution. This mismatch between a dynamic workload and static memory assumptions is the source of nearly all downstream problems.

The traditional answer is:
“Allocate a single contiguous tensor chunk large enough for the maximum possible length.”

And this is where the trouble starts.

How Contiguous Allocation Breaks GPU Memory: Internal and External Fragmentation

To understand why this is harmful, picture the GPU memory as a long shelf. Each LLM request needs to reserve a large rectangular box for its KV cache, even if the request only ends up filling a fraction of the box. And since every box must be contiguous, the allocator cannot place a request’s box unless it finds one uninterrupted region of memory of the right size.

Average percentage of memory waste in LLM serving systems - Understanding Paged Attention
source: https://arxiv.org/pdf/2309.06180

This creates three distinct kinds of waste, all described in the vLLM paper:

1. Reserved but Unused Slots

If the system allocates space for 2,048 possible tokens, but the request only produces 600 tokens, the remaining 1,448 positions are permanently wasted for the lifetime of that request. These unused slots cannot be repurposed.

2. Internal Fragmentation

Even within a request’s reserved slab, the actual KV cache grows token-by-token. Since the final length is unknown until the request finishes, internal fragmentation is unavoidable — you always over-allocate.

The paper observes that many real-world requests only use 20–30% of their allocated capacity. That means 70–80% of the reserved memory is dead weight for most of the request’s lifetime.

Discover the inner workings of LLMs — from tokenization to attention — to deepen your foundation before diving into paged attention.

3. External Fragmentation

Even worse, after many different requests have allocated and freed slabs of different sizes, the GPU memory layout ends up looking like Swiss cheese. The allocator may have plenty of free space in total, but not enough contiguous free space to fit a new request’s slab.

This causes new requests to fail even though the GPU technically has enough memory in aggregate.

The vLLM paper measures that only 20–38% of the allocated KV cache memory is actually used in existing systems. That’s an astonishingly low utilization for the largest memory component in LLM inference.

KV cache memory management in existing systems - Paged Attention
source: https://arxiv.org/pdf/2309.06180

This is the core problem: even before we run out of computation or bandwidth, we run out of contiguous GPU memory due to fragmentation.

Fine-Grained Batching: A Great Idea That Accidentally Worsens Memory Pressure

Before paged attention arrived, researchers attempted to improve throughput using smarter batching. Two techniques are important here:

  • Cellular batching,
  • Iteration-level scheduling (which vLLM cites explicitly).

These mechanisms work at token-level granularity instead of request-level granularity. Instead of waiting for entire requests to complete before adding new ones, the server can add or remove sequences each decoding iteration. This dramatically improves compute utilization because it keeps the GPU busy with fresh work every step.

In fact, iteration-level batching is almost required for modern LLM serving: it avoids the inefficiency where one long-running request delays the whole batch.

But here’s the catch that the vLLM paper highlights:

Fine-grained batching increases the number of concurrently active sequences.

And therefore:

Every active sequence maintains its own full, contiguous KV cache slab.

So while compute utilization goes up, memory pressure skyrockets.

If you have 100 active sequences simultaneously interleaved at the decoding step, you effectively have 100 large, partially empty, but reserved KV cache slabs sitting in memory. Fragmentation becomes even worse, and the chance of running out of contiguous space increases dramatically.

In other words:

Fine-grained batching solves the compute bottleneck but amplifies the memory bottleneck.

The system becomes memory-bound, not compute-bound.

This brings us to the core insight in the vLLM paper.

Explore how MCP is shaping the next generation of AI workflows, a must-read for building robust LLM systems.

Paged Attention: A Simple but Profound Idea

The vLLM authors ask a simple question:

“Why not treat the KV cache like an operating system treats virtual memory?”

In other words:

  • Break memory into fixed-size blocks (like OS pages).
  • Each block stores KV vectors for a small number of tokens (e.g., 16 tokens).
  • Maintain a mapping from logical blocks (the sequence’s view) to physical blocks (actual GPU memory).
  • Blocks can live anywhere in GPU memory — no need for contiguous slabs.
  • Blocks can be shared across sequences.
  • Use copy-on-write to handle divergence.
  • Reclaim blocks immediately when sequences finish.

This block-based KV representation is what the paper names paged attention.

Paged Attention Algorithm
source: https://arxiv.org/pdf/2309.06180

You might think: “Doesn’t attention require K and V to be in one contiguous array?”
Mathematically, no — attention only needs to iterate over all previous K/V vectors. Whether those vectors live contiguously or are chunked into blocks is irrelevant to correctness.

This means we can rewrite attention in block form:
for each block:

  • read its Keys
  • compute dot-product scores with the Query
  • maintain running softmax statistics (a running max and sum) so normalization is correct across all blocks
  • read block’s Values
  • accumulate outputs

The underlying math is identical; only the memory layout changes.
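
The NumPy sketch below makes that claim concrete: it computes attention for one query both over a single contiguous K/V array and block by block (keeping running softmax statistics so the normalization spans all blocks), and checks that the results match. The block size and dimensions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_tokens, block_size = 64, 100, 16

q = rng.standard_normal(d)
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))

# Reference: attention over one contiguous K/V array.
scores = K @ q / np.sqrt(d)
weights = np.exp(scores - scores.max()); weights /= weights.sum()
ref = weights @ V

# Block-wise: same math, but K/V are visited one block at a time,
# keeping a running max (m), running normalizer (l), and accumulator (acc).
m, l, acc = -np.inf, 0.0, np.zeros(d)
for start in range(0, n_tokens, block_size):
    Kb, Vb = K[start:start + block_size], V[start:start + block_size]
    s = Kb @ q / np.sqrt(d)
    m_new = max(m, s.max())
    scale = np.exp(m - m_new)              # exp(-inf) = 0 on the first block
    acc, l = acc * scale, l * scale        # rescale what we accumulated so far
    p = np.exp(s - m_new)
    acc += p @ Vb
    l += p.sum()
    m = m_new
out = acc / l

print(np.allclose(ref, out))  # True: the layout changed, the math didn't
```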

From here, we get several huge benefits.

Explore what “agentic” LLMs can do — planning, tool-use, memory — and how memory architecture like paged attention underpins them.

How Paged Attention Fixes Memory Fragmentation

Paged attention eliminates the need for large contiguous slabs. Each sequence grows its KV cache block-by-block, and each block can be placed anywhere in GPU memory. There is no long slab to reserve, so external fragmentation largely disappears.

Internal fragmentation also collapses.

The only unused memory per sequence is inside its final partially filled block — at most the space for block_size − 1 tokens. If the block size is 16 tokens, the maximum internal waste is 15 tokens. Compare that to 1,000+ tokens wasted in the old approach.

Reserved-but-unused memory disappears entirely.

There are no pre-allocated full-size slabs. Blocks are allocated on demand.

Memory utilization becomes extremely predictable.

For N tokens, the system allocates exactly ceil(N / block_size) blocks. Nothing more.

This is the same structural benefit that operating systems gain from virtual memory: the illusion of a large contiguous space, backed by small flexible pages underneath.

Logical Blocks, Physical Blocks, and Block Tables

The vLLM architecture uses a simple but powerful structure to track blocks:

  • Logical blocks: the sequence’s view of its KV cache
  • Physical blocks: actual GPU memory chunks
  • Block table: a mapping from logical indices to physical block IDs

This is visually similar to the page table in any OS textbook.

When a sequence generates tokens:

  • It writes K/V into the current physical block.
  • If the block fills up, vLLM allocates a new one and updates the table.
  • If two sequences share a prefix, their block tables point to the same physical blocks.

All of this is efficient because the attention kernel is redesigned to loop over blocks instead of a single contiguous tensor.
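
Here is a small, illustrative sketch of that bookkeeping (not vLLM’s actual implementation): a free-list allocator hands out physical block IDs on demand, and each sequence keeps a block table mapping its logical blocks to physical ones.

```python
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    """Hands out physical block IDs from a free list; no contiguity required."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1
        # A real system would now write this token's K/V vectors into slot
        # (num_tokens - 1) % BLOCK_SIZE of the last physical block.

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):
    seq.append_token()

print(len(seq.block_table))   # ceil(40 / 16) = 3 blocks, nothing pre-reserved
print(seq.block_table)        # physical IDs need not be contiguous
```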

Sharing and Copy-on-Write: Why Paged Attention Helps Beam Search and Sampling

This is one of the most elegant parts of the paper.

When doing:

  • beam search, or
  • parallel sampling, or
  • agentic branching

many sequences share long prefixes.

Under the traditional contiguous layout, you either:

  • duplicate the KV cache for each branch (expensive), or
  • compromise batch flexibility (restrictive).

With paged attention:

  • multiple sequences simply reference the same physical blocks,
  • and only when a sequence diverges do we perform copy-on-write at the block level.

Copying one block is far cheaper than copying an entire slab. This leads to substantial memory savings — the paper reports that shared prefixes during beam search reduce KV memory usage by up to 55% in some scenarios.
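
A compact way to picture the mechanism is with reference counts on physical blocks, as in the hypothetical sketch below: forking a sequence just copies its block table and bumps the refcounts, and an actual copy happens only when a shared block is about to be written.

```python
# Illustrative copy-on-write over KV blocks (not vLLM's real code).
ref_count = {}          # physical block id -> number of sequences referencing it
next_block_id = 0

def new_block():
    global next_block_id
    block_id = next_block_id
    next_block_id += 1
    ref_count[block_id] = 1
    return block_id

def fork(block_table):
    """A new branch shares every physical block with its parent."""
    for b in block_table:
        ref_count[b] += 1
    return list(block_table)

def write_to_last_block(block_table):
    """Copy-on-write: duplicate the last block only if someone else shares it."""
    last = block_table[-1]
    if ref_count[last] > 1:
        ref_count[last] -= 1
        fresh = new_block()        # the K/V contents of `last` would be copied here
        block_table[-1] = fresh
    return block_table

parent = [new_block(), new_block(), new_block()]   # shared prompt prefix
child = fork(parent)                               # e.g., a new beam or sample

write_to_last_block(child)   # child diverges: one block copied, two still shared
print(parent, child)         # [0, 1, 2] [0, 1, 3]
print(ref_count)             # {0: 2, 1: 2, 2: 1, 3: 1}
```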

How the Paged Attention Kernel Works (Intuitive View)

Even though the memory layout changes, the math of attention remains untouched.

Here’s the intuitive flow inside the kernel:

  1. Take the Query for the new token.

  2. Loop over each logical block of previous tokens.

  3. For each block:

    • Look up the physical block address through the block table.

    • Load the Keys in that block.

    • Compute attention scores (Q · Kᵀ).

    • Load the Values in that block.

    • Multiply and accumulate.

  4. Normalize across all blocks.

  5. Produce the final attention output.

Kernel optimizations in the paper include:

  • fused reshape-and-write kernels for block writes,

  • block-aware attention kernels,

  • efficient memory coalescing strategies,

  • minimizing per-block overhead.

While the block-aware kernels are slightly slower than fully contiguous ones, the system throughput increases dramatically because vLLM can batch far more requests simultaneously.

Paging Enables Swapping and Recomputing KV Blocks

Once KV data is broken into blocks, vLLM gains a capability that is nearly impossible with contiguous slabs: flexible eviction policies.

If GPU memory is full, vLLM can:

  • swap blocks to CPU memory, or

  • drop blocks entirely and recompute them if needed,

  • evict entire sequences’ blocks immediately.

The paper notes that recomputation can be faster than swapping small blocks over PCIe for certain workloads — an insight that wouldn’t be possible without block-level memory.

Block table translation in vLLM - Paged Attention
source: https://arxiv.org/pdf/2309.06180

This is a fundamental shift in how LLM serving systems deal with memory pressure.

Why Block Size Matters

Block size is the main tuning knob. A smaller block size:

  • reduces internal fragmentation,
  • increases sharing granularity,
  • but increases kernel overhead,
  • and increases the number of memory lookups.

A larger block size:

  • improves kernel efficiency,
  • but wastes more memory.

The vLLM authors test many configurations and find that around 16 tokens per block strikes a balance. But workloads differ, and this is a tunable dimension in future variations of paged attention.
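
As a rough illustration of the trade-off, the snippet below compares worst-case internal waste (the unused tail of the last block) against the number of block-table entries for a 1,000-token sequence at several block sizes. The numbers are purely illustrative; real tuning also depends on kernel behavior and hardware.

```python
import math

seq_len = 1000  # tokens in a hypothetical sequence

print("block_size  blocks  max_wasted_tokens")
for block_size in (8, 16, 32, 128):
    num_blocks = math.ceil(seq_len / block_size)   # block-table entries to track and look up
    max_waste = block_size - 1                     # worst-case tail of the last block
    print(f"{block_size:>10}  {num_blocks:>6}  {max_waste:>17}")
```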

Paged Attention vs. Traditional Systems: Throughput Gains

While paged attention increases per-kernel latency (~20–26% overhead), the end-to-end throughput improves by 2–4× because:

  • batches become much larger,
  • memory is no longer the bottleneck,
  • iteration-level scheduling can run without exploding memory use,
  • shared prefixes do not duplicate KV cache,
  • requests no longer fail due to lack of contiguous space.

This is the core result:
paged attention trades tiny per-kernel overhead for massive system-wide gains.

Why Paged Attention Works: A Systems Perspective

The beauty of paged attention is that it doesn’t try to fight the GPU or the attention kernel. Instead, it sidesteps the original constraint entirely.

Traditional systems try to squeeze dynamic workloads into rigid, contiguous layouts and then fight the consequences (compaction, large reservations, fragmentation). Paged attention flips the model: accept that token sequences grow unpredictably, and design memory as though you were building a small operating system for the KV cache.

Once you see it through that lens, the entire design becomes obvious:

  • block tables
  • shared blocks
  • copy-on-write
  • demand-based allocation
  • block-level eviction
  • block-level recomputation
  • fragmentation elimination
  • higher effective batch sizes

Paged attention is the kind of engineering idea that feels both novel and inevitable.

Practical Lessons for Engineers Using Paged Attention

If you’re building LLM services or agentic systems, here are some practical takeaways:

  • Measure how much of your KV cache memory is actually used. Traditional systems waste the majority of it.
  • If your batch sizes are small because of memory, paged attention will help dramatically.
  • If you rely on beam search, multi-sampling, or agent branching, block-level prefix sharing is a huge win.
  • If you use iteration-level scheduling, you need a KV cache representation that doesn’t explode memory.
  • Understand block size trade-offs (paging is not free; kernel overhead exists).
  • Consider recomputation as a valid alternative to swapping for certain workloads.

Conclusion: Paged Attention as the New Default Mental Model

Paged attention is not just another incremental optimization. It is a new lens for thinking about how KV cache memory should be managed in autoregressive models.

The math of attention stays the same. What changes is everything around it — the memory layout, the allocator, the scheduler, and the way prefix sharing works. The payoff is enormous: far less waste, far more flexibility, and significantly higher throughput. In many ways, paged attention is to KV memory what virtual memory was to general-purpose computing: a foundational concept that unlocks better utilization of hardware resources.

If you’re serving LLMs in production — or building agentic systems that rely on long-running multi-step reasoning — paged attention is now a core idea you should keep in your toolkit.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

Stay tuned — this is part of a 3-part deep-dive series.
In the upcoming posts, we’ll unpack Radix Attention, and a few other emerging techniques that push context efficiency even further. If you’re serious about building fast, scalable LLM systems, you’ll want to check back in for the next installments.

If you’ve spent any time experimenting with large language models (LLMs), you’ve likely encountered terms like queries, keys, and values—the building blocks of transformer attention. Understanding these concepts is the first step toward appreciating a powerful optimization called the KV cache, which is essential for both fast inference and cost efficiency. Today, we’re going to take a deep dive into what KV cache is, why it matters, and how you can optimize your prompts to take full advantage of it. And to take things even further, this post kicks off a 3-part series on modern attention-efficiency techniques. By the end, you’ll not only understand KV cache deeply—you’ll also be ready for the next installments where we break down newer methods like Paged Attention, Radix Attention, and a few emerging ideas reshaping long-context LLMs.

Queries, Keys, and Values: The Building Blocks of Attention

Before we can talk intelligently about KV cache, we need to revisit the basics of attention. In a transformer, every token you feed into the model generates three vectors: Q (query), K (key), and V (value). Each of these plays a distinct role in determining how the model attends to other tokens in the sequence.

  • Query (Q) represents the token that’s “asking” for information. It’s like a search request, what does this token want to know from the context?

  • Key (K) represents the token that’s being “indexed.” It’s like the label on a piece of information that queries can match against.

  • Value (V) represents the content or information of the token itself. Once a query finds a matching key, the corresponding value is returned as part of the output.

Query, Keys, Values in attention Mechanism - KV Cache Explanation
source: Medium

Mathematically, attention computes a score between the query and all keys using a dot product, applies a softmax to turn scores into weights, and then calculates a weighted sum of the values. This weighted sum becomes the output representation for that token. Step by step, it looks like this:

  1. Compute the projections Q = X·W_Q, K = X·W_K, V = X·W_V for all tokens X.

  2. Compute attention scores: S = Q·Kᵀ / √d_k

  3. Apply softmax to get attention weights: A = softmax(S)

  4. Compute the weighted sum of values: Output = A·V

Understanding how attention mechanism works - KV Cache Explanation
source: Medium

It’s elegant, but there’s a catch: during inference, the model repeats this process for every new token in an autoregressive manner, which can become extremely costly, especially for long sequences.
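
If it helps to see those four steps as code, here is a minimal single-head NumPy version for a short context. The dimensions and random projection matrices are placeholders, not a real trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k, n_tokens = 32, 16, 5

X = rng.standard_normal((n_tokens, d_model))          # token embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v                   # step 1: projections

scores = Q @ K.T / np.sqrt(d_k)                       # step 2: scaled dot-product scores
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # step 3: softmax over each row
output = weights @ V                                  # step 4: weighted sum of values

print(output.shape)   # (5, 16): one contextualized vector per token
```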

To understand where token embeddings come from (the input for Q/K/V), this historical look at embeddings is helpful.

What is KV Cache?

Here’s where the KV cache comes into play. When generating text token by token, the model recalculates keys and values for all previous tokens at each step. This repetition is wasteful because those previous K and V vectors don’t change; only the query for the new token is different. The KV cache solves this by storing K and V vectors for all previous tokens, allowing the model to reuse them instead of recomputing them.

Think of it this way: if you’re reading a long document and want to summarize it sentence by sentence, you wouldn’t reread the first paragraphs every time you process the next sentence. KV cache lets the model “remember” what it already processed.

How KV Cache works
source: https://medium.com/@joaolages/kv-caching-explained-276520203249

KV Cache in Practice: Cost Implications

To appreciate the value of KV cache, it’s worth considering how it affects cost in practice. Many commercial LLM providers charge differently for tokens based on whether they hit the cache:

  • With Anthropic Claude, cached input tokens are far cheaper than uncached tokens. Cached tokens can cost as little as $0.30 per million tokens, whereas uncached tokens can cost up to $3 per million tokens—a 10x difference.

  • Similarly, in OpenAI’s GPT models, repeated prefixes in multi-turn chats benefit from KV caching, drastically reducing both time-to-first-token (TTFT) and inference costs.

This cost gap alone makes KV cache a critical optimization for anyone building production systems or agentic AI pipelines.

Need a refresher on self-attention and how Q/K/V work under the hood? This post makes it clear.

KV Cache in the Era of Agentic AI

Today, many applications are more than simple Q&A models; they’re agentic systems performing multiple steps of reasoning, tool usage, and observations. Consider an AI agent orchestrating a series of actions:

  1. The agent receives a system prompt describing its objectives.

  2. It ingests a user prompt.

  3. It generates an output, executes actions, observes results, and logs observations.

  4. The agent generates the next action based on all prior context.

In such multi-turn workflows, KV cache hit rate is extremely important. Every token in the prefix that can be reused reduces the compute needed for subsequent reasoning steps. Without caching, the model recalculates K/V for all past tokens at each step—wasting time, compute, and money.

Fortunately, if your context uses identical prefixes, you can take full advantage of KV cache. Whether you’re running a self-hosted model or calling an inference API, caching drastically reduces TTFT and inference costs.

For agents and systems that manage long context windows, this piece outlines the core principles of context engineering.

Prompting Tips to Improve KV Cache Hit Rate

Maximizing KV cache hit rate isn’t magic; it’s about structured, deterministic prompting. The team at Manus highlights several practical strategies for real-world AI agents in their blog “Context Engineering for AI Agents: Lessons from Building Manus” (Manus, 2025).

Here’s a summary of the key recommendations:

  1. Keep your prompt prefix stable

    Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the KV cache from that point onward. A common example is including a timestamp at the beginning of the system prompt: while it allows the model to tell the current time, it completely kills cache reuse. Manus emphasizes that stable system prompts are critical for cache efficiency.

  2. Make your context append-only

    Avoid modifying previous actions or observations. Many programming languages and serialization libraries do not guarantee stable key ordering, which can silently break the cache if JSON objects or other structured data are rewritten. Manus recommends designing your agent’s context so that all new information is appended, leaving previous entries untouched.

  3. Mark cache breakpoints explicitly

    Some inference frameworks do not support automatic incremental prefix caching. In these cases, you need to manually insert cache breakpoints to control which portions of context are reused. Manus notes that these breakpoints should at minimum include the end of the system prompt and account for potential cache expiration.

By following these structured prompting strategies, you maximize KV cache reuse, which leads to faster inference, lower costs, and more efficient multi-turn agent execution, lessons that the Manus team has validated through real-world deployments. A minimal sketch of the first two tips follows.
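
The sketch below keeps the system prompt free of volatile values (no timestamps) and treats the context as an append-only list serialized deterministically. The structure is hypothetical; the point is that every new request shares a byte-identical prefix with the previous one.

```python
import json

# Tip 1: keep the prefix stable -- no timestamps or other volatile values here.
SYSTEM_PROMPT = "You are an assistant that files insurance claims step by step."

# Tip 2: append-only context; never rewrite earlier entries.
context = []

def add_observation(obs: dict):
    # sort_keys=True gives deterministic serialization, so re-serializing the
    # same history always produces identical text (cache-friendly).
    context.append(json.dumps(obs, sort_keys=True, ensure_ascii=False))

def build_prompt(user_msg: str) -> str:
    return "\n".join([SYSTEM_PROMPT, *context, f"User: {user_msg}"])

add_observation({"step": 1, "action": "open_form", "status": "ok"})
p1 = build_prompt("Continue the claim.")
add_observation({"step": 2, "action": "fill_fields", "status": "ok"})
p2 = build_prompt("Continue the claim.")

# The second prompt extends the first without modifying it, so the shared
# prefix can hit the KV cache.
print(p2.startswith(p1.rsplit("\n", 1)[0]))   # True
```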

The Basics of LLM Inference: Prefill and Decoding

To understand why prompt caching (KV caching) is such a game-changer, it helps to first see what happens under the hood during LLM inference. Large language models generate text in two distinct phases:

1. Prefill – Understanding the Prompt

In this phase, the model processes the entire input prompt all at once. Each token in the prompt is converted into embeddings, and the model computes hidden states and attention representations across all tokens. These computations allow the model to “understand” the context and produce the first output token. Essentially, the prefill phase is the model setting the stage for generation.

2. Decoding – Generating Tokens Autoregressively

Once the first token is generated, the model enters the decoding phase. Here, it generates one token at a time, using all previous tokens (both the input prompt and already-generated tokens) as context. Each new token depends on the history of what’s been produced so far.

Step-by-Step Example: QKV Computation Without KV Cache

Suppose you have the tokens:

[Alice, went, to, the, market]

At token 5 (“market”), without KV cache:

  1. Compute Q, K, V for “Alice” → store temporarily

  2. Compute Q, K, V for “went”

  3. Compute Q, K, V for “to”

  4. Compute Q, K, V for “the”

  5. Compute Q for “market” and recompute K, V for all previous tokens

Notice that K and V for the first four tokens are recomputed unnecessarily.

Step-by-Step Example: With KV Cache

With KV cache:

  1. Compute Q, K, V for each token once, as before, and store K and V

  2. At token 5 (“market”):

    • Compute Q only for “market”

    • Use cached K and V for previous tokens

    • Compute attention weights and output

KV Cache vs. Without KV Cache
source: https://sankalp.bearblog.dev/how-prompt-caching-works/

This simple change reduces compute and memory overhead substantially. The more tokens in the context, the bigger the savings.
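
The toy NumPy loop below contrasts the two approaches for a single attention head: the naive path recomputes K and V for every previous token at each step, while the cached path computes each token’s K/V once and appends them to a growing cache. The “model” here is just random projection matrices; it only illustrates where the recomputation disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 32, 16
W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) for _ in range(3))
tokens = rng.standard_normal((5, d_model))   # embeddings for [Alice, went, to, the, market]

def attend(q, K, V):
    s = K @ q / np.sqrt(d_k)
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

# Without KV cache: at step t we recompute K and V for all t tokens seen so far.
recomputed = 0
for t in range(1, len(tokens) + 1):
    ctx = tokens[:t]
    K, V = ctx @ W_k, ctx @ W_v              # recomputed from scratch every step
    recomputed += t
    out = attend(tokens[t - 1] @ W_q, K, V)

# With KV cache: each token's K and V are computed once and appended.
K_cache, V_cache, computed = [], [], 0
for t in range(len(tokens)):
    K_cache.append(tokens[t] @ W_k)
    V_cache.append(tokens[t] @ W_v)
    computed += 1
    out = attend(tokens[t] @ W_q, np.stack(K_cache), np.stack(V_cache))

print(recomputed, computed)   # 15 K/V computations vs. 5
```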

Want a full walkthrough of how LLMs process tokens and generate output? See this detailed explainer.

Limitations of Simple KV Cache

While KV cache provides significant improvements, it’s not perfect:

  • Memory growth: K/V tensors grow linearly with context length. Long sequences can exhaust GPU memory.

  • Static cache structure: Simple caching doesn’t handle sliding windows or context truncation efficiently.

  • Inflexibility with multi-query attention: Models using multi-query attention can reduce KV memory but may require different caching strategies.

These limitations have driven research into more advanced attention techniques.

Beyond Simple KV Cache: Advanced Techniques

As models scale to longer contexts, the simple KV cache runs into practical limits—mainly GPU memory and the cost of attending to every past token. That’s why newer techniques like Paged Attention and Radix Attention were developed. They’re not replacements for KV caching but smarter ways of organizing and accessing cached tokens so the model stays fast even with huge context windows. We’ll break down each of these techniques in the upcoming blogs, so stay tuned for that.

1. Paged Attention

Paged attention divides the model’s context into discrete “pages” of tokens, similar to how a computer manages memory with virtual pages. Instead of keeping every token in GPU memory, only the pages relevant to the current generation step are actively loaded.

  • Memory efficiency: Older pages that are less likely to impact the immediate token prediction can be offloaded to slower storage (like CPU RAM or even disk) or recomputed on demand.

  • Scalability: This allows models to process very long sequences—think entire books or multi-hour dialogues—without exceeding memory limits.

  • Practical example: Imagine a multi-turn chatbot with a 20,000-token conversation history. With naive caching, the GPU memory would balloon as each new token is generated. With paged attention, only the most relevant pages (e.g., the last few turns plus critical context) remain in memory, while earlier parts are swapped out. The model still has access to the full history if needed but doesn’t carry the entire context in GPU memory at all times.

2. Radix Attention

Radix attention, introduced with the SGLang serving framework, takes a different approach: rather than changing the attention math, it organizes cached prefixes in a radix-tree structure, so that requests and generation calls sharing a token prefix reuse the same cached keys and values instead of recomputing or duplicating them.

  • Automatic prefix reuse: Few-shot examples, system prompts, multi-turn chat histories, and branching agent calls often share long prefixes. The radix tree makes it cheap to find the longest cached prefix for a new request and start generation from there.

  • Efficient eviction: Because cached prefixes live in a tree, the server can evict least-recently-used branches when memory runs low while keeping hot, widely shared prefixes resident.

  • Ideal for agentic workflows: In systems where models repeatedly call the LLM with overlapping context, such as multi-step planning agents, tool-use loops, and memory-augmented AI, radix attention keeps the shared portion of the context cached across calls, so earlier information can keep influencing current decisions without being reprocessed at every step. A toy sketch of the prefix-matching idea follows.
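
Under that reading, the core data structure is a prefix tree over token IDs. The sketch below is purely illustrative (not SGLang’s implementation, and the token IDs are made up); it shows how a new request finds the longest cached prefix so only the remaining tokens need fresh KV computation.

```python
# Toy radix/prefix tree over token IDs (illustrative only).
class Node:
    def __init__(self):
        self.children = {}        # token id -> Node
        self.has_kv = False       # whether KV cache exists for the prefix ending here

root = Node()

def insert(tokens):
    """Record that the KV cache for this token prefix is now resident."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, Node())
        node.has_kv = True

def longest_cached_prefix(tokens):
    """Return how many leading tokens already have cached K/V."""
    node, matched = root, 0
    for tok in tokens:
        node = node.children.get(tok)
        if node is None or not node.has_kv:
            break
        matched += 1
    return matched

system_prompt = [101, 7, 42, 9]             # shared few-shot / system prefix
insert(system_prompt + [55, 56])            # first request, now fully cached

new_request = system_prompt + [77, 78, 79]  # second request shares the prefix
reused = longest_cached_prefix(new_request)
print(reused, "tokens reused;", len(new_request) - reused, "tokens need fresh KV")
# -> 4 tokens reused; 3 tokens need fresh KV
```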

A broad overview of LLM architecture and capabilities — great context if you’re new to transformers.

Conclusion: Why KV Cache Matters More Than Ever

The KV cache is one of the simplest yet most powerful optimizations in modern LLM workflows. It transforms inference from repetitive, expensive computation into fast, cost-efficient generation. In the age of agentic AI—where models are performing multi-step reasoning, tool use, and long-term planning—maximizing KV cache hit rate is no longer optional; it’s foundational.

From a practical standpoint, following prompt engineering best practices—keeping your prefix stable, maintaining an append-only context, and using deterministic serialization—can unlock dramatic savings in compute, memory, and latency. Combined with emerging attention techniques like paged and radix attention, KV cache ensures that your LLM workflows remain both performant and scalable.

In other words, the KV cache isn’t just a nice-to-have; it’s the backbone of fast, efficient, and cost-effective LLM inference.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

Stay tuned — this is part of a 3-part deep-dive series.
In the upcoming posts, we’ll unpack Paged Attention, Radix Attention, and a few other emerging techniques that push context efficiency even further. If you’re serious about building fast, scalable LLM systems, you’ll want to check back in for the next installments.