As we stand on the brink of the next wave of AI evolution, large action models (LAMs) are emerging as a foundational paradigm to move beyond mere text generation and toward intelligent agents that can act, not just speak. In this post, we’ll explain why LLMs often aren’t enough for truly agentic workflows, how Large Action Models offer a compelling next step, what their core characteristics are, how they’re trained and integrated, and what real-world uses might look like.
Why LLMs aren't enough for agentic workflows (the need for LAMs)
Over the past few years, large language models (LLMs) — models trained to understand and generate human-like text — have made remarkable progress. They can draft emails, write code, summarize documents, answer questions, and even hold conversations. Their strengths lie in language understanding and generation, multimodal inputs, and zero- or few-shot generalization across tasks.
Yet, while LLMs shine in producing coherent and contextually relevant text, they hit a fundamental limitation: they are passive. They output text; they don’t execute actions in the world. That means when a user asks “book me a flight,” or “update my CRM and send follow-up email,” an LLM can produce a plan or instructions but cannot interact with the airline’s booking system, a CRM database, or an email client.
In short: LLMs lack agency. They cannot directly manipulate environments (digital or physical), cannot execute multi-step sequences on behalf of users, and cannot interact with external tools or systems in an autonomous, reliable way.
But many real-world applications demand action, not just advice. Users expect AI agents that can carry out tasks end-to-end: take intent, plan steps, and execute them in real environments. This gap between what LLMs can do and what real-world workflows require is precisely why we need Large Action Models.
The shift from LLMs to LAMs is more than a simple rebranding — it’s a conceptual transition in how we think about AI’s role. While an LLM remains a “language generator,” a Large Action Model becomes a “doer”.
In the seminal paper Large Action Models: From Inception to Implementation, the authors argue that to build truly autonomous, interactive agents, we need models that go beyond text: models that can interpret commands, plan action sequences, and execute them in a dynamic environment.
One helpful way to visualize the difference: an LLM might respond to “Create a slide deck from draft.docx” by outputting a plan (e.g., “open the draft, create slides, copy content, format, save”), but stops there. A Large Action Model would go further — generating a sequence of actionable commands (e.g., open file, click “New Slide,” copy content, format, save), which an agent can execute in a real GUI environment.
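To make the contrast concrete, here is an illustrative Python sketch of the two kinds of output; the action names and schema are hypothetical, not taken from the paper:

```python
# Hypothetical outputs for the request "Create a slide deck from draft.docx".

# An LLM stops at a textual plan:
llm_output = (
    "1. Open draft.docx\n"
    "2. Create a new presentation\n"
    "3. Copy each section onto a slide\n"
    "4. Format the slides and save as deck.pptx"
)

# A LAM emits machine-executable steps that an agent runtime can carry out.
# The action vocabulary below is purely illustrative.
lam_output = [
    {"action": "open_file",  "args": {"path": "draft.docx"}},
    {"action": "create_doc", "args": {"type": "presentation"}},
    {"action": "click",      "args": {"target": "New Slide"}},
    {"action": "paste_text", "args": {"source": "draft.docx#section-1"}},
    {"action": "save_as",    "args": {"path": "deck.pptx"}},
]
```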
Thus, the transition from LLM to LAM involves not only a shift in output type (text → action) but in role: from assistant or advisor to operative agent.
source: https://arxiv.org/pdf/2412.10047
Characteristics of Large Action Models
What distinguishes LAMs from LLMs? What features enable them to act rather than just talk? Based on the foundational paper and complementary sources, we can identify several defining characteristics:
Interpretation of user intent
Large Action Models must begin by understanding what a user wants, not just as a text prompt, but as a goal or intention to be realized. This involves parsing natural language (or other input modalities), inferring the user’s objectives, constraints, and context.
Action generation
Once the intent is clear, LAMs don't output more language — they output actions (or sequences of actions). These actions might correspond to clicking UI elements, typing into forms, executing commands, using APIs, or other interactions with software or systems.
Dynamic planning and adaptation
Real-world tasks often require multi-step workflows, branching logic, error handling, and adaptation to changing environments. Large Action Models must therefore plan sequences of subtasks, decompose high-level goals into actionable steps, and react dynamically if something changes mid-process.
Specialization and efficiency
Because Large Action Models are optimized for action, often in specific environments, they can afford to be more specialized (focused on particular domains, such as desktop GUI automation, web UI interaction, SaaS workflows, etc.) rather than the general-purpose scope of LLMs. This specialization can make them more efficient, both computationally and in terms of reliability, for their target tasks.
Additionally, an important technical dimension: many Large Action Models rely on neuro-symbolic AI — combining the pattern recognition power of neural networks with symbolic reasoning and planning. This hybrid enables them to reason about abstract goals, plan logically structured action sequences, and handle decision-making in a way that pure language models (or pure symbolic systems) struggle with.
source: Salesforce
How Large Action Models are trained
Building a functional LAM is more involved than training a vanilla LLM. The pipeline proposed in the Large Action Models paper outlines a multi-phase workflow.
What kind of data is needed
To train Large Action Models, you need action data rather than just text: records of actual interactions, including sequences of actions, the environment states before and after each action, and the goal or intent that motivated them. This dataset should reflect realistic workflows, with all their branching logic, mistakes, corrections, variations, and context shifts.
This kind of data can come from "path data": logs of human users performing tasks, capturing every click, keystroke, UI state change, timing, and piece of context.
Because such data is scarcer and more expensive to obtain than the plain text corpora used for LLMs, collecting and curating it properly is a major challenge.
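A minimal sketch of what a single training record might look like (the field names are illustrative, not the paper's actual schema):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    """One recorded interaction: the environment before, the action taken, the result."""
    state_before: dict[str, Any]   # e.g., visible UI elements, focused window
    action: dict[str, Any]         # e.g., {"type": "click", "target": "Submit"}
    state_after: dict[str, Any]
    success: bool

@dataclass
class Trajectory:
    """A full demonstration: the user's goal plus the step-by-step path to it."""
    goal: str                      # natural-language intent, e.g. "file an expense report"
    steps: list[Step] = field(default_factory=list)
```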
source: Datacamp
Why evaluation is so important while training LAMs
Because Large Action Models don't just generate text — they execute actions — the cost of error is higher. A misgenerated sentence is inconvenient; a misgenerated action can wreak havoc: submitting the wrong form, deleting data, triggering unintended side effects, or even causing security issues.
Therefore, rigorous evaluation (both offline and in real or simulated environments) is critical before deployment. The original paper uses a workflow starting with offline evaluation on pre-collected data, followed by integration into an agent system, environment grounding, and live testing in a Windows OS GUI environment.
Evaluation must assess task success rate, robustness to environment changes, error-handling, fallback mechanisms, safety, and generalization beyond the training data.
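As a toy illustration of the offline stage, a strict exact-match metric over pre-collected trajectories might look like the sketch below; real evaluations go much further, replaying actions in live or simulated environments and checking the resulting state:

```python
def task_success_rate(predicted: list[list[dict]], reference: list[list[dict]]) -> float:
    """Fraction of tasks whose predicted action sequence exactly matches the recorded one."""
    assert len(predicted) == len(reference), "one predicted sequence per reference task"
    hits = sum(pred == ref for pred, ref in zip(predicted, reference))
    return hits / len(reference)
```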
Integration into agentic frameworks: memory, tools, environment, feedback
Once trained, a Large Action Model must be embedded into a broader agent system (a minimal loop sketch follows the list below). This includes:
Tool integration: the ability to invoke APIs, UI automation frameworks, command-line tools, or other interfaces.
Memory/state tracking: agents need to remember prior steps, environment states, user context, and long-term information, especially for complex workflows.
Environment grounding & feedback loops: the agent must observe the environment, execute actions, check results, detect errors, and adapt accordingly.
Governance, safety & oversight: because actions can have consequences, oversight mechanisms (logging, human-in-the-loop, auditing, fallback) are often needed.
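Putting these pieces together, a heavily simplified agent loop might look like the following sketch; lam, environment, and tools are hypothetical interfaces, not APIs defined in the paper:

```python
def run_agent(lam, environment, tools, goal: str, max_steps: int = 50) -> bool:
    """Skeleton of an action loop: observe, decide, act, check, adapt."""
    memory: list[dict] = []                        # prior steps and observations
    for _ in range(max_steps):
        observation = environment.observe()        # ground the model in the current state
        action = lam.next_action(goal, observation, memory)
        if action["type"] == "done":               # the model decides the goal is reached
            return True
        try:
            result = tools.execute(action)         # API call, UI click, shell command, ...
        except Exception as err:                   # oversight hook: log the failure, let the model adapt
            result = {"error": str(err)}
        memory.append({"action": action, "result": result})
    return False                                   # step budget exhausted: escalate to a human
```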
Part of the power of Large Action Models comes from neuro-symbolic AI: combining the flexibility of neural networks with symbolic reasoning and planning to handle both nuanced language understanding and structured, logical decision-making.
source: https://arxiv.org/pdf/2412.10047
Example Use Case: How LAMs Transform an Insurance Workflow (A Before-and-After Comparison)
To understand the impact of large action models in a practical setting, let’s examine how they change a typical workflow inside an insurance company. Instead of describing the tasks themselves, we’ll focus on how a Large Action Model executes them compared to a traditional LLM or a human-assisted workflow.
Before Large Action Models: LLM + Human Agent
In a conventional setup, even with an LLM assistant, the agent still performs most of the operational steps manually.
During a customer call, the LLM may assist with note-taking or drafting summaries, but it cannot interpret multi-turn conversation flow or convert insights into structured actions.
After the call, the human agent must read the transcript, extract key fields, update CRM entries, prepare policy quotes, generate documents, and schedule follow-up tasks.
The LLM can suggest what to do, but the human agent is responsible for interpreting the suggestions, translating them into real actions, navigating UI systems, and correcting mistakes if anything goes wrong.
This creates inefficiency. The LLM outputs plans in text form, but the human remains the executor, switching between tools, verifying fields, and bridging the gap between language and action.
After LAMs: A Fully Action-Aware Workflow
Large Action Models fundamentally change the workflow because they are trained to understand the environment, map intent to actions, and execute sequences reliably.
Here’s how the same workflow looks through the lens of a Large Action Model:
1. Understanding user intent at a deeper resolution
Instead of merely summarizing the conversation, a Large Action Model:
Interprets the customer’s intent as structured goals: request for a quote, change of coverage, renewal discussion, additional rider interest, etc.
Breaks down these goals into actionable subgoals: update CRM field X, calculate premium Y, prepare document Z.
This is different from LLMs, which can restate what happened but cannot convert it into environment-grounded actions.
2. Environment-aware reasoning rather than static suggestions
Instead of saying “You should update the CRM with this information,” a Large Action Model:
Identifies which CRM interface it is currently interacting with.
Parses UI layout or API schema.
Determines the correct sequence of clicks, field entries, or API calls.
Tracks state changes across the interface and adapts if the UI looks different from expected.
Large Action Models don’t assume a perfect environment—they react to UI changes and errors dynamically, something LLMs cannot do reliably.
3. Planning multi-step actions with symbolic reasoning
LAMs incorporate neuro-symbolic reasoning, enabling them to go beyond raw pattern prediction.
For example, if the premium calculation requires conditional logic (e.g., age > 50 triggers additional fields), a Large Action Model:
Builds a symbolic plan with branching logic.
Executes only the relevant branch depending on environment states.
Revises the plan if unexpected conditions occur (missing fields, mismatched data, incomplete customer history).
This is closer to how a trained insurance agent reasons—evaluating rules, exceptions, and dependencies—than how an LLM “guesses” the next token.
4. Error handling based on real-time environment feedback
LLMs cannot recover when their suggestions fail in execution.
Large Action Models, in contrast:
Detect that a field didn’t update, a form didn’t submit, or an API call returned an error.
Backtrack to the previous step.
Re-evaluate the environment.
Attempt an alternative reasoning path.
This closed-loop action-feedback cycle is precisely what allows Large Action Models to operate autonomously.
5. End-to-end optimization
At a workflow level, this results in:
Less context switching for human agents.
Higher consistency and fewer manual data-entry errors.
Faster processing time because the LAM runs deterministic action paths.
More predictable outcomes—because every step is logged, reasoned, and validated by the model’s action policies.
The transformation isn’t simply about automation—it’s about upgrading the cognitive and operational layer that connects user intent to real-world execution.
Why LAMs Matter — And What’s Next
The emergence of Large Action Models represents more than incremental progress; it signals a paradigm shift: from AI as text-based assistants to AI as autonomous agents capable of real-world action. As argued in the paper, this shift is a critical step toward more general, capable, and useful AI — and toward building systems that can operate in real environments, bridging language and action.
That said, Large Action Models remain in early stages. There are real challenges: collecting high-quality action data, building robust evaluation frameworks, ensuring safety and governance, preventing unintended consequences, ensuring generalization beyond training environments, and dealing with privacy and security concerns.
The path forward will likely involve hybrid approaches (neuro-symbolic reasoning, modular tool integrations), rigorous benchmarking, human-in-the-loop oversight, and careful design of agent architectures.
Conclusion
Large action models chart a compelling path forward. They build on the strengths of LLMs (natural language understanding, context-aware reasoning) while bridging a key gap: the ability to act. For anyone building real-world AI agents, from enterprise automation to productivity tools to customer-facing systems, Large Action Models offer a blueprint for transforming AI from passive suggestions into autonomous action.
LAMs are not “magic” — they are a powerful framework under active research, offering a rigorous way forward for action-oriented AI. As data scientists and engineers, staying informed and understanding both their potential and limitations will be key to designing the next generation of autonomous agents.
Ready to build robust and scalable LLM Applications? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.
In the first part of this series, we dug into why the KV cache exists, why it matters, and why it dominates the runtime characteristics of LLM inference. In Part 2, we’re going deeper into the systems-level issues that the KV cache introduces, particularly memory fragmentation and how this motivated the design of a new memory architecture for attention: paged attention, the foundational idea behind the vLLM inference engine.
Before diving into paged attention, you may want to revisit Part 1 of this series, where we unpack the fundamentals of the KV cache and why it dominates LLM memory behavior: KV Cache — How to Speed Up LLM Inference.
This post has one objective: make the vLLM paper’s ideas feel intuitive. The original work is dense (and excellent), but with the right framing, paged attention is actually a very natural idea — almost inevitable in hindsight. It’s essentially applying well-established operating systems concepts (paging, copy-on-write, block tables) to the KV cache problem that LLMs face. Once you see it, you can’t unsee it.
Let’s start with the root cause.
The Real Problem: KV Cache Is Huge, Dynamic, and Unpredictable
The KV cache holds one Key vector and one Value vector for every token in a sequence, across every layer and every attention head. For a typical 7B–13B model, this quickly grows into hundreds of megabytes per request. But the real challenge isn't just size; it's unpredictability.
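A quick back-of-the-envelope calculation shows why. Assuming a 13B-class model with roughly 40 layers, a hidden size of 5120, and K/V stored in fp16:

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, dtype_bytes: int = 2) -> int:
    # One K vector and one V vector of length hidden_size per layer.
    return 2 * num_layers * hidden_size * dtype_bytes

per_token = kv_bytes_per_token(num_layers=40, hidden_size=5120)         # 819,200 bytes
print(f"{per_token / 2**20:.2f} MiB per token")                         # ~0.78 MiB
print(f"{per_token * 1000 / 2**30:.2f} GiB for a 1,000-token request")  # ~0.76 GiB
```

Multiply that by tens or hundreds of concurrent requests and the KV cache quickly becomes the dominant consumer of GPU memory.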
Different requests vary wildly in:
prompt length,
generation length,
decoding strategy (sampling, beam search),
number of branch paths,
when they finish.
An LLM serving system cannot know in advance how long a request will run or how many tokens it will eventually accumulate. Yet GPUs require strict, contiguous, pre-allocated tensor layouts for optimal kernel execution. This mismatch between a dynamic workload and static memory assumptions is the source of nearly all downstream problems.
The traditional answer is: “Allocate a single contiguous tensor chunk large enough for the maximum possible length.”
And this is where the trouble starts.
How Contiguous Allocation Breaks GPU Memory: Internal and External Fragmentation
To understand why this is harmful, picture the GPU memory as a long shelf. Each LLM request needs to reserve a large rectangular box for its KV cache, even if the request only ends up filling a fraction of the box. And since every box must be contiguous, the allocator cannot place a request’s box unless it finds one uninterrupted region of memory of the right size.
source: https://arxiv.org/pdf/2309.06180
This creates three distinct kinds of waste, all described in the vLLM paper:
1. Reserved but Unused Slots
If the system allocates space for 2,048 possible tokens, but the request only produces 600 tokens, the remaining 1,448 positions are permanently wasted for the lifetime of that request. These unused slots cannot be repurposed.
2. Internal Fragmentation
Even within a request’s reserved slab, the actual KV cache grows token-by-token. Since the final length is unknown until the request finishes, internal fragmentation is unavoidable — you always over-allocate.
The paper observes that many real-world requests only use 20–30% of their allocated capacity. That means 70–80% of the reserved memory is dead weight for most of the request’s lifetime.
3. External Fragmentation
Even worse, after many different requests have allocated and freed slabs of different sizes, the GPU memory layout ends up looking like Swiss cheese. The allocator may have plenty of free space in total, but not enough contiguous free space to fit a new request's slab.
This causes new requests to fail even though the GPU technically has enough memory in aggregate.
The vLLM paper measures that only 20–38% of the allocated KV cache memory is actually used in existing systems. That’s an astonishingly low utilization for the largest memory component in LLM inference.
source: https://arxiv.org/pdf/2309.06180
This is the core problem: even before we run out of computation or bandwidth, we run out of contiguous GPU memory due to fragmentation.
Fine-Grained Batching: A Great Idea That Accidentally Worsens Memory Pressure
Before paged attention arrived, researchers attempted to improve throughput using smarter batching, most notably fine-grained, iteration-level (continuous) batching.
These mechanisms work at token-level granularity instead of request-level granularity. Instead of waiting for entire requests to complete before adding new ones, the server can add or remove sequences each decoding iteration. This dramatically improves compute utilization because it keeps the GPU busy with fresh work every step.
In fact, iteration-level batching is almost required for modern LLM serving: it avoids the inefficiency where one long-running request delays the whole batch.
But here’s the catch that the vLLM paper highlights:
Fine-grained batching increases the number of concurrently active sequences.
And therefore:
Every active sequence maintains its own full, contiguous KV cache slab.
So while compute utilization goes up, memory pressure skyrockets.
If you have 100 active sequences simultaneously interleaved at the decoding step, you effectively have 100 large, partially empty, but reserved KV cache slabs sitting in memory. Fragmentation becomes even worse, and the chance of running out of contiguous space increases dramatically.
In other words:
Fine-grained batching solves the compute bottleneck but amplifies the memory bottleneck.
The system becomes memory-bound, not compute-bound.
This brings us to the core insight in the vLLM paper.
“Why not treat the KV cache like an operating system treats virtual memory?”
In other words:
Break memory into fixed-size blocks (like OS pages).
Each block stores KV vectors for a small number of tokens (e.g., 16 tokens).
Maintain a mapping from logical blocks (the sequence’s view) to physical blocks (actual GPU memory).
Blocks can live anywhere in GPU memory — no need for contiguous slabs.
Blocks can be shared across sequences.
Use copy-on-write to handle divergence.
Reclaim blocks immediately when sequences finish.
This block-based KV representation is what the paper names paged attention.
source: https://arxiv.org/pdf/2309.06180
You might think: “Doesn’t attention require K and V to be in one contiguous array?” Mathematically, no — attention only needs to iterate over all previous K/V vectors. Whether those vectors live contiguously or are chunked into blocks is irrelevant to correctness.
This means we can rewrite attention in block form. For each block:
read its Keys
compute dot-product scores with the Query
apply softmax normalization
read block’s Values
accumulate outputs
The underlying math is identical; only the memory layout changes.
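The toy NumPy sketch below is only meant to demonstrate that equivalence; a real kernel fuses these steps and uses an online softmax instead of materializing all scores at once:

```python
import numpy as np

def paged_attention_reference(q, k_blocks, v_blocks):
    """q: (d,); k_blocks/v_blocks: lists of (block_len, d) arrays scattered anywhere in memory."""
    d = q.shape[0]
    scores = np.concatenate([kb @ q / np.sqrt(d) for kb in k_blocks])  # per-block dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                           # softmax over every past token
    out, start = np.zeros(d), 0
    for vb in v_blocks:                                                # accumulate weighted values block by block
        out += weights[start:start + len(vb)] @ vb
        start += len(vb)
    return out

rng = np.random.default_rng(0)
k, v, q = rng.normal(size=(40, 64)), rng.normal(size=(40, 64)), rng.normal(size=64)
blocked = paged_attention_reference(q, np.array_split(k, [16, 32]), np.array_split(v, [16, 32]))
contiguous = paged_attention_reference(q, [k], [v])
assert np.allclose(blocked, contiguous)  # identical math, different memory layout
```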
Paged attention eliminates the need for large contiguous slabs. Each sequence grows its KV cache block-by-block, and each block can be placed anywhere in GPU memory. There is no long slab to reserve, so external fragmentation largely disappears.
Internal fragmentation also collapses.
The only unused memory per sequence is inside its final partially filled block — at most the space for block_size − 1 tokens. If the block size is 16 tokens, the maximum internal waste is 15 tokens. Compare that to 1,000+ tokens wasted in the old approach.
Reserved-but-unused memory disappears entirely.
There are no pre-allocated full-size slabs. Blocks are allocated on demand.
Memory utilization becomes extremely predictable.
For N tokens, the system allocates exactly ceil(N / block_size) blocks. Nothing more.
This is the same structural benefit that operating systems gain from virtual memory: the illusion of a large contiguous space, backed by small flexible pages underneath.
Logical Blocks, Physical Blocks, and Block Tables
The vLLM architecture uses a simple but powerful structure to track blocks:
Logical blocks: the sequence’s view of its KV cache
Physical blocks: actual GPU memory chunks
Block table: a mapping from logical indices to physical block IDs
This is visually similar to the page table in any OS textbook.
When a sequence generates tokens:
It writes K/V into the current physical block.
If the block fills up, vLLM allocates a new one and updates the table.
If two sequences share a prefix, their block tables point to the same physical blocks.
All of this is efficient because the attention kernel is redesigned to loop over blocks instead of a single contiguous tensor.
Sharing and Copy-on-Write: Why Paged Attention Helps Beam Search and Sampling
This is one of the most elegant parts of the paper.
When doing:
beam search, or
parallel sampling, or
agentic branching
many sequences share long prefixes.
Under the traditional contiguous layout, you either:
duplicate the KV cache for each branch (expensive), or
compromise batch flexibility (restrictive).
With paged attention:
multiple sequences simply reference the same physical blocks,
and only when a sequence diverges do we perform copy-on-write at the block level.
Copying one block is far cheaper than copying an entire slab. This leads to substantial memory savings — the paper reports that shared prefixes during beam search reduce KV memory usage by up to 55% in some scenarios.
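As a mental model only (real vLLM manages GPU tensors, reference counts, and eviction inside CUDA-aware allocators), the block table, on-demand allocation, prefix sharing, and copy-on-write described above can be sketched like this:

```python
BLOCK_SIZE = 16

class BlockAllocator:
    """Toy allocator: physical blocks are just integer IDs with reference counts."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount: dict[int, int] = {}

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, block: int) -> int:
        self.refcount[block] += 1       # prefix sharing: another sequence points at the same block
        return block

    def copy_on_write(self, block: int) -> int:
        if self.refcount[block] == 1:   # sole owner: safe to write in place
            return block
        self.refcount[block] -= 1       # shared: the diverging sequence gets its own copy
        return self.allocate()          # (a real system would also copy the block's K/V data)

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []   # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:                    # last block full (or first token):
            self.block_table.append(self.allocator.allocate())   # allocate on demand, anywhere in memory
        else:                                                    # writing into the last block:
            self.block_table[-1] = self.allocator.copy_on_write(self.block_table[-1])
        self.num_tokens += 1
```

A forked sequence (a new beam, for example) simply copies the parent's block table and calls fork() on each shared block; only when it later writes into one of those shared blocks does copy_on_write hand it a private copy of that single block.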
How the Paged Attention Kernel Works (Intuitive View)
Even though the memory layout changes, the math of attention remains untouched.
Here’s the intuitive flow inside the kernel:
Take the Query for the new token.
Loop over each logical block of previous tokens.
For each block:
Look up the physical block address through the block table.
Load the Keys in that block.
Compute attention scores (Q · Kᵀ).
Load the Values in that block.
Multiply and accumulate.
Normalize across all blocks.
Produce the final attention output.
Kernel optimizations in the paper include:
fused reshape-and-write kernels for block writes,
block-aware attention kernels,
efficient memory coalescing strategies,
minimizing per-block overhead.
While the block-aware kernels are slightly slower than fully contiguous ones, the system throughput increases dramatically because vLLM can batch far more requests simultaneously.
Paging Enables Swapping and Recomputing KV Blocks
Once KV data is broken into blocks, vLLM gains a capability that is nearly impossible with contiguous slabs: flexible eviction policies.
If GPU memory is full, vLLM can:
swap blocks to CPU memory,
drop blocks entirely and recompute them later if needed, or
evict an entire sequence's blocks immediately when it finishes.
The paper notes that recomputation can be faster than swapping small blocks over PCIe for certain workloads — an insight that wouldn’t be possible without block-level memory.
source: https://arxiv.org/pdf/2309.06180
This is a fundamental shift in how LLM serving systems deal with memory pressure.
Why Block Size Matters
Block size is the main tuning knob. A smaller block size:
reduces internal fragmentation,
increases sharing granularity,
but increases kernel overhead,
and increases the number of memory lookups.
A larger block size:
improves kernel efficiency,
but wastes more memory.
The vLLM authors test many configurations and find that around 16 tokens per block strikes a balance. But workloads differ, and this is a tunable dimension in future variations of paged attention.
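For the memory side of that trade-off, here is a rough, purely illustrative calculation (it ignores kernel overhead, which has to be measured empirically, and assumes the final block ends up about half full on average):

```python
def expected_internal_waste(block_size: int, avg_seq_len: int) -> float:
    """Average fraction of allocated KV slots left unused under the half-full-last-block assumption."""
    wasted_tokens = block_size / 2
    allocated_tokens = ((avg_seq_len + block_size - 1) // block_size) * block_size
    return wasted_tokens / allocated_tokens

for bs in (8, 16, 32, 128):
    print(f"block_size={bs:3d}: ~{expected_internal_waste(bs, avg_seq_len=600):.1%} wasted")
```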
Paged Attention vs. Traditional Systems: Throughput Gains
While paged attention increases per-kernel latency (~20–26% overhead), the end-to-end throughput improves by 2–4× because:
batches become much larger,
memory is no longer the bottleneck,
iteration-level scheduling can run without exploding memory use,
shared prefixes do not duplicate KV cache,
requests no longer fail due to lack of contiguous space.
This is the core result: paged attention trades tiny per-kernel overhead for massive system-wide gains.
The beauty of paged attention is that it doesn’t try to fight the GPU or the attention kernel. Instead, it sidesteps the original constraint entirely.
Traditional systems try to squeeze dynamic workloads into rigid, contiguous layouts and then fight the consequences (compaction, large reservations, fragmentation). Paged attention flips the model: accept that token sequences grow unpredictably, and design memory as though you were building a small operating system for the KV cache.
Once you see it through that lens, the entire design becomes obvious:
block tables
shared blocks
copy-on-write
demand-based allocation
block-level eviction
block-level recomputation
fragmentation elimination
higher effective batch sizes
Paged attention is the kind of engineering idea that feels both novel and inevitable.
Practical Lessons for Engineers Using Paged Attention
If you’re building LLM services or agentic systems, here are some practical takeaways:
Measure how much of your KV cache memory is actually used. Traditional systems waste the majority of it.
If your batch sizes are small because of memory, paged attention will help dramatically.
If you rely on beam search, multi-sampling, or agent branching, block-level prefix sharing is a huge win.
If you use iteration-level scheduling, you need a KV cache representation that doesn’t explode memory.
Understand block size trade-offs (paging is not free; kernel overhead exists).
Consider recomputation as a valid alternative to swapping for certain workloads.
Conclusion: Paged Attention as the New Default Mental Model
Paged attention is not just another incremental optimization. It is a new lens for thinking about how KV cache memory should be managed in autoregressive models.
The math of attention stays the same. What changes is everything around it — the memory layout, the allocator, the scheduler, and the way prefix sharing works. The payoff is enormous: far less waste, far more flexibility, and significantly higher throughput. In many ways, paged attention is to KV memory what virtual memory was to general-purpose computing: a foundational concept that unlocks better utilization of hardware resources.
If you’re serving LLMs in production — or building agentic systems that rely on long-running multi-step reasoning — paged attention is now a core idea you should keep in your toolkit.
Ready to build robust and scalable LLM Applications? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.
Stay tuned — this is part of a 3-part deep-dive series. In the upcoming posts, we’ll unpack Radix Attention, and a few other emerging techniques that push context efficiency even further. If you’re serious about building fast, scalable LLM systems, you’ll want to check back in for the next installments.
If you’ve spent any time experimenting with large language models (LLMs), you’ve likely encountered terms like queries, keys, and values—the building blocks of transformer attention. Understanding these concepts is the first step toward appreciating a powerful optimization called the KV cache, which is essential for both fast inference and cost efficiency. Today, we’re going to take a deep dive into what KV cache is, why it matters, and how you can optimize your prompts to take full advantage of it. And to take things even further, this post kicks off a 3-part series on modern attention-efficiency techniques. By the end, you’ll not only understand KV cache deeply—you’ll also be ready for the next installments where we break down newer methods like Paged Attention, Radix Attention, and a few emerging ideas reshaping long-context LLMs.
Queries, Keys, and Values: The Building Blocks of Attention
Before we can talk intelligently about KV cache, we need to revisit the basics of attention. In a transformer, every token you feed into the model generates three vectors: Q (query), K (key), and V (value). Each of these plays a distinct role in determining how the model attends to other tokens in the sequence.
Query (Q) represents the token that's "asking" for information. It's like a search request: what does this token want to know from the context?
Key (K) represents the token that’s being “indexed.” It’s like the label on a piece of information that queries can match against.
Value (V) represents the content or information of the token itself. Once a query finds a matching key, the corresponding value is returned as part of the output.
source: Medium
Mathematically, attention computes a score between the query and all keys using a dot product, applies a softmax to turn scores into weights, and then calculates a weighted sum of the values. This weighted sum becomes the output representation for that token. Step by step, it looks like this:
Compute Q = XW_Q, K = XW_K, and V = XW_V for all tokens X.
Compute attention scores: S = QKᵀ / √d_k
Apply softmax to get attention weights: A = softmax(S)
Compute the weighted sum of values: Output = AV
source: Medium
It’s elegant, but there’s a catch: during inference, the model repeats this process for every new token in an autoregressive manner, which can become extremely costly, especially for long sequences.
Here’s where the KV cache comes into play. When generating text token by token, the model recalculates keys and values for all previous tokens at each step. This repetition is wasteful because those previous K and V vectors don’t change; only the query for the new token is different. The KV cache solves this by storing K and V vectors for all previous tokens, allowing the model to reuse them instead of recomputing them.
Think of it this way: if you’re reading a long document and want to summarize it sentence by sentence, you wouldn’t reread the first paragraphs every time you process the next sentence. KV cache lets the model “remember” what it already processed.
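If you use the Hugging Face transformers library, this reuse is exposed through use_cache and past_key_values; the snippet below (with gpt2 standing in as a small example model) shows the prefill step followed by one cached decoding step:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works; gpt2 is used here only because it is small
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt_ids = tok("Alice went to the market", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)       # prefill: K/V for the whole prompt computed once
    past = out.past_key_values                    # this is the KV cache

    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    out = model(next_id, past_key_values=past, use_cache=True)  # decode: only the new token is processed
```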
To appreciate the value of KV cache, it’s worth considering how it affects cost in practice. Many commercial LLM providers charge differently for tokens based on whether they hit the cache:
With Anthropic Claude, cached input tokens are far cheaper than uncached tokens. Cached tokens can cost as little as $0.30 per million tokens, whereas uncached tokens can cost up to $3 per million tokens—a 10x difference.
Similarly, in OpenAI’s GPT models, repeated prefixes in multi-turn chats benefit from KV caching, drastically reducing both time-to-first-token (TTFT) and inference costs.
This cost gap alone makes KV cache a critical optimization for anyone building production systems or agentic AI pipelines.
Today, many applications are more than simple Q&A models; they're agentic systems performing multiple steps of reasoning, tool usage, and observation. Consider an AI agent orchestrating a series of actions:
The agent receives a system prompt describing its objectives.
It ingests a user prompt.
It generates an output, executes actions, observes results, and logs observations.
The agent generates the next action based on all prior context.
In such multi-turn workflows, KV cache hit rate is extremely important. Every token in the prefix that can be reused reduces the compute needed for subsequent reasoning steps. Without caching, the model recalculates K/V for all past tokens at each step—wasting time, compute, and money.
Fortunately, if your context uses identical prefixes, you can take full advantage of KV cache. Whether you’re running a self-hosted model or calling an inference API, caching drastically reduces TTFT and inference costs.
Maximizing KV cache hit rate isn't magic; it's about structured, deterministic prompting. The team at Manus highlights several practical strategies for real-world AI agents in their blog "Context Engineering for AI Agents: Lessons from Building Manus" (Manus, 2025).
Here’s a summary of the key recommendations:
Keep your prompt prefix stable
Due to the autoregressive nature of LLMs, even a single-token difference can invalidate the KV cache from that point onward. A common example is including a timestamp at the beginning of the system prompt: while it allows the model to tell the current time, it completely kills cache reuse. Manus emphasizes that stable system prompts are critical for cache efficiency.
Make your context append-only
Avoid modifying previous actions or observations. Many programming languages and serialization libraries do not guarantee stable key ordering, which can silently break the cache if JSON objects or other structured data are rewritten. Manus recommends designing your agent’s context so that all new information is appended, leaving previous entries untouched.
Mark cache breakpoints explicitly
Some inference frameworks do not support automatic incremental prefix caching. In these cases, you need to manually insert cache breakpoints to control which portions of context are reused. Manus notes that these breakpoints should at minimum include the end of the system prompt and account for potential cache expiration.
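A minimal sketch of the first two habits, plus deterministic serialization (the structure is purely illustrative, not Manus's actual implementation):

```python
import json

# Stable prefix: no timestamps, request IDs, or other volatile values in the cached region.
SYSTEM_PROMPT = "You are a support agent. Follow the policies listed below."

def build_context(event_log: list[dict], new_event: dict) -> str:
    event_log.append(new_event)               # append-only: earlier entries are never rewritten
    serialized = [json.dumps(e, sort_keys=True) for e in event_log]  # deterministic key order
    return SYSTEM_PROMPT + "\n" + "\n".join(serialized)
```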
By following these structured prompting strategies, you maximize KV cache reuse, which leads to faster inference, lower costs, and more efficient multi-turn agent execution—lessons that the Manus team has validated through real-world deployments.
The Basics of LLM Inference: Prefill and Decoding
To understand why prompt caching (KV caching) is such a game-changer, it helps to first see what happens under the hood during LLM inference. Large language models generate text in two distinct phases:
1. Prefill – Understanding the Prompt
In this phase, the model processes the entire input prompt all at once. Each token in the prompt is converted into embeddings, and the model computes hidden states and attention representations across all tokens. These computations allow the model to “understand” the context and produce the first output token. Essentially, the prefill phase is the model setting the stage for generation.
2. Decoding – Generating Tokens Autoregressively
Once the first token is generated, the model enters the decoding phase. Here, it generates one token at a time, using all previous tokens (both the input prompt and already-generated tokens) as context. Each new token depends on the history of what’s been produced so far.
Step-by-Step Example: QKV Computation Without KV Cache
Suppose you have the tokens:
[Alice, went, to, the, market]
At token 5 (“market”), without KV cache:
Compute Q, K, V for “Alice” → store temporarily
Compute Q, K, V for “went”
Compute Q, K, V for “to”
Compute Q, K, V for “the”
Compute Q for “market” and recompute K, V for all previous tokens
Notice that K and V for the first four tokens are recomputed unnecessarily.
Step-by-Step Example: With KV Cache
With KV cache:
Compute Q, K, and V for each new token only once, as it is generated, and store K and V in the cache. At every subsequent step, only the new token's Q, K, and V are computed; the cached K and V for all previous tokens are simply reused.
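Here is a toy NumPy sketch of that difference: each new token's K and V are computed once and appended to the cache, and only the new token's query attends over the cached keys and values (random projection matrices stand in for a real model):

```python
import numpy as np

d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache: list[np.ndarray] = []          # the KV cache
v_cache: list[np.ndarray] = []

def decode_step(x_new: np.ndarray) -> np.ndarray:
    """Process one new token embedding: compute its Q/K/V once, reuse cached K/V for the rest."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)          # computed once, never recomputed
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                   # attention output for the new token

for token_embedding in rng.normal(size=(5, d)):   # e.g., "Alice went to the market"
    out = decode_step(token_embedding)
```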
Limitations of the KV Cache
While the KV cache provides significant improvements, it's not perfect:
Memory growth: K/V tensors grow linearly with context length. Long sequences can exhaust GPU memory.
Static cache structure: Simple caching doesn’t handle sliding windows or context truncation efficiently.
Inflexibility with multi-query attention: Models using multi-query attention can reduce KV memory but may require different caching strategies.
These limitations have driven research into more advanced attention techniques.
Beyond Simple KV Cache: Advanced Techniques
As models scale to longer contexts, the simple KV cache runs into practical limits—mainly GPU memory and the cost of attending to every past token. That’s why newer techniques like Paged Attention and Radix Attention were developed. They’re not replacements for KV caching but smarter ways of organizing and accessing cached tokens so the model stays fast even with huge context windows. We’ll break down each of these techniques in the upcoming blogs, so stay tuned for that.
1. Paged Attention
Paged attention divides the model’s context into discrete “pages” of tokens, similar to how a computer manages memory with virtual pages. Instead of keeping every token in GPU memory, only the pages relevant to the current generation step are actively loaded.
Memory efficiency: Older pages that are less likely to impact the immediate token prediction can be offloaded to slower storage (like CPU RAM or even disk) or recomputed on demand.
Scalability: This allows models to process very long sequences—think entire books or multi-hour dialogues—without exceeding memory limits.
Practical example: Imagine a multi-turn chatbot with a 20,000-token conversation history. With naive caching, the GPU memory would balloon as each new token is generated. With paged attention, only the most relevant pages (e.g., the last few turns plus critical context) remain in memory, while earlier parts are swapped out. The model still has access to the full history if needed but doesn’t carry the entire context in GPU memory at all times.
2. Radix Attention
Radix attention (introduced by the SGLang serving system) takes a different angle: rather than changing the attention computation itself, it reorganizes the KV cache so that it can be reused across requests. Cached prefixes are stored in a radix tree keyed by token sequences, so any new request that shares a prefix with an earlier one automatically reuses the corresponding K/V entries instead of recomputing them.
Automatic prefix reuse: shared system prompts, few-shot examples, and multi-turn histories hit the cache without manual breakpoint management, eliminating redundant prefill work.
Efficient cache management: the radix tree supports fine-grained insertion, lookup, and LRU-style eviction, so the cache stays effective even under memory pressure.
Ideal for agentic workflows: in systems where models must maintain reasoning across long, branching interactions—such as multi-step planning agents or memory-augmented AI—many calls share long prefixes, and radix attention lets them reuse that cached context without slowing down generation.
The KV cache is one of the simplest yet most powerful optimizations in modern LLM workflows. It transforms inference from repetitive, expensive computation into fast, cost-efficient generation. In the age of agentic AI—where models are performing multi-step reasoning, tool use, and long-term planning—maximizing KV cache hit rate is no longer optional; it’s foundational.
From a practical standpoint, following prompt engineering best practices—keeping your prefix stable, maintaining an append-only context, and using deterministic serialization—can unlock dramatic savings in compute, memory, and latency. Combined with emerging attention techniques like paged and radix attention, KV cache ensures that your LLM workflows remain both performant and scalable.
In other words, the KV cache isn’t just a nice-to-have; it’s the backbone of fast, efficient, and cost-effective LLM inference.
Ready to build robust and scalable LLM Applications? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.
Stay tuned — this is part of a 3-part deep-dive series. In the upcoming posts, we’ll unpack Paged Attention, Radix Attention, and a few other emerging techniques that push context efficiency even further. If you’re serious about building fast, scalable LLM systems, you’ll want to check back in for the next installments.
Artificial intelligence is no longer experimental infrastructure. It is core business infrastructure. The same way organizations matured cybersecurity, cloud strategy, and data governance over decades, AI now requires its own institutional backbone. This backbone is AI governance—a collection of controls, oversight mechanisms, accountability structures, and risk management protocols that ensures AI systems do not just perform, but perform responsibly.
Unlike traditional software, AI systems behave probabilistically. They evolve with data, generate unbounded outputs, influence decisions, and often interact directly with users. This changes the risk profile. If software fails, it breaks. If AI fails, it can discriminate, hallucinate, leak sensitive data, enable fraud, reinforce bias, or reduce human agency at scale. These are systemic risks, not isolated bugs. And unlike a single system outage, the reputational, regulatory, and competitive consequences can compound rapidly.
For CTOs, CIOs, and AI teams, the challenge is no longer “Can we build AI?” but “Can we govern AI well enough to deploy it safely, sustainably, and defensibly?” Multiple industries are already learning that the cost of deploying AI without strong AI governance is far higher than the cost of deploying it slowly.
This blog is a practical, executive-ready, engineering-aware AI governance checklist designed to move organizations from uncertain experimentation to mature, compliant, and scalable AI operations.
The Strategic Case for AI Governance
Organizations frequently misinterpret AI governance as a compliance checklist or legal risk requirement. It is both of those, but it’s also far more strategic. AI governance directly influences competitive advantage.
Organizations with weak AI governance eventually experience one or more of the following:
Production models that deteriorate silently due to drift, until failures become public
Non-standardized AI environments that create fragmentation across teams
Undocumented data sources that introduce liability and breach exposure
Procurement of third-party AI models without benchmarking, validation, or auditing
Public credibility damage due to algorithmic harm, bias, or unverified behavior
Stalled AI projects when legal, security, or compliance teams intervene too late
By contrast, organizations that operationalize AI governance early gain:
Faster deployment cycles because safety, compliance, and procurement are standardized
Fewer internal blockers between technical and regulatory teams
Higher confidence from partners, customers, and investors
A repeatable blueprint for responsible AI innovation
Reduced likelihood of catastrophic AI incidents
Mature AI governance becomes an enabler of innovation, not a restriction on it.
Who Owns AI Governance in an Organization?
Because AI touches every domain (data infrastructure, cybersecurity, compliance, product, ethics, automation, customer interaction, and regulatory reporting), AI governance cannot sit inside a single team. It must operate as a distributed ownership model:
| Role | Primary Responsibility in AI Governance |
| --- | --- |
| CTO | AI architecture, model evaluation, technical safeguards, deployment standards |
| CIO | IT policy, enterprise risk alignment, operational compliance |
| CISO | Security, threat modeling, adversarial risk, data protection |
| AI Lead / ML Engineering | Model quality, fairness testing, monitoring, retraining pipelines |
1. Governance Mandate & Risk Classification
Effective AI governance begins with classification, not adoption. Teams must first inventory what exists, what is being built, and what risks are already present.
Has your organization created a documented AI governance mandate?
Are AI objectives aligned with enterprise risk tolerance?
Are proposed AI use cases categorized by risk level (low, medium, high, critical)?
Do high-risk systems have mandatory human review and audit trails?
Is there an executive sponsor accountable for AI governance outcomes?
Are AI policies centrally accessible across departments?
A governance program is not a slide deck. It must translate to enforceable organizational behavior.
2. Policy Standards & Allowed Use Boundaries
Organizations need explicit “rules of play” for AI—especially generative and agentic systems.
Is there a documented acceptable use policy for AI?
Are restricted AI use cases clearly defined (medical diagnosis, autonomous action, legal advice, financial execution, surveillance, etc.)?
Do policies establish working principles such as safety, fairness, transparency, and privacy?
Are AI escalation paths defined for ethical violations?
Are exemptions, overrides, and approvals traceable and logged?
AI policy ambiguity always results in infrastructure chaos later.
3. Regulatory Mapping & Compliance Requirements
Different jurisdictions treat AI risk differently. AI governance must account for multi-region complexity.
Has your organization mapped applicable AI regulations (e.g., EU AI Act, sector regulations, data residency laws, consumer protection frameworks)?
Are compliance owners assigned per region?
Are model decisions auditable to meet explainability obligations?
Is regulatory change monitoring built into quarterly governance reviews?
Can the organization respond to a regulatory inquiry with evidence, artifacts, and model lineage?
4. Data Provenance, Consent & Lifecycle Governance
AI is a derivative of data quality and legality. Poor data governance becomes AI liability.
Is all training and production data legally sourced and documented?
Are sensitive fields tokenized, anonymized, or encrypted where required?
Is data lineage tracked from ingestion to inference?
Are retention schedules applied and enforced?
Can you produce evidence of consent for user-generated training data?
Are synthetic or augmented datasets documented as such?
Unverified or undocumented data propels AI projects into regulatory jeopardy.
Are new AI capabilities benchmarked before adoption?
Are employees continuously trained on responsible AI practices?
Are governance KPIs reported at the executive level?
source: aimultiple
What Prompted AI Governance Policies?
AI governance didn't emerge from compliance theory; it emerged from real-world consequences. As AI moved out of labs and into hospitals, banks, hiring systems, social platforms, and public services, the risks became too significant to ignore. Early deployments revealed biased decision-making in recruiting and lending, inaccurate facial recognition systems, and medical models trained on non-representative data. These failures demonstrated that AI could unintentionally discriminate, harm, and misinform at scale.
At the same time, the explosion of generative AI introduced new challenges: automated misinformation, hallucinated outputs presented as facts, intellectual property disputes, fraud at scale, and a general inability to trace how decisions were being produced. Organizations that deployed AI often lacked answers to fundamental questions such as Who is accountable when a model fails? Can decisions be audited? Was user data used with consent? The growing opacity around data usage and algorithmic decision-making also heightened public concern around privacy, fairness, and trust.
Governments, industry bodies, and enterprises recognized that AI was no longer just a technological innovation—it had become a societal and economic force that required guardrails. Policies and frameworks were ultimately driven by a convergence of urgency: real-world harm, erosion of public trust, absence of accountability, geopolitical AI competition, and the need to balance innovation with safety. AI governance was the response, a necessary shift from experimentation to responsible stewardship.
Who Actually Sets the Rules for AI Governance?
One of the most common assumptions about AI governance is that it is defined by a single authority, a universal standard, or a binding global rulebook. In reality, there is no central governing body for AI. Instead, AI governance is shaped by a layered ecosystem of regulators, standards bodies, industry alliances, individual organizations, and independent auditors. Each contributes a different piece of the puzzle—some legally enforceable, others voluntary but influential, and many operationally essential.
1. Governments and Regulatory Authorities
Governments are the only entities that can create legally binding AI rules. These rules typically focus on citizen protection, data rights, market competition, and high-risk AI usage.
Notable examples include:
European Union – EU AI Act (risk-based AI regulation), GDPR (data rights and consent)
United States – Executive orders on AI safety, NIST AI Risk Management Framework (widely adopted even though voluntary), state-level AI regulations emerging
China – Governance rules for recommendation algorithms, deep synthesis, and generative AI
Canada, UK, India, Brazil, Singapore – National AI strategies and evolving compliance requirements
Government regulation tends to answer questions like: Is this AI system safe? Is data usage lawful? Who is liable if harm occurs?
2. International Standards and Multilateral Organizations
These bodies do not always enforce laws, but they strongly influence how AI is built and audited worldwide by defining technical and ethical norms.
Key organizations include:
ISO/IEC – standards for AI risk management, transparency, bias controls, robustness
OECD – global principles for trustworthy AI used as a reference by policymakers
UNESCO – ethical AI recommendations adopted by 190+ countries
World Economic Forum (WEF) – governance frameworks for public and private sector alignment
These groups answer: What does responsible AI look like in practice? What should good governance quality standards be?
3. Industry Consortia and Research Institutions
While not regulators, industry coalitions often build the most practical frameworks that companies adopt before laws catch up.
Examples include:
Partnership on AI
Frontier Model Forum
OpenAI, Anthropic, Meta, Google DeepMind safety research divisions
Academic institutions publishing benchmark safety, alignment, and risk research
They influence governance by answering: How do we technically stress test models? What guardrails are possible? What are best practices for safe deployment?
4. Individual Organizations (Internal AI Governance Owners)
This is where AI governance becomes real, enforceable, and operational.
Companies are expected to define and implement:
What AI can and cannot be used for internally
How models are validated before deployment
Who signs off on high-risk AI rollouts
How bias, privacy, and security testing is performed
What happens when AI causes harm
This layer answers the most critical question: How do we ensure AI is safe inside our business, regardless of what regulators require?
5. Independent Auditors and Compliance Bodies
Once AI systems are deployed, independent reviewers validate whether governance claims hold up to scrutiny.
These include:
Third-party AI auditing firms
Security compliance reviewers (SOC 2, ISO 27001, etc.)
Sector-specific auditors in finance, healthcare, insurance, and public infrastructure
They answer: Can you prove your AI does what you claim? Where is the evidence?
The KPI Framework for AI Governance Success
A functional AI governance program must show impact, not existence.
| Category | Metric Examples |
| --- | --- |
| Compliance | % models with completed audits, % regulatory requests fulfilled within SLA |
| Monitoring | Drift detection time, alert response time, resolution SLA |
| Documentation | % models with model cards, traceable training provenance |
| Human oversight | % decisions reviewable by humans, override execution time |
| Security | Reduction in injection attempts, abuse detection coverage |
Final Thought: AI Governance Is a Business Multiplier
AI governance is not bureaucracy. It is competitive infrastructure. Companies without it move recklessly. Companies with it move confidently.
And confidence moves faster than experimentation alone.
If organizations treat AI governance as a compliance exercise, they will always feel constrained. If they treat it as an operational foundation, they become unstoppable—because they can scale intelligence without scaling risk.
The question for leaders today is not:
“How do we govern AI?”
It is:
“How quickly can we govern AI well enough to lead with it?”
Ready to build robust and scalable LLM Applications? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.
For much of the last decade, AI language models have been defined by a simple paradigm: input comes in, text comes out. Users ask questions, models answer. Users request summaries, models comply. That architecture created one of the fastest-adopted technologies in history — but it also created a ceiling.
Something fundamentally new is happening now.
LLMs are no longer just responding. They are beginning to act. They plan, evaluate, self-correct, call tools, browse the web, write code, coordinate with other AI, and make decisions over multiple steps without human intervention. These systems are not just conversational — they are goal-driven.
The industry now has a term for this new paradigm: agentic llm.
In 2025, the distinction between an LLM and an agentic llm is the difference between a calculator and a pilot. One computes. The other navigates.
What Is an Agentic LLM?
An agentic llm is a language model that operates with intent, planning, and action rather than single-turn responses. Instead of generating answers, it generates outcomes. It has the ability to:
Reason through multi-step problems
Act using tools, code, browsers, or APIs
Interact with environments, systems, and other agents
Evaluate itself and iterate toward better solutions
Agency means autonomy: the system can pursue a goal even when the path isn't explicit. The user defines the what, while the agent figures out the how.
Today's frontier systems are firmly moving in this direction.
Traditional LLM vs Agentic LLM
For years, we measured AI progress by how convincingly a model could sound intelligent. But intelligence that only speaks without acting is limited to being reactive. Traditional LLMs fall into this category: they are exceptional pattern matchers, but they lack continuity, intention, and agency. They wait for input, generate an answer, then reset. They don't evolve across interactions, don't remember outcomes, and don't take initiative unless instructed explicitly at every step.
The limitations become obvious when tasks require more than a single answer. Ask a traditional model to debug a system, improve through failure, or execute a multi-step plan, and you’ll notice how quickly it collapses into depending on you, the human, to orchestrate every stage. These models are dependent, not autonomous.
An agentic llm, on the other hand, doesn't just generate responses; it drives outcomes. It can reason through a plan, decide what tools it needs, execute actions, verify results, and adapt if something fails. Rather than being a sophisticated text interface, it becomes an active participant in problem solving.
Key difference in mindset:
Traditional LLMs optimize for the most convincing next sentence.
An agentic llm optimizes for the most effective next action.
The contrast in behavior:
| Traditional LLM | Agentic LLM |
| --- | --- |
| Waits for user instructions | Initiates next steps when given a goal |
| No memory across messages | Maintains state during and across tasks |
| Cannot execute real-world actions | Calls tools, runs code, browses, automates |
| Produces answers | Produces outcomes |
| Needs perfect prompting | Improves via iteration and feedback |
| Reacts | Plans, decides, and acts |
A good way to think about it: traditional LLMs are systems of language, while an agentic llm is a system of behavior.
The Three Pillars That Make an LLM Truly “Agentic”
source: https://arxiv.org/pdf/2503.23037
Agency doesn’t emerge just because a model is large or advanced. It emerges when the model gains three fundamental abilities — and an agentic llm must have all of them.
1. Reasoning — The ability to think before responding
Instead of immediately generating text, an agentic LLM evaluates the problem space first. This includes:
Breaking tasks into logical steps
Exploring multiple possible solutions internally
Spotting flaws in its own reasoning
Revising its approach before committing to an answer
Optimizing the decision path, not just the phrasing
This shift alone changes the user experience dramatically. Instead of a model that reacts, you interact with one that deliberates.
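To make this concrete, here is a minimal plan-critique-answer sketch in Python. The `call_model` function is a hypothetical placeholder for whichever LLM API you use; the point is the ordering of the calls, not any specific vendor interface.

```python
# Minimal "plan, critique, then answer" sketch.
# call_model() is a hypothetical stand-in for your LLM client of choice.

def call_model(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    raise NotImplementedError("Wire this up to your model provider.")

def deliberate_answer(task: str) -> str:
    # 1. Break the task into logical steps.
    plan = call_model(f"List the steps needed to solve:\n{task}")

    # 2. Ask the model to spot flaws in its own plan and revise it.
    revised_plan = call_model(
        f"Task: {task}\nDraft plan:\n{plan}\n"
        "Point out any flaws in this plan, then output a corrected plan."
    )

    # 3. Only now commit to a final answer, conditioned on the vetted plan.
    return call_model(f"Task: {task}\nFollow this plan:\n{revised_plan}\nAnswer:")
```

The model deliberates over a plan, critiques it, and only then produces the user-facing answer.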
2. Acting — The ability to do, not just describe
Reasoning becomes agency only when paired with execution. A true agentic LLM can:
Run code and interpret the output
Call APIs, trigger automations, or fetch real-time data
Write to databases or external memory stores
Navigate software interfaces or browsers
Modify environments based on goals
In other words, it moves from explaining how to actually doing.
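As an illustration only, the sketch below shows one common way this is wired up: the model emits a structured tool call, and surrounding code dispatches it against a small registry. The tool names and JSON shape here are assumptions for the example, not a standard.

```python
import json
import subprocess
from pathlib import Path

# Hypothetical tool registry: the only actions the agent is allowed to take.
TOOLS = {
    "run_python": lambda code: subprocess.run(
        ["python", "-c", code], capture_output=True, text=True
    ).stdout,
    "read_file": lambda path: Path(path).read_text(encoding="utf-8"),
}

def execute_tool_call(model_output: str) -> str:
    """Expects the model to emit JSON such as {"tool": "read_file", "argument": "notes.txt"}."""
    call = json.loads(model_output)
    tool = TOOLS[call["tool"]]      # look up the requested tool by name
    return tool(call["argument"])   # execute it and return the observation
```

In a real loop, the returned observation is fed back to the model so it can decide the next step.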
3. Interacting — The ability to collaborate and coordinate
Modern AI doesn’t operate in isolation. The most capable agentic LLM systems are designed to participate in multi-agent ecosystems, collaborating and coordinating with other agents rather than working alone.
4. Safe execution environments
Because these models take action, safe environments must exist where they can:
Run or test code
Interact with files
Execute tasks without damaging live systems
5. Feedback loops
To improve over time, an agentic LLM needs mechanisms that allow it to:
Evaluate success vs failure
Adjust strategies dynamically
Retain learnings for future tasks
Minimize repeated mistakes
Together, these components convert a powerful model into an autonomous problem-solving system.
source: Cobius Greyling & AI
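To make the feedback-loop component concrete, here is a minimal sketch in Python. It reuses the hypothetical `call_model` placeholder from the reasoning sketch above and assumes a `run_action` executor along the lines of the tool dispatcher shown earlier; none of this is a specific framework's API.

```python
def run_action(action: str) -> str:
    """Placeholder executor: dispatch a tool call, run code, hit an API, etc."""
    raise NotImplementedError

def feedback_loop(goal: str, max_attempts: int = 3) -> str:
    """Try, evaluate, adjust: a minimal agentic feedback loop with retained learnings."""
    notes = ""   # lessons carried into the next attempt to minimize repeated mistakes
    result = ""
    for attempt in range(1, max_attempts + 1):
        action = call_model(f"Goal: {goal}\nLessons so far: {notes}\nNext action:")
        result = run_action(action)
        verdict = call_model(
            f"Goal: {goal}\nResult: {result}\nDid this succeed? Answer YES or NO, then explain."
        )
        if verdict.strip().upper().startswith("YES"):
            return result                                   # success: stop iterating
        notes += f"\nAttempt {attempt} failed: {verdict}"   # adjust strategy dynamically
    return result
```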
From Token Prediction to Decision-Making
Classic LLMs optimize for the most probable next word. Agentic LLMs optimize for the most probable successful outcome. This makes them a fundamentally different species of system.
Instead of asking:
“What is the best next token?”
They implicitly or explicitly answer:
“What sequence of actions maximizes goal success?”
This resembles human cognition:
System 1: fast, instinctive responses
System 2: slow, deliberate reasoning
Traditional LLMs approximate System 1. Agentic LLMs introduce System 2.
In practice, that System 2 capability shows up as the ability to:
Browse the web and extract structured insights autonomously
Write, run, and fix code without supervision
Trigger workflows, fill forms, or navigate software
Call external services with judgment
Coordinate multiple AI sub-agents
Learn from execution failures and retry intelligently
Generate new data from real interactions
Improve through simulated self-play or tool feedback
These models are evolving from interactive assistants to autonomous knowledge workers.
Agentic LLMs Currently Available in 2025
As the concept of an agentic LLM moves from theory to product, several high-profile models in 2025 demonstrate real-world adoption of reasoning, tool use, memory, and agency. Below are some of the leading models, along with their vendor, agentic features, and availability.
Claude 4 (Anthropic)
Anthropic’s Claude 4 family—including the Opus and Sonnet variants—was launched in 2025 and explicitly targets agentic use cases such as tool invocation, file access, extended memory, and long-horizon reasoning. These models support “computer use” (controlling a virtual screen, exploring software) and improved multi-step workflows, positioning Claude 4 as a full-fledged agentic LLM rather than a mere assistant.
Gemini 2.5 (Google / DeepMind)
Google’s Gemini series, particularly the 2.5 update, includes features such as large context windows, native multimodal input (text + image + audio), and integrated tool usage for browser navigation and document manipulation. As such, it qualifies as an agentic LLM by virtue of planning, tool invocation, and environment interaction.
Llama 4 (Meta)
Meta’s Llama 4 release in 2025 includes versions like “Scout” and “Maverick” that are multimodal and support extremely large context lengths. While more often discussed as a foundation model, Llama 4’s architecture is increasingly used to power agentic workflows (memory + tools + extended context), making it part of the agentic LLM category.
Grok 3 (xAI)
xAI’s Grok 3 (and its code- and agent-oriented variants) is built for interactive, tool-enabled use. With features like DeeperSearch, extended reasoning, large token context windows, and integration into the Azure/Microsoft ecosystem, Grok 3 is positioned as an agentic LLM in practice rather than simply a chat model.
Qwen 3 (Alibaba)
Alibaba’s Qwen series (notably Qwen 3) is open-licensed and supports multimodal input, enhanced reasoning, and “thinking” modes. While not always labelled explicitly as an agentic LLM by the vendor, its published parameters and tool-use orientation place it in that emerging class.
DeepSeek R1/V3 (DeepSeek)
DeepSeek’s R1 and V3 models (and particularly the reasoning-optimized variants) are designed with agentic capabilities in mind: tool usage, structured output, function calling, and multi-step workflows. Though less well known than the big vendors’ offerings, they exemplify the agentic LLM class in open-weight or semi-open formats.
Giving AI the ability to act introduces new safety challenges. The biggest risks include:
| Risk | Mitigation |
| --- | --- |
| Taking incorrect actions | Validate with external tools or constraints |
| Infinite loops | Step caps + runtime limits |
| Misusing tools | Restricted access + sandboxing |
| Unclear reasoning | Logged decision trails |
| Goal misalignment | Human review checkpoints |
The most effective agentic LLM is not the most independent — it is the one that is bounded, observable, and auditable.
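As an illustration of what "bounded, observable, and auditable" can look like in code, the sketch below combines a step cap, a runtime limit, a tool allow-list, and a logged decision trail. It reuses the hypothetical `call_model` and `TOOLS` placeholders from the earlier sketches and is not a production guardrail system.

```python
import json
import time

MAX_STEPS = 10                                # step cap: guards against infinite loops
MAX_SECONDS = 120                             # runtime limit
ALLOWED_TOOLS = {"read_file", "run_python"}   # restricted access / sandboxing

def run_bounded_agent(goal: str) -> list[dict]:
    trail = []                                # logged decision trail for auditability
    start = time.monotonic()
    for step in range(MAX_STEPS):
        if time.monotonic() - start > MAX_SECONDS:
            break
        decision = json.loads(call_model(f"Goal: {goal}\nReply with a JSON tool call:"))
        if decision["tool"] not in ALLOWED_TOOLS:
            trail.append({"step": step, "blocked": decision["tool"]})
            continue                          # refuse tools outside the allow-list
        result = TOOLS[decision["tool"]](decision["argument"])
        trail.append({"step": step, "action": decision, "result": result})
        if "DONE" in str(result):
            break                             # agent signals completion
    return trail                              # every decision is reviewable after the run
```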
The Future: From Copilots to AI Workforces
The trajectory is now clear:
| Era | AI Role |
| --- | --- |
| 2023 | LLM as chat assistant |
| 2024 | LLM as reasoning engine |
| 2025 | Agentic LLM as autonomous worker |
| 2026+ | Multi-agent AI organizations |
In the coming years, we’ll stop prompting single models and start deploying teams of interacting agentic LLMs that self-organize around goals.
In that world, companies won’t ask:
“Which LLM should we use?”
They’ll ask:
“How many AI agents do we deploy, and how should they collaborate?”
Conclusion — The Age of the Agentic LLM Is Here
The evolution of AI is no longer confined to smarter answers, faster responses, or larger parameter counts — the real transformation is happening at the level of autonomy, decision-making, and execution. For the first time, we are witnessing language models shift from being passive interfaces into active systems that can reason, plan, act, and adapt in pursuit of real objectives. This is what defines an agentic LLM, and it marks a fundamental turning point in how humans and machines collaborate.
Traditional LLMs democratized access to knowledge and conversation, but agentic LLMs are democratizing action. They don’t just interpret instructions — they carry them out. They don’t just answer questions — they solve problems across multiple steps. They don’t just generate text — they interact with systems, trigger workflows, evaluate outcomes, and refine their strategies based on feedback. Most importantly, they shift the burden of orchestration away from the user and onto the system itself, enabling AI to become not just a tool, but a partner in execution.
Yet, power always demands responsibility. As agentic LLMs become more capable, the need for guardrails, observability, validation layers, and human oversight grows even more critical. The goal is not to build the most autonomous model possible, but the most usefully autonomous one—an agent that can operate independently while remaining aligned, auditable, and safe. The future belongs not to the models that act the fastest, but to the ones that act the most reliably and explainably.
Ready to build robust and scalable LLM Applications? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.
Refrag is the latest innovation from Meta Superintelligence Labs, designed to supercharge retrieval-augmented generation (RAG) systems. As large language models (LLMs) become central to enterprise AI, the challenge of efficiently processing long-context inputs, especially those packed with retrieved knowledge, has grown significantly.
Refrag tackles this problem head-on. It introduces a new way to represent, compress, and retrieve information, offering up to 30× acceleration in time-to-first-token (TTFT) and 16× context window expansion, all without compromising accuracy or reliability.
What is Retrieval-Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is a technique that enhances large language models by connecting them to external knowledge sources. Instead of relying solely on their internal parameters, RAG models retrieve relevant documents, passages, or data snippets from external corpora to ground their responses in factual, up-to-date information.
The pipeline typically works in two steps:
A retriever searches an external database for the top-k relevant documents.
The retrieved text is concatenated with the original query and sent to the LLM for generation.
This approach reduces hallucinations, improves factual grounding, and enables models to adapt quickly to new or domain-specific information—without expensive retraining.
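In code, the basic retrieve-then-generate pattern fits in a few lines. The sketch below is generic: `retrieve` stands in for whatever vector store or search index you use, and `call_model` for your LLM client; neither is part of Refrag.

```python
def retrieve(query: str, k: int = 5) -> list[str]:
    """Placeholder retriever: in practice this queries a vector store or search index."""
    raise NotImplementedError

def call_model(prompt: str) -> str:
    """Placeholder for an LLM API call."""
    raise NotImplementedError

def rag_answer(query: str) -> str:
    passages = retrieve(query, k=5)            # step 1: fetch the top-k relevant passages
    context = "\n\n".join(passages)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return call_model(prompt)                  # step 2: generate a grounded answer
```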
However, the process comes at a cost. RAG systems often feed very long contexts into the model, and as these contexts grow, computational complexity explodes.
The Bottleneck: Long Contexts in LLMs
Modern transformers process input sequences using an attention mechanism, where every token attends to every other token in the sequence. This operation scales quadratically with sequence length. In practice, doubling the input length can quadruple compute and memory requirements.
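The quadratic claim is easy to sanity-check with back-of-the-envelope arithmetic. The snippet below only counts token-pair interactions and ignores constant factors, so it illustrates the scaling trend rather than real FLOP counts.

```python
def attention_pairs(seq_len: int) -> int:
    """Every token attends to every other token, so work grows with seq_len squared."""
    return seq_len * seq_len

print(attention_pairs(8192) / attention_pairs(4096))  # 4.0: doubling length roughly quadruples the work
```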
For RAG applications, this creates several bottlenecks:
Increased latency: The model takes longer to generate the first token (TTFT).
High memory usage: Large key-value (KV) caches are needed to store token representations.
Reduced throughput: Fewer parallel requests can be processed at once.
Scalability limits: Context length constraints prevent using extensive retrieved data.
Worse, not all retrieved passages are useful. Many are marginally relevant, yet the model still expends full computational effort to process them. This inefficiency creates a trade-off between knowledge richness and system performance, a trade-off Refrag is designed to eliminate.
Why Refrag Matters for the Future of RAG
Traditional RAG pipelines prioritize retrieval precision but neglect representation efficiency. Meta recognized that while retrieval quality had improved, context handling had stagnated. Large contexts were becoming the single biggest latency bottleneck in real-world AI systems—especially for enterprises deploying production-scale assistants, search engines, and document analyzers.
Refrag redefines how retrieved knowledge is represented and processed. By encoding retrieved text into dense chunk embeddings and selectively deciding what information deserves full attention, it optimizes both speed and accuracy, bridging the gap between compactness and completeness.
Refrag introduces a modular, plug-and-play framework built on four key pillars: context compression, selective expansion, efficient decoding, and architectural compatibility.
1. Context Compression via Chunk Embeddings
Refrag employs a lightweight encoder that divides retrieved passages into fixed-size chunks—typically 16 tokens each. Every chunk is then compressed into a dense vector representation, also known as a chunk embedding.
Instead of feeding thousands of raw tokens to the decoder, the model processes a much shorter sequence of embeddings. This reduces the effective input length by up to 16×, leading to massive savings in computation and memory.
This step alone dramatically improves efficiency, but it introduces the risk of information loss. That’s where Refrag’s reinforcement learning (RL) policy comes in.
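For intuition, here is a rough sketch of the chunk-and-compress step, using the 16-token chunk size mentioned above. The `encode_chunk` function is a stand-in for Refrag's lightweight encoder, not its actual interface.

```python
CHUNK_SIZE = 16  # tokens per chunk, as described in the paper

def encode_chunk(tokens: list[int]) -> list[float]:
    """Placeholder for the lightweight encoder that maps one chunk to one dense vector."""
    raise NotImplementedError

def compress_context(token_ids: list[int]) -> list[list[float]]:
    # Split the retrieved text into fixed-size chunks...
    chunks = [token_ids[i:i + CHUNK_SIZE] for i in range(0, len(token_ids), CHUNK_SIZE)]
    # ...and replace each chunk with a single embedding, cutting the decoder's
    # effective input length by up to 16x.
    return [encode_chunk(chunk) for chunk in chunks]
```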
2. Selective Expansion with Reinforcement Learning
Not all tokens can be compressed safely. Some contain critical details—numbers, named entities, or unique terms that drive the model’s reasoning.
Refrag trains a reinforcement learning policy that identifies these high-information chunks and allows them to bypass compression. The result is a hybrid input sequence:
Dense chunk embeddings for general context.
Raw tokens for critical information.
This selective expansion preserves essential semantics while still achieving large-scale compression. The RL policy is guided by reward signals based on model perplexity and downstream task accuracy.
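Sketched in the same style as above (with `policy_score` standing in for the learned RL policy and a made-up threshold), the hybrid input might be assembled like this:

```python
def policy_score(chunk: list[int]) -> float:
    """Placeholder for the RL policy's estimate of how information-dense a chunk is."""
    raise NotImplementedError

def build_hybrid_input(token_ids: list[int], threshold: float = 0.8) -> list[tuple]:
    hybrid = []
    for i in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[i:i + CHUNK_SIZE]
        if policy_score(chunk) >= threshold:
            hybrid.append(("raw_tokens", chunk))               # critical details bypass compression
        else:
            hybrid.append(("embedding", encode_chunk(chunk)))  # general context is compressed
    return hybrid
```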
3. Efficient Decoding and Memory Utilization
By shortening the decoder’s input sequence, Refrag minimizes quadratic attention costs. The decoder no longer needs to attend to thousands of raw tokens; instead, it focuses on a smaller set of compressed representations.
This architectural shift leads to:
30.85× faster TTFT (time-to-first-token)
6.78× improvement in throughput compared to LLaMA baselines
16× context window expansion, enabling models to reason across entire books or multi-document corpora
In practical terms, this means that enterprise-grade RAG systems can operate with lower GPU memory, reduced latency, and greater scalability—all while maintaining accuracy.
4. Plug-and-Play Architecture
A standout advantage of Refrag is its compatibility. It doesn’t require modifying the underlying LLM. The encoder operates independently, producing pre-computed embeddings that can be cached and reused.
This plug-and-play design allows seamless integration with popular architectures like LLaMA, RoBERTa, and OPT—enabling organizations to upgrade their RAG pipelines without re-engineering their models.
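Because the chunk embeddings depend only on the retrieved text, they can be computed once and reused across requests. A minimal caching sketch, building on the hypothetical `compress_context` helper above:

```python
import hashlib

_embedding_cache: dict[str, list[list[float]]] = {}

def cached_compress(token_ids: list[int]) -> list[list[float]]:
    """Reuse pre-computed chunk embeddings for passages that have been seen before."""
    key = hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = compress_context(token_ids)  # only encode unseen passages
    return _embedding_cache[key]
```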
Although Refrag demonstrates remarkable gains, several open challenges remain:
Generalization across data domains: How well does the RL chunk selector perform on heterogeneous corpora such as code, legal, and multimodal data?
Limits of compression: What is the theoretical compression ceiling before semantic drift or factual loss becomes unacceptable?
Hybrid architectures: Can Refrag be combined with prompt compression, streaming attention, or token pruning to further enhance efficiency?
End-to-end optimization: How can retrievers and Refrag encoders be co-trained for domain-specific tasks?
Meta has announced plans to release the source code on GitHub under the repository facebookresearch/refrag, inviting the global AI community to explore, benchmark, and extend its capabilities.
FAQs
Q1. What is REFRAG?
It’s Meta’s decoding framework for RAG systems, compressing retrieved passages into embeddings for faster, longer-context LLM inference.
Q2. How much faster is REFRAG?
Up to 30.85× faster TTFT and 6.78× throughput improvement compared to LLaMA baselines.
Meta’s Refrag marks a transformative leap in the evolution of retrieval-augmented generation. By combining compression intelligence, reinforcement learning, and context expansion, it finally makes large-context LLMs practical for real-world, latency-sensitive applications.
For enterprises building retrieval-heavy systems, from customer support to scientific research assistants, Refrag offers a path toward faster, cheaper, and smarter AI.
Ready to implement RAG and Refrag in your enterprise? Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.