
Part 2 — Memory Is the Real Bottleneck: How Paged Attention Powers the vLLM Inference Engine


In the first part of this series, we dug into why the KV cache exists, why it matters, and why it dominates the runtime characteristics of LLM inference. In Part 2, we’re going deeper into the systems-level issues that the KV cache introduces, particularly memory fragmentation and how this motivated the design of a new memory architecture for attention: paged attention, the foundational idea behind the vLLM inference engine.

Before diving into paged attention, you may want to revisit Part 1 of this series, where we unpack the fundamentals of the KV cache and why it dominates LLM memory behavior: KV Cache — How to Speed Up LLM Inference.

This post has one objective: make the vLLM paper’s ideas feel intuitive. The original work is dense (and excellent), but with the right framing, paged attention is actually a very natural idea — almost inevitable in hindsight. It’s essentially applying well-established operating systems concepts (paging, copy-on-write, block tables) to the KV cache problem that LLMs face. Once you see it, you can’t unsee it.

Let’s start with the root cause.

The Real Problem: KV Cache Is Huge, Dynamic, and Unpredictable

The KV cache holds one Key vector and one Value vector for every token in a sequence, across every layer and every attention head. For a typical 7B–13B model, this quickly grows into hundreds of megabytes per request. But the real challenge isn’t just size, it’s unpredictability.
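
To make "hundreds of megabytes" concrete, here is a back-of-the-envelope sketch. The model shape (32 layers, 32 heads, head dimension 128, fp16) is an assumption chosen to resemble a 7B-class model, not a specific checkpoint:

```python
# Back-of-the-envelope KV cache size for a hypothetical 7B-class model
# (32 layers, 32 heads, head dim 128, fp16). Numbers are illustrative.

def kv_cache_bytes(num_tokens, num_layers=32, num_heads=32, head_dim=128, dtype_bytes=2):
    # One Key and one Value vector per token, per layer, per head.
    return 2 * num_tokens * num_layers * num_heads * head_dim * dtype_bytes

print(kv_cache_bytes(1) / 2**20)    # ~0.5 MiB per token
print(kv_cache_bytes(600) / 2**30)  # ~0.29 GiB (about 300 MiB) for a 600-token sequence
```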

Different requests vary wildly in:

  • prompt length,
  • generation length,
  • decoding strategy (sampling, beam search),
  • number of branch paths,
  • when they finish.

An LLM serving system cannot know in advance how long a request will run or how many tokens it will eventually accumulate. Yet GPUs require strict, contiguous, pre-allocated tensor layouts for optimal kernel execution. This mismatch between a dynamic workload and static memory assumptions is the source of nearly all downstream problems.

The traditional answer is:
“Allocate a single contiguous tensor chunk large enough for the maximum possible length.”

And this is where the trouble starts.

How Contiguous Allocation Breaks GPU Memory: Internal and External Fragmentation

To understand why this is harmful, picture the GPU memory as a long shelf. Each LLM request needs to reserve a large rectangular box for its KV cache, even if the request only ends up filling a fraction of the box. And since every box must be contiguous, the allocator cannot place a request’s box unless it finds one uninterrupted region of memory of the right size.

Average percentage of memory waste in LLM serving systems - Understanding Paged Attention
source: https://arxiv.org/pdf/2309.06180

This creates three distinct kinds of waste, all described in the vLLM paper:

1. Reserved but Unused Slots

If the system allocates space for 2,048 possible tokens but the request only ever produces 600, the remaining 1,448 slots sit idle for the request's entire lifetime and cannot be repurposed for any other sequence.
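
Using the rough ~0.5 MiB-per-token figure from the sketch above (an assumption, not a measurement), that is a lot of dead memory for a single request:

```python
# Illustrative waste for one request at ~0.5 MiB of KV cache per token
# (the assumption from the earlier 7B-class sketch).
reserved_tokens, used_tokens, mib_per_token = 2048, 600, 0.5
print((reserved_tokens - used_tokens) * mib_per_token)  # ~724 MiB reserved but never used
```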

2. Internal Fragmentation

Even within a request’s reserved slab, the actual KV cache grows token-by-token. Since the final length is unknown until the request finishes, internal fragmentation is unavoidable — you always over-allocate.

The paper observes that many real-world requests only use 20–30% of their allocated capacity. That means 70–80% of the reserved memory is dead weight for most of the request’s lifetime.


3. External Fragmentation

Even worse, after many different requests have allocated and freed slabs of different sizes, the GPU memory layout ends up looking like Swiss cheese. The allocator may have plenty of free space in total, but not enough contiguous free space to fit a new request’s slab.

This causes new requests to fail even though the GPU technically has enough memory in aggregate.

The vLLM paper measures that only 20–38% of the allocated KV cache memory is actually used in existing systems. That’s an astonishingly low utilization for the largest memory component in LLM inference.

KV cache memory management in existing systems - Paged Attention
source: https://arxiv.org/pdf/2309.06180

This is the core problem: even before we run out of computation or bandwidth, we run out of contiguous GPU memory due to fragmentation.

Fine-Grained Batching: A Great Idea That Accidentally Worsens Memory Pressure

Before paged attention arrived, researchers attempted to improve throughput using smarter batching. Two techniques are important here:

  • Cellular batching,
  • Iteration-level scheduling (which vLLM cites explicitly).

These mechanisms work at token-level granularity instead of request-level granularity. Instead of waiting for entire requests to complete before adding new ones, the server can add or remove sequences each decoding iteration. This dramatically improves compute utilization because it keeps the GPU busy with fresh work every step.

In fact, iteration-level batching is almost required for modern LLM serving: it avoids the inefficiency where one long-running request delays the whole batch.
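
Here is a rough sketch of what iteration-level scheduling looks like; the `Request` class, the `step_fn` interface, and all names are illustrative, not any real serving API:

```python
# Toy continuous-batching loop: admit and retire sequences every decoding
# iteration instead of once per batch. Purely illustrative.
from collections import deque
from dataclasses import dataclass, field

EOS = 0  # placeholder end-of-sequence token id

@dataclass
class Request:
    prompt: list
    generated: list = field(default_factory=list)
    max_new_tokens: int = 128

def serve(step_fn, waiting: deque, max_batch: int = 32):
    running = []
    while waiting or running:
        # Admit new requests at every iteration while there is room.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decoding step for all active sequences (step_fn is the model).
        next_tokens = step_fn(running)
        still_running = []
        for req, tok in zip(running, next_tokens):
            req.generated.append(tok)
            if tok != EOS and len(req.generated) < req.max_new_tokens:
                still_running.append(req)  # finished requests free their slot now
        running = still_running
```

Every sequence in `running` still carries its own KV cache, which is exactly where the trouble described next begins.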

But here’s the catch that the vLLM paper highlights:

Fine-grained batching increases the number of concurrently active sequences.

And therefore:

Every active sequence maintains its own full, contiguous KV cache slab.

So while compute utilization goes up, memory pressure skyrockets.

If you have 100 active sequences interleaved at each decoding step, you effectively have 100 big, mostly empty, but fully reserved KV cache slabs sitting in memory. Fragmentation becomes even worse, and the chance of running out of contiguous space rises dramatically.

In other words:

Fine-grained batching solves the compute bottleneck but amplifies the memory bottleneck.

The system becomes memory-bound, not compute-bound.

This brings us to the core insight in the vLLM paper.


Paged Attention: A Simple but Profound Idea

The vLLM authors ask a simple question:

“Why not treat the KV cache like an operating system treats virtual memory?”

In other words:

  • Break memory into fixed-size blocks (like OS pages).
  • Each block stores KV vectors for a small number of tokens (e.g., 16 tokens).
  • Maintain a mapping from logical blocks (the sequence’s view) to physical blocks (actual GPU memory).
  • Blocks can live anywhere in GPU memory — no need for contiguous slabs.
  • Blocks can be shared across sequences.
  • Use copy-on-write to handle divergence.
  • Reclaim blocks immediately when sequences finish.

This block-based KV representation is what the paper names paged attention.

Paged Attention Algorithm
source: https://arxiv.org/pdf/2309.06180

You might think: “Doesn’t attention require K and V to be in one contiguous array?”
Mathematically, no — attention only needs to iterate over all previous K/V vectors. Whether those vectors live contiguously or are chunked into blocks is irrelevant to correctness.

This means we can rewrite attention in block form. For each block of cached tokens:

  • read its Keys
  • compute dot-product scores with the Query
  • read its Values
  • weight and accumulate partial outputs

and finally normalize across all blocks.

The underlying math is identical; only the memory layout changes.
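
To make that concrete, here is a minimal NumPy sketch (single head, no batching, no real GPU paging; the block split is purely illustrative) showing that chunking K and V into blocks leaves the attention output unchanged:

```python
# Attention computed block-by-block vs. over one contiguous K/V array.
import numpy as np

def blocked_attention(q, k_blocks, v_blocks):
    # q: (d,); k_blocks / v_blocks: lists of (block_len, d) arrays
    scores = np.concatenate([kb @ q for kb in k_blocks]) / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over *all* blocks
    out, start = np.zeros_like(q), 0
    for vb in v_blocks:                           # accumulate block by block
        out += weights[start:start + len(vb)] @ vb
        start += len(vb)
    return out

rng = np.random.default_rng(0)
d, n, block = 64, 40, 16
k, v, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=d)
k_blocks = [k[i:i + block] for i in range(0, n, block)]
v_blocks = [v[i:i + block] for i in range(0, n, block)]

scores = k @ q / np.sqrt(d)                       # reference: contiguous attention
reference = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ v
assert np.allclose(blocked_attention(q, k_blocks, v_blocks), reference)
```

The real kernel fuses the normalization into the loop instead of materializing all scores at once, but the output is the same.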

From here, we get several huge benefits.


How Paged Attention Fixes Memory Fragmentation

Paged attention eliminates the need for large contiguous slabs. Each sequence grows its KV cache block-by-block, and each block can be placed anywhere in GPU memory. There is no long slab to reserve, so external fragmentation largely disappears.

Internal fragmentation also collapses.

The only unused memory per sequence is inside its final partially filled block — at most the space for block_size − 1 tokens. If the block size is 16 tokens, the maximum internal waste is 15 tokens. Compare that to 1,000+ tokens wasted in the old approach.

Reserved-but-unused memory disappears entirely.

There are no pre-allocated full-size slabs. Blocks are allocated on demand.

Memory utilization becomes extremely predictable.

For N tokens, the system allocates exactly ceil(N / block_size) blocks. Nothing more.
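
For the same 600-token request from earlier, with 16-token blocks and the assumed ~0.5 MiB per token:

```python
import math

tokens, block_size, mib_per_token = 600, 16, 0.5
blocks = math.ceil(tokens / block_size)                    # 38 blocks, allocated on demand
waste_mib = (blocks * block_size - tokens) * mib_per_token
print(blocks, waste_mib)                                   # 38 blocks, ~4 MiB of internal waste
```

Roughly 4 MiB of waste instead of the ~724 MiB of reserved-but-unused memory in the contiguous scheme.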

This is the same structural benefit that operating systems gain from virtual memory: the illusion of a large contiguous space, backed by small flexible pages underneath.

Logical Blocks, Physical Blocks, and Block Tables

The vLLM architecture uses a simple but powerful structure to track blocks:

  • Logical blocks: the sequence’s view of its KV cache
  • Physical blocks: actual GPU memory chunks
  • Block table: a mapping from logical indices to physical block IDs

This is directly analogous to the page table in any OS textbook.

When a sequence generates tokens:

  • It writes K/V into the current physical block.
  • If the block fills up, vLLM allocates a new one and updates the table.
  • If two sequences share a prefix, their block tables point to the same physical blocks.

All of this is efficient because the attention kernel is redesigned to loop over blocks instead of a single contiguous tensor.
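
As a mental model, here is a minimal allocator sketch, assuming a fixed BLOCK_SIZE and a free list of physical block IDs. It mirrors the description above; it is not vLLM's actual data structure:

```python
# Toy block table: maps each sequence's logical blocks to physical block
# IDs, allocating a new physical block only when the previous one fills.
BLOCK_SIZE = 16

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))  # pool of physical block IDs
        self.tables = {}    # seq_id -> list of physical block IDs (the block table)
        self.lengths = {}   # seq_id -> number of tokens written so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:              # last block full (or first token)
            table.append(self.free.pop())    # grab any free block, anywhere in memory
        self.lengths[seq_id] = n + 1
        return table[-1], n % BLOCK_SIZE     # where this token's K/V will be written

    def release(self, seq_id):
        # Sequence finished: all of its blocks go straight back to the pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Prefix sharing is then just two sequences whose block tables contain some of the same physical IDs, which is what the next section builds on.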

Sharing and Copy-on-Write: Why Paged Attention Helps Beam Search and Sampling

This is one of the most elegant parts of the paper.

When doing:

  • beam search, or
  • parallel sampling, or
  • agentic branching

many sequences share long prefixes.

Under the traditional contiguous layout, you either:

  • duplicate the KV cache for each branch (expensive), or
  • compromise batch flexibility (restrictive).

With paged attention:

  • multiple sequences simply reference the same physical blocks,
  • and only when a sequence diverges do we perform copy-on-write at the block level.

Copying one block is far cheaper than copying an entire slab. This leads to substantial memory savings — the paper reports that shared prefixes during beam search reduce KV memory usage by up to 55% in some scenarios.
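
Extending the toy block manager from above, block-level copy-on-write comes down to reference counts and one small check before each write (the names `fork`, `writable_block`, and `copy_block` are illustrative):

```python
# Toy copy-on-write over shared physical KV blocks.
class CowBlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.refcount = {}

    def allocate(self):
        block_id = self.free.pop()
        self.refcount[block_id] = 1
        return block_id

    def fork(self, parent_table):
        # A new beam / sample reuses the parent's physical blocks for the shared prefix.
        for block_id in parent_table:
            self.refcount[block_id] += 1
        return list(parent_table)              # new logical table, same physical blocks

    def writable_block(self, table, copy_block):
        # Call before writing the next token's K/V into the sequence's last block.
        block_id = table[-1]
        if self.refcount[block_id] > 1:        # shared with another sequence: copy first
            new_id = self.allocate()
            copy_block(src=block_id, dst=new_id)   # device-side copy of one small block
            self.refcount[block_id] -= 1
            table[-1] = new_id
            return new_id
        return block_id                        # exclusive owner: write in place
```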

How the Paged Attention Kernel Works (Intuitive View)

Even though the memory layout changes, the math of attention remains untouched.

Here’s the intuitive flow inside the kernel:

  1. Take the Query for the new token.

  2. Loop over each logical block of previous tokens.

  3. For each block:

    • Look up the physical block address through the block table.

    • Load the Keys in that block.

    • Compute attention scores (Q · Kᵀ).

    • Load the Values in that block.

    • Multiply and accumulate.

  4. Normalize across all blocks.

  5. Produce the final attention output.

Kernel optimizations in the paper include:

  • fused reshape-and-write kernels for block writes,

  • block-aware attention kernels,

  • efficient memory coalescing strategies,

  • minimizing per-block overhead.

While the block-aware kernels are slightly slower than fully contiguous ones, the system throughput increases dramatically because vLLM can batch far more requests simultaneously.

Paging Enables Swapping and Recomputing KV Blocks

Once KV data is broken into blocks, vLLM gains a capability that is nearly impossible with contiguous slabs: flexible eviction policies.

If GPU memory is full, vLLM can:

  • swap blocks to CPU memory,

  • drop blocks entirely and recompute them later if needed, or

  • evict an entire sequence's blocks at once and reclaim the memory immediately.

The paper notes that recomputation can be faster than swapping small blocks over PCIe for certain workloads — an insight that wouldn’t be possible without block-level memory.
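
If you want to reason about which option wins, it ultimately reduces to comparing two measured costs. The sketch below is only a framing device; none of the inputs come from the paper:

```python
# Toy preemption policy: swap a sequence's blocks out over PCIe, or drop
# them and recompute the prefix later. Both rates must be measured on
# your own hardware; nothing here is taken from the vLLM paper.
def prefer_recompute(num_blocks, block_bytes, swap_bytes_per_s, prefill_tokens_per_s, block_size=16):
    swap_seconds = num_blocks * block_bytes / swap_bytes_per_s            # transfer time to CPU
    recompute_seconds = num_blocks * block_size / prefill_tokens_per_s    # one batched prefill pass
    return recompute_seconds < swap_seconds
```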

Block table translation in vLLM - Paged Attention
source: https://arxiv.org/pdf/2309.06180

This is a fundamental shift in how LLM serving systems deal with memory pressure.

Why Block Size Matters

Block size is the main tuning knob. A smaller block size:

  • reduces internal fragmentation,
  • increases sharing granularity,
  • but increases kernel overhead,
  • and increases the number of memory lookups.

A larger block size:

  • improves kernel efficiency,
  • but wastes more memory.

The vLLM authors test many configurations and find that around 16 tokens per block strikes a balance. But workloads differ, and this is a tunable dimension in future variations of paged attention.
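
To get a rough feel for the trade-off, here is a toy calculation reusing the earlier ~0.5 MiB-per-token assumption and an illustrative 1,024-token context:

```python
import math

# Expected internal waste per sequence vs. block-table entries to walk per step.
context_len, mib_per_token = 1024, 0.5
for block_size in (8, 16, 32, 64, 128):
    avg_waste = (block_size - 1) / 2 * mib_per_token   # last block is half full on average
    lookups = math.ceil(context_len / block_size)      # blocks the kernel must visit
    print(f"block={block_size:4d}  waste~{avg_waste:5.1f} MiB  blocks={lookups}")
```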

Paged Attention vs. Traditional Systems: Throughput Gains

While paged attention increases per-kernel latency (~20–26% overhead), the end-to-end throughput improves by 2–4× because:

  • batches become much larger,
  • memory is no longer the bottleneck,
  • iteration-level scheduling can run without exploding memory use,
  • shared prefixes do not duplicate KV cache,
  • requests no longer fail due to lack of contiguous space.

This is the core result:
paged attention trades tiny per-kernel overhead for massive system-wide gains.


Why Paged Attention Works: A Systems Perspective

The beauty of paged attention is that it doesn’t try to fight the GPU or the attention kernel. Instead, it sidesteps the original constraint entirely.

Traditional systems try to squeeze dynamic workloads into rigid, contiguous layouts and then fight the consequences (compaction, large reservations, fragmentation). Paged attention flips the model: accept that token sequences grow unpredictably, and design memory as though you were building a small operating system for the KV cache.

Once you see it through that lens, the entire design becomes obvious:

  • block tables
  • shared blocks
  • copy-on-write
  • demand-based allocation
  • block-level eviction
  • block-level recomputation
  • fragmentation elimination
  • higher effective batch sizes

Paged attention is the kind of engineering idea that feels both novel and inevitable.

Practical Lessons for Engineers Using Paged Attention

If you’re building LLM services or agentic systems, here are some practical takeaways:

  • Measure how much of your KV cache memory is actually used (a quick way to estimate this is sketched after this list). Traditional systems waste the majority of it.
  • If your batch sizes are small because of memory, paged attention will help dramatically.
  • If you rely on beam search, multi-sampling, or agent branching, block-level prefix sharing is a huge win.
  • If you use iteration-level scheduling, you need a KV cache representation that doesn’t explode memory.
  • Understand block size trade-offs (paging is not free; kernel overhead exists).
  • Consider recomputation as a valid alternative to swapping for certain workloads.
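
For the first bullet, a trivial estimate from your own serving logs is often enough. The record fields below are made up; adapt them to whatever your system actually logs:

```python
# Estimate KV cache utilization from per-request logs: tokens actually
# produced vs. capacity reserved. The record format is hypothetical.
requests = [
    {"reserved_tokens": 2048, "used_tokens": 600},
    {"reserved_tokens": 2048, "used_tokens": 180},
    {"reserved_tokens": 2048, "used_tokens": 1500},
]
used = sum(r["used_tokens"] for r in requests)
reserved = sum(r["reserved_tokens"] for r in requests)
print(f"KV cache utilization: {100 * used / reserved:.1f}%")  # ~37% for this toy log
```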

Conclusion: Paged Attention as the New Default Mental Model

Paged attention is not just another incremental optimization. It is a new lens for thinking about how KV cache memory should be managed in autoregressive models.

The math of attention stays the same. What changes is everything around it — the memory layout, the allocator, the scheduler, and the way prefix sharing works. The payoff is enormous: far less waste, far more flexibility, and significantly higher throughput. In many ways, paged attention is to KV memory what virtual memory was to general-purpose computing: a foundational concept that unlocks better utilization of hardware resources.

If you’re serving LLMs in production — or building agentic systems that rely on long-running multi-step reasoning — paged attention is now a core idea you should keep in your toolkit.

Ready to build robust and scalable LLM Applications?
Explore Data Science Dojo’s LLM Bootcamp and Agentic AI Bootcamp for hands-on training in building production-grade retrieval-augmented and agentic AI systems.

Stay tuned — this is part of a 3-part deep-dive series.
In the upcoming posts, we’ll unpack Radix Attention, and a few other emerging techniques that push context efficiency even further. If you’re serious about building fast, scalable LLM systems, you’ll want to check back in for the next installments.
