2.1 Context Windows and the Memory Wall

Chapter 1 ended with a claim: decode is memory-bound, and weight size is what matters most. That claim has a sequel. Once you start running multi-turn agents, the KV cache becomes the dominant memory cost, and it grows linearly with context length. This section explains why, and what the constraint forces you to do.

The Cost of Remembering

A transformer's KV cache holds the key and value tensors from every layer's attention computation, for every token in the context. The model uses this cache to avoid recomputing attention over earlier tokens on every decode step.

The size is:

kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_value

For a typical small LLM (say, 1.5B parameters, 24 layers, 4 KV heads after GQA, head_dim 128, FP16):

8K context: 2 × 24 × 4 × 128 × 8192 × 2 ≈ 400 MB
32K context: 1.6 GB
128K context: 6.4 GB

The weights for that same model at INT4 take roughly 750 MB. At 32K context, the KV cache is 2x larger than the weights themselves. On an NPU with 2–3 GB usable memory, this is the difference between a working agent and one that OOMs on the third turn.

This is the memory wall. You can quantize the weights, you can prune the model, you can fuse operators — but if your context is long, the cache will eat your budget.

Three Ways to Push Back

The good news is the cache responds to optimization, and you have several levers. None of them is free.

Lever 1: Smaller Architectures

The cache scales with num_layers × num_kv_heads × head_dim. Architectures that reduce any of these shrink the cache proportionally:

Grouped-query attention (GQA) shares KV heads across query heads. A model with 32 query heads and 4 KV heads has 8x smaller cache than full multi-head attention with the same hidden dimension.
Multi-query attention (MQA) is the extreme case — a single KV head per layer. Smaller cache, occasionally weaker quality.
Multi-Latent Attention (MLA), used by DeepSeek's recent models, compresses KV into a low-rank latent representation that's reconstructed on the fly. Aggressive cache reduction at the cost of decode complexity.

When you're choosing a model for NPU deployment, look at the attention architecture, not just the parameter count. A 3B model with GQA can have a smaller runtime footprint than a 1.5B model with full MHA.

Lever 2: Quantize the Cache

Most production stacks now run KV cache at INT8 or even INT4. The accuracy hit is small for most workloads — keys and values come from a distribution that quantizes more cleanly than weights, because they're activations the model has already learned to produce.

KV precision	Cache size (relative)	Quality impact
FP16	1.0x	baseline
INT8	0.5x	negligible for most workloads
INT4	0.25x	measurable on long-context recall, often acceptable

INT8 KV cache is essentially free and should be your default. INT4 is workload-dependent — test it on tasks that exercise long-range recall before committing.

Lever 3: Shorten the Context

This is the cheapest lever and the one most teams underuse. Every token in the context costs cache memory and prefill time. Most agent prompts have substantial waste:

Verbose system prompts copy-pasted from cloud deployments
Tool definitions repeated when only a few are relevant this turn
Conversation history that includes long tool outputs no longer needed for the next decision
Few-shot examples that the model has now learned to imitate without them

A disciplined context budget is one of the highest-leverage optimizations available. Halving the average context halves the cache and roughly halves TTFT. You can have a 32K-capable model and still operate at an 8K average — if you're deliberate about what goes in.

Attention Patterns Beyond Full Self-Attention

Some models support attention patterns that bound the cache regardless of input length:

Sliding window attention restricts each token to attending only to the last N tokens. The cache stops growing once it reaches N. This works well for chat-style agents where recent context dominates, but degrades on tasks that require recalling something said many turns ago.

Sink attention (StreamingLLM and variants) keeps a small number of initial tokens "pinned" in attention alongside the sliding window. The intuition: the first few tokens of the prompt act as attention anchors, and dropping them collapses output quality. Sink attention preserves them while otherwise sliding.

Hybrid architectures like Mamba/SSM-transformer hybrids (e.g., Jamba) use state-space models for most layers and attention sparingly. Their per-token state is bounded by architecture, not context length — attractive for very long context on constrained hardware, though tooling support varies.

For most NPU deployments today, you'll be working with full attention or GQA + sliding window. The exotic options are worth knowing about but rarely the right starting point.

A Working Budget for a Typical NPU Agent

Here's a concrete budget that fits on a 3 GB NPU memory envelope:

Component	Size
Model weights (1.5B, INT4)	750 MB
Activation overhead	400 MB
KV cache (8K context, INT8, GQA-4)	200 MB
Tokenizer + runtime	50 MB
Headroom	600 MB
Total	~2.0 GB

That budget gives you a comfortable chat agent with multi-turn memory up to about 8K tokens of effective context. Push to 16K and you eat into headroom. Push to 32K and you're allocating from disk, with all the latency that implies.

Design for the budget you have, not the context window the model advertises. A model trained on 128K context doesn't mean you can run it at 128K on your hardware.

What This Section Bought You

You now have a framework for thinking about agent memory on NPUs:

The KV cache is the dominant per-session memory cost beyond a few thousand tokens of context
Three levers reduce it: architecture choice (GQA/MQA/MLA), cache quantization, and prompt discipline
Sliding-window and sink-attention variants trade long-range recall for bounded cache
Real budgets are tighter than advertised context windows — design to the device, not the brochure

The next section drills into KV cache engineering: how to reuse caches across turns, what to do when they evict, and how to detect when you're spending more time managing memory than thinking.

Next: 2.2 KV Cache Engineering: Reuse, Eviction, and Quantization