2.1 Context Windows and the Memory Wall
2.1 Context Windows and the Memory Wall
Chapter 1 ended with a claim: decode is memory-bound, and weight size is what matters most. That claim has a sequel. Once you start running multi-turn agents, the KV cache becomes the dominant memory cost, and it grows linearly with context length. This section explains why, and what the constraint forces you to do.
The Cost of Remembering
A transformer's KV cache holds the key and value tensors from every layer's attention computation, for every token in the context. The model uses this cache to avoid recomputing attention over earlier tokens on every decode step.
The size is:
kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_value
For a typical small LLM (say, 1.5B parameters, 24 layers, 4 KV heads after GQA, head_dim 128, FP16):
- 8K context:
2 × 24 × 4 × 128 × 8192 × 2≈ 400 MB - 32K context: 1.6 GB
- 128K context: 6.4 GB
The weights for that same model at INT4 take roughly 750 MB. At 32K context, the KV cache is 2x larger than the weights themselves. On an NPU with 2–3 GB usable memory, this is the difference between a working agent and one that OOMs on the third turn.
This is the memory wall. You can quantize the weights, you can prune the model, you can fuse operators — but if your context is long, the cache will eat your budget.
Three Ways to Push Back
The good news is the cache responds to optimization, and you have several levers. None of them is free.
Lever 1: Smaller Architectures
The cache scales with num_layers × num_kv_heads × head_dim. Architectures that reduce any of these shrink the cache proportionally:
- Grouped-query attention (GQA) shares KV heads across query heads. A model with 32 query heads and 4 KV heads has 8x smaller cache than full multi-head attention with the same hidden dimension.
- Multi-query attention (MQA) is the extreme case — a single KV head per layer. Smaller cache, occasionally weaker quality.
- Multi-Latent Attention (MLA), used by DeepSeek's recent models, compresses KV into a low-rank latent representation that's reconstructed on the fly. Aggressive cache reduction at the cost of decode complexity.
When you're choosing a model for NPU deployment, look at the attention architecture, not just the parameter count. A 3B model with GQA can have a smaller runtime footprint than a 1.5B model with full MHA.
Lever 2: Quantize the Cache
Most production stacks now run KV cache at INT8 or even INT4. The accuracy hit is small for most workloads — keys and values come from a distribution that quantizes more cleanly than weights, because they're activations the model has already learned to produce.
| KV precision | Cache size (relative) | Quality impact |
|---|---|---|
| FP16 | 1.0x | baseline |
| INT8 | 0.5x | negligible for most workloads |
| INT4 | 0.25x | measurable on long-context recall, often acceptable |
INT8 KV cache is essentially free and should be your default. INT4 is workload-dependent — test it on tasks that exercise long-range recall before committing.
Lever 3: Shorten the Context
This is the cheapest lever and the one most teams underuse. Every token in the context costs cache memory and prefill time. Most agent prompts have substantial waste:
- Verbose system prompts copy-pasted from cloud deployments
- Tool definitions repeated when only a few are relevant this turn
- Conversation history that includes long tool outputs no longer needed for the next decision
- Few-shot examples that the model has now learned to imitate without them
A disciplined context budget is one of the highest-leverage optimizations available. Halving the average context halves the cache and roughly halves TTFT. You can have a 32K-capable model and still operate at an 8K average — if you're deliberate about what goes in.
Attention Patterns Beyond Full Self-Attention
Some models support attention patterns that bound the cache regardless of input length:
Sliding window attention restricts each token to attending only to the last N tokens. The cache stops growing once it reaches N. This works well for chat-style agents where recent context dominates, but degrades on tasks that require recalling something said many turns ago.
Sink attention (StreamingLLM and variants) keeps a small number of initial tokens "pinned" in attention alongside the sliding window. The intuition: the first few tokens of the prompt act as attention anchors, and dropping them collapses output quality. Sink attention preserves them while otherwise sliding.
Hybrid architectures like Mamba/SSM-transformer hybrids (e.g., Jamba) use state-space models for most layers and attention sparingly. Their per-token state is bounded by architecture, not context length — attractive for very long context on constrained hardware, though tooling support varies.
For most NPU deployments today, you'll be working with full attention or GQA + sliding window. The exotic options are worth knowing about but rarely the right starting point.
A Working Budget for a Typical NPU Agent
Here's a concrete budget that fits on a 3 GB NPU memory envelope:
| Component | Size |
|---|---|
| Model weights (1.5B, INT4) | 750 MB |
| Activation overhead | 400 MB |
| KV cache (8K context, INT8, GQA-4) | 200 MB |
| Tokenizer + runtime | 50 MB |
| Headroom | 600 MB |
| Total | ~2.0 GB |
That budget gives you a comfortable chat agent with multi-turn memory up to about 8K tokens of effective context. Push to 16K and you eat into headroom. Push to 32K and you're allocating from disk, with all the latency that implies.
Design for the budget you have, not the context window the model advertises. A model trained on 128K context doesn't mean you can run it at 128K on your hardware.
What This Section Bought You
You now have a framework for thinking about agent memory on NPUs:
- The KV cache is the dominant per-session memory cost beyond a few thousand tokens of context
- Three levers reduce it: architecture choice (GQA/MQA/MLA), cache quantization, and prompt discipline
- Sliding-window and sink-attention variants trade long-range recall for bounded cache
- Real budgets are tighter than advertised context windows — design to the device, not the brochure
The next section drills into KV cache engineering: how to reuse caches across turns, what to do when they evict, and how to detect when you're spending more time managing memory than thinking.
Next: 2.2 KV Cache Engineering: Reuse, Eviction, and Quantization