2.1 Context Windows and the Memory Wall
Chapter 1 established the constraints, and this chapter is about working inside them. It ended with a claim: decode is memory-bound, and weight size is what matters most. That claim has a sequel. Once you start running multi-turn agents, the KV cache becomes the dominant memory cost, and it grows linearly with context length. The most important number an agent designer keeps in their head is how much memory the KV cache eats per token, because that number multiplied by your context length is the wall you hit. This section explains why, and what the constraint forces you to do.
The Cost of Remembering
A transformer's KV cache holds the key and value tensors from every layer's attention computation, for every token in the context. The model uses this cache to avoid recomputing attention over earlier tokens on every decode step.
The size is:
kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_value
For a typical small LLM (say, 1.5B parameters, 24 layers, 4 KV heads after GQA, head_dim 128, FP16):
2 × 24 × 4 × 128 × 8192 × 2 = 402,653,184 bytes ≈ 384 MiB at 8K context
The weights for that same model at INT4 take roughly 750 MB. At 32K context, the KV cache grows to about 1.5 GiB, roughly twice the size of the weights themselves. On an NPU with 2–3 GB usable memory, this is the difference between a working agent and one that OOMs on the third turn.
This is the memory wall. You can quantize the weights, you can prune the model, you can fuse operators, but if your context is long, the cache will eat your budget. On Intel NPU with M2M-100, that wall has a specific shape, and it's set by an architectural choice Meta made in 2020 that no amount of clever serving can paper over: M2M-100 uses full multi-head attention, with no sharing of KV heads.
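The size formula above is easy to turn into a helper for checking numbers before committing to a context length. This is a minimal sketch that plugs in the example model from the text (24 layers, 4 KV heads, head_dim 128, FP16); the function and variable names are illustrative only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Decoder-only KV cache size; the leading 2 counts the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

MIB = 1024 ** 2

# Example model from the text: 24 layers, 4 KV heads, head_dim 128, FP16 (2 bytes).
at_8k = kv_cache_bytes(24, 4, 128, 8192, 2)
at_32k = kv_cache_bytes(24, 4, 128, 32768, 2)

print(f"8K context:  {at_8k / MIB:.0f} MiB")
print(f"32K context: {at_32k / MIB:.0f} MiB")
```

Because the cache is linear in seq_len, the 32K figure is exactly four times the 8K figure, which is where the comparison against the 750 MB of INT4 weights comes from.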
The KV Cache Formula
For an encoder-decoder transformer like M2M-100, the per-step decoder state contains two KV caches: self-attention (the decoder attending to its own previous tokens) and cross-attention (the decoder attending to the encoder output). Both contribute, and for M2M-100 specifically both have the same per-layer dimensions.
The formula is:
KV_self = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
Total = KV_self + KV_cross
The factor of 2 is for K and V tensors. L_dec is the number of decoder layers. The cross-attention KV is computed once over the full encoder output (length T_enc) and reused at every decode step, while the self-attention KV grows with T_dec as tokens are generated.
M2M-100 KV Footprint, Specific Numbers
The configurations come straight from the HuggingFace model cards:
M2M-100 418M: 12 encoder layers, 12 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
M2M-100 1.2B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 64.
M2M-100 12B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 256 (embed_dim 4096 / 16 heads).
At T_enc = T_dec = 128 (a sentence-level translation working point):
M2M-100 418M: roughly 12 MiB FP16 (6 MiB self-attention plus 6 MiB cross-attention).
M2M-100 1.2B: roughly 24 MiB FP16 (12 MiB self-attention plus 12 MiB cross-attention).
Three Ways to Push Back
The good news is the cache responds to optimization, and you have several levers. None of them is free.
Lever 1: Smaller Architectures
The cache scales with num_layers × num_kv_heads × head_dim. Architectures that reduce any of these shrink the cache proportionally:
- Grouped Query Attention (GQA) shares each KV head across a group of query heads, so num_kv_heads drops below the query-head count.
- Multi-Query Attention (MQA) is the extreme case: a single KV head per layer. Smaller cache, occasionally weaker quality.
- Multi-head Latent Attention (MLA) stores K and V as a compressed low-rank latent and decompresses it on the fly. Aggressive cache reduction at the cost of decode complexity.
When you're choosing a model for deployment, look at the architecture, not just the parameter count. A 3B model with GQA can have a smaller runtime footprint than a 1.5B model with full MHA.
Lever 2: Quantize the Cache
Many production stacks now run the KV cache at INT8 or even INT4. The accuracy hit is small for most workloads: keys and values come from a distribution that quantizes more cleanly than weights, because they're activations the model has already learned to produce.
INT8 KV cache is essentially free and should be your default. INT4 is workload-dependent: test it on tasks that exercise long-range recall before committing.
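To make the INT8 lever concrete, here is a minimal sketch of symmetric absmax quantization applied to one key or value vector. It is an illustration under assumptions, not how any particular runtime does it: the one-scale-per-vector layout and the helper names are choices made for this example, and real stacks use vectorized kernels with their own scale granularity.

```python
def quantize_int8(row):
    """Symmetric absmax quantization of one vector: returns (int values, scale)."""
    # An all-zero row would give scale 0, so fall back to 1.0 in that case.
    scale = max(abs(v) for v in row) / 127.0 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in row]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]

# A toy 4-element "key" vector; real vectors have head_dim entries.
row = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(row)
restored = dequantize_int8(q, scale)
worst_error = max(abs(a - b) for a, b in zip(row, restored))
```

Each stored value takes one byte instead of two, plus one scale per vector, and the round-trip error on any element is bounded by half a quantization step (scale / 2). That bound is why long-range recall tasks are the right stress test: small per-element errors accumulate over many cached positions.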
Lever 3: Shorten the Context
This is the cheapest lever and the one most teams underuse. Every token in the context costs cache memory and prefill time. Most agent prompts have substantial waste.
A disciplined context budget is one of the highest-leverage optimizations available. Halving the average context halves the cache and roughly halves TTFT. You can have a 32K-capable model and still operate at an 8K average — if you're deliberate about what goes in.
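One simple way to enforce a context budget is to keep the system prompt and then admit conversation turns newest-first until the budget is spent. The sketch below assumes that policy; `trim_to_budget` and `word_count` are hypothetical helpers, and `word_count` is only a stand-in for a real tokenizer.

```python
def trim_to_budget(system_prompt, turns, budget_tokens, count_tokens):
    """Keep the system prompt plus the most recent turns that fit the budget."""
    used = count_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):           # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break                          # stop at the first turn that does not fit
        kept.append(turn)
        used += cost
    return [system_prompt] + kept[::-1], used

def word_count(text):
    """Hypothetical token counter: one token per whitespace-separated word."""
    return len(text.split())

turns = ["first question here", "a long answer " * 5, "follow up", "short reply"]
context, used = trim_to_budget("you are helpful", turns,
                               budget_tokens=8, count_tokens=word_count)
```

Here the long answer does not fit, so it and everything older is dropped. Summarizing dropped turns instead of discarding them is a common refinement, at the cost of an extra model call.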
Attention Patterns Beyond Full Self-Attention
Some models support attention patterns that bound the cache regardless of input length:
Sliding window attention restricts each token to attending only to the last N tokens. The cache stops growing once it reaches N. This works well for chat-style agents where recent context dominates, but degrades on tasks that require recalling something said many turns ago.
Sink attention (StreamingLLM and variants) keeps a small number of initial tokens "pinned" in attention alongside the sliding window. The intuition: the first few tokens of the prompt act as attention anchors, and dropping them collapses output quality. Sink attention preserves them while otherwise sliding.
Hybrid architectures like Mamba/SSM-transformer hybrids (e.g., Jamba) use state-space models for most layers and attention sparingly. Their per-token state is bounded by architecture, not context length — attractive for very long context on constrained hardware, though tooling support varies.
For most NPU deployments today, you'll be working with full attention or GQA + sliding window. The exotic options are worth knowing about but rarely the right starting point.
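The eviction policy behind sliding window and sink attention can be sketched with a bounded queue. This is a toy model of the bookkeeping only: integers stand in for per-token K/V tensors, the class name is made up for this example, and the sketch ignores how positions are assigned after eviction, which real implementations have to handle.

```python
from collections import deque

class SinkWindowCache:
    """Keeps the first `num_sink` entries forever plus the last `window` entries."""

    def __init__(self, num_sink, window):
        self.num_sink = num_sink
        self.sinks = []
        self.recent = deque(maxlen=window)  # deque drops its oldest entry when full

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv_entry)     # pinned: never evicted
        else:
            self.recent.append(kv_entry)    # sliding part of the cache

    def entries(self):
        return self.sinks + list(self.recent)

cache = SinkWindowCache(num_sink=2, window=3)
for position in range(10):
    cache.append(position)                  # integers stand in for K/V tensors

plain_window = SinkWindowCache(num_sink=0, window=3)
for position in range(10):
    plain_window.append(position)
```

With `num_sink=0` the same class behaves as a plain sliding window. In both cases the cache size is capped at `num_sink + window` entries no matter how long the conversation runs.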
A Working Budget for a Typical NPU Agent
Here's a concrete budget that fits on a 3 GB NPU memory envelope:
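That budget gives you a comfortable chat agent with multi-turn memory up to about 8K. Push to 16K and you eat into headroom. Push to 32K and you're allocating from disk, with all the latency that implies.
For M2M-100 12B at the same T_enc = T_dec = 128 working point, the total is roughly 96 MiB FP16 (head_dim balloons to 256, which is the dominant scaling factor).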
For sentence-level translation these numbers are small — they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows: at T_enc = T_dec = 1024 the 1.2B model's KV state crosses 200 MB at FP16, and the cross-attention component dominates because translating long source documents keeps that full encoder output live in memory the entire time.
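The footprint figures can be reproduced from the encoder-decoder formula and the three configurations quoted in this section. A minimal sketch follows; the function names and the `CONFIGS` table are illustrative, and the values in it are the ones given in the text.

```python
MIB = 1024 ** 2

def encdec_kv_bytes(l_dec, n_heads, head_dim, t_enc, t_dec, dtype_bytes=2):
    """Return (KV_self, KV_cross) in bytes for an encoder-decoder model."""
    per_position = 2 * l_dec * n_heads * head_dim * dtype_bytes  # K and V, all layers
    return per_position * t_dec, per_position * t_enc

# (decoder layers, attention heads, head_dim) as listed in the text.
CONFIGS = {"418M": (12, 16, 64), "1.2B": (24, 16, 64), "12B": (24, 16, 256)}

def total_mib(name, t_enc, t_dec):
    kv_self, kv_cross = encdec_kv_bytes(*CONFIGS[name], t_enc, t_dec)
    return (kv_self + kv_cross) / MIB

for name in CONFIGS:
    print(f"M2M-100 {name}: {total_mib(name, 128, 128):.0f} MiB at T_enc = T_dec = 128")
print(f"M2M-100 1.2B: {total_mib('1.2B', 1024, 1024):.0f} MiB at T_enc = T_dec = 1024")
```

The 1.2B model at T = 1024 comes out at 192 MiB, which is about 201 MB in decimal units and is the "crosses 200 MB" figure. Half of that is cross-attention state that stays resident for the whole decode.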
The Full-MHA Tax — The Headline Insight
Here's the comparison that should be the takeaway from this chapter:
Per-token decoder self-attention KV bytes:
M2M-100 1.2B with full MHA at FP16: 2 · 24 · 16 · 64 · 2 = 98,304 bytes/token
Phi-3-mini-3.8B with GQA-8 at FP16: 2 · 32 · 8 · 96 · 2 = 98,304 bytes/token
These are identical to the byte.
A 1.2-billion-parameter encoder-decoder translation model from 2020 has the same per-token decoder self-attention KV footprint as a modern 3.8-billion-parameter decoder-only LLM, because Phi-3 uses Grouped Query Attention with one-quarter the KV heads. And M2M-100 carries cross-attention KV at the same per-layer cost on top, which Phi-3 does not have at all.
The architectural conclusion is direct: the KV cache wall is set by attention design, not parameter count. Phi-3 deploys to NPU comfortably at 4K context. M2M-100 1.2B at 1K context exerts the same per-token KV bandwidth pressure on the LPDDR5X bus.
This is what we mean when we say M2M-100 is "expensive per parameter" — not in FLOPs or weight memory, but in the bandwidth its decoder consumes per generated token. The fix is GQA. The fix requires retraining. Nobody has retrained M2M-100 with GQA. So we live with it.
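The per-token comparison is a one-line calculation, sketched here with the figures exactly as quoted above. The Phi-3 numbers (32 layers, 8 KV heads, head_dim 96) are taken from the text and are not re-checked against the published model config. The GQA-4 variant of M2M-100 is purely hypothetical, since no such checkpoint exists.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Self-attention KV bytes added per generated token (K and V, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Full MHA: all 16 heads store their own K and V.
m2m_1_2b = kv_bytes_per_token(24, 16, 64)

# Figures as quoted in the text for Phi-3-mini with 8 KV heads.
phi3_as_quoted = kv_bytes_per_token(32, 8, 96)

# Hypothetical: M2M-100 1.2B retrained with 4 KV heads (no such model exists).
m2m_hypothetical_gqa4 = kv_bytes_per_token(24, 4, 64)

print(m2m_1_2b, phi3_as_quoted, m2m_hypothetical_gqa4)
```

The hypothetical GQA-4 variant would need a quarter of the per-token bytes, which is the saving the text says is out of reach without retraining.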
Modern Attention Optimizations and Why M2M-100 Doesn't Get Them
The KV cache footprint has driven roughly five years of architectural innovation in decoder-only LLMs, and M2M-100 predates all of it:
Grouped Query Attention (GQA) shares K and V across groups of query heads, typically 4 or 8 query heads per KV head. Llama 2 70B, Llama 3, and Phi-3 use GQA. It shrinks the KV cache to n_kv / n_heads of its full-MHA size. M2M-100 has no GQA.
Multi-Query Attention (MQA) is GQA's extreme — one KV head shared by all queries. Falcon-7B uses MQA. M2M-100 has no MQA.
Multi-head Latent Attention (MLA) compresses K and V into a low-rank latent space, decompressing only at attention time. DeepSeek-V2 and V3 use MLA. M2M-100 has no MLA.
KV cache quantization drops the cache from FP16 to INT8 (or below). INT8 halves KV bandwidth at modest quality cost. Works on any model. This is the lever you can pull for M2M-100.
The honest summary: of the four major KV optimizations, only the last one — cache quantization — is available to M2M-100. INT8 KV halves your bandwidth pressure and roughly doubles your effective context length before hitting the bandwidth wall. Use it.
The Bandwidth Wall, Quantified
Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB per token, for a ceiling of about 34 tok/s.
The KV cache adds to this. For M2M-100 1.2B at FP16 generating a long output, the per-token weight read is ~2.4 GB (the FP16 decoder weights), the per-token self-attention KV read grows from near-zero at token 1 to ~100 MB by token 1000 (98,304 bytes for each cached position), and the cross-attention KV is read in full every step. The effective bandwidth-per-token is dominated by weights for moderate contexts and only crosses over into the KV-dominated regime above several thousand tokens of decoded output. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.
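The ceiling arithmetic can be written as a small model: tokens per second is at most memory bandwidth divided by the bytes that must be read per token. The sketch below uses the figures quoted in the text (136.5 GB/s, 4 GB for an 8B INT4 model, 2.4 GB and 98,304 bytes per cached position for M2M-100 1.2B at FP16). It assumes every cached position is read once per step and ignores compute, caching effects, and bandwidth shared with other clients, so treat the results as upper bounds only.

```python
BANDWIDTH_BYTES_PER_S = 136.5e9   # LPDDR5X-8533 figure quoted in the text

def decode_ceiling_tok_s(weight_bytes, kv_bytes_read=0):
    """Upper bound on tokens/s if each token streams the weights plus cached KV."""
    return BANDWIDTH_BYTES_PER_S / (weight_bytes + kv_bytes_read)

ceiling_8b_int4 = decode_ceiling_tok_s(4e9)

KV_PER_POSITION = 98_304          # M2M-100 1.2B, FP16, per cached position

def m2m_ceiling(t_enc, t_dec, weight_bytes=2.4e9):
    """Ceiling when t_enc cross-attention and t_dec self-attention positions are cached."""
    kv_read = KV_PER_POSITION * (t_enc + t_dec)
    return decode_ceiling_tok_s(weight_bytes, kv_read)

print(f"8B INT4: {ceiling_8b_int4:.1f} tok/s")
print(f"M2M-100 1.2B, 32+32 positions:     {m2m_ceiling(32, 32):.1f} tok/s")
print(f"M2M-100 1.2B, 2048+2048 positions: {m2m_ceiling(2048, 2048):.1f} tok/s")
```

Running it with different `t_enc` and `t_dec` values shows how the ceiling falls as the cached context grows, and passing a smaller `weight_bytes` shows how weight quantization moves the point where KV reads start to matter.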
Does Any of This Matter for Short-Context Translation?
For a single English-to-French sentence (T_enc ≈ 32, T_dec ≈ 32), M2M-100 418M has about 3 MB of KV state in FP16, completely negligible against 840 MB of FP16 weights. The KV cache is not the bottleneck for M2M-100 418M on short inputs; weight memory is.
So why discuss it? Three reasons:
Longer documents matter. Paragraph-level translation at T = 512 puts you in the regime where KV cache starts to compete with weight memory for bandwidth. Document-level translation at T = 2048 is firmly KV-dominated. Many real translation workloads are not single sentences.
The 12B variant matters. Total KV state reaches 96 MiB at T = 128 on the 12B model (half of it cross-attention), and the model already strains consumer NPU memory at INT4. KV is the difference between fitting and not.
The principle generalizes. Every other encoder-decoder seq2seq model (NLLB-200, MarianMT, FLAN-T5, Whisper) has the same architectural problem and the same lack of GQA. If you take one thing from this chapter, it's that modern attention optimizations don't apply to 2020-era seq2seq, and you need to plan around it.
Design for the budget you have, not the context window the model advertises. A model trained on 128K context doesn't mean you can run it at 128K on your hardware.
What This Section Bought You
You should now have a framework for thinking about agent memory on NPUs. In particular, you should understand:
- The KV cache formula for encoder-decoder models: self-attention plus cross-attention, both scaling with layers, heads, head_dim, and sequence length
The next section moves from theory to engineering: how do you actually manage KV cache on Intel NPU through OpenVINO GenAI, and what does the prefix-caching / chunked-prefill / static-shape stack do for you (and to you)? It covers how to reuse caches across turns, what to do when they evict, and how to detect when you're spending more time managing memory than thinking.
Next: 2.2 KV Cache Engineering: Reuse, Eviction, and Quantization