2.1 Context Windows and the Memory Wall
Chapter 1 established the constraints. This chapter is about working inside them. The most important number an agent designer keeps in their head is how much memory the KV cache eats per token, because that number multiplied by your context length is the wall you hit. On Intel NPU with M2M-100, that wall has a specific shape, and it's set by an architectural choice Meta made in 2020 that no amount of clever serving can paper over: M2M-100 uses full multi-head attention with no GQA.
The KV Cache Formula
For an encoder-decoder transformer like M2M-100, the per-step decoder state contains two KV caches: self-attention (the decoder attending to its own previous tokens) and cross-attention (the decoder attending to the encoder output). Both contribute, and for M2M-100 specifically both are the same per-layer size, because the model doesn't compress KV heads.
The formula is:
KV_self = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
Total = KV_self + KV_cross
The factor of 2 is for K and V tensors. L_dec is the number of decoder layers. The cross-attention KV is computed once over the full encoder output (length T_enc) and reused on every decoder step; the self-attention KV grows with T_dec as we generate.
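The formula translates directly into a few lines of arithmetic. The sketch below is illustrative only: the `DecoderShape` class and `kv_cache_bytes` function are hypothetical names introduced here, not part of any library, and the sketch assumes a single sequence (batch size 1) with no padding or alignment overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderShape:
    """Decoder attention shape; the field names are illustrative."""
    n_layers: int   # L_dec
    n_heads: int    # KV heads (equal to query heads under full MHA)
    head_dim: int


def kv_cache_bytes(shape, t_enc, t_dec, bytes_per_elem=2):
    """Return self-, cross-, and total KV cache bytes for one sequence."""
    # Two tensors (K and V) per layer, each n_heads * head_dim wide per token.
    per_token = 2 * shape.n_layers * shape.n_heads * shape.head_dim * bytes_per_elem
    kv_self = per_token * t_dec    # grows by one token per decoder step
    kv_cross = per_token * t_enc   # fixed once the encoder has run
    return {"self": kv_self, "cross": kv_cross, "total": kv_self + kv_cross}


# Example: a 12-layer, 16-head, head_dim-64 decoder at FP16 (2 bytes per element).
shape = DecoderShape(n_layers=12, n_heads=16, head_dim=64)
print(kv_cache_bytes(shape, t_enc=128, t_dec=128))
```

Passing `bytes_per_elem=1` models an INT8 cache; any per-block scale factors a real quantized cache would store are ignored here.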
M2M-100 KV Footprint, Specific Numbers
The configurations come straight from the HuggingFace model cards:
- M2M-100 418M: 12 encoder layers, 12 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
- M2M-100 1.2B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
- M2M-100 12B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 256 (embed_dim 4096 / 16 heads).
At T_enc = T_dec = 128 (a sentence-level translation working point):
M2M-100 418M
| Precision | Self-attn KV | Cross-attn KV | Total |
|---|---|---|---|
| FP16 | 6.29 MB | 6.29 MB | 12.58 MB |
| INT8 KV | 3.15 MB | 3.15 MB | 6.29 MB |
M2M-100 1.2B
| Precision | Self-attn KV | Cross-attn KV | Total |
|---|---|---|---|
| FP16 | 12.58 MB | 12.58 MB | 25.17 MB |
| INT8 KV | 6.29 MB | 6.29 MB | 12.58 MB |
M2M-100 12B, same shape: roughly 96 MiB (about 101 MB) at FP16 in total, split evenly between self- and cross-attention (head_dim balloons to 256, which is the dominant scaling factor).
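The figures above can be reproduced from the listed configurations. This is a minimal sketch: the `MODELS` table simply restates the layer, head, and head_dim values given in this section, and the helper names are hypothetical. It prints decimal megabytes (10^6 bytes) alongside MiB (2^20 bytes) so the two units are not confused.

```python
MODELS = {
    # name: (decoder layers, attention heads, head_dim), as listed in the text
    "M2M-100 418M": (12, 16, 64),
    "M2M-100 1.2B": (24, 16, 64),
    "M2M-100 12B": (24, 16, 256),
}
BYTES_PER_ELEM = {"FP16": 2, "INT8 KV": 1}


def kv_bytes(layers, heads, head_dim, tokens, elem_bytes):
    """Bytes for one KV cache (self or cross) covering `tokens` positions."""
    return 2 * layers * heads * head_dim * tokens * elem_bytes


def footprint_rows(t_enc=128, t_dec=128):
    """Return (model, precision, self, cross, total) tuples in bytes."""
    rows = []
    for name, (layers, heads, head_dim) in MODELS.items():
        for precision, elem_bytes in BYTES_PER_ELEM.items():
            kv_self = kv_bytes(layers, heads, head_dim, t_dec, elem_bytes)
            kv_cross = kv_bytes(layers, heads, head_dim, t_enc, elem_bytes)
            rows.append((name, precision, kv_self, kv_cross, kv_self + kv_cross))
    return rows


for name, precision, kv_self, kv_cross, total in footprint_rows():
    # 1 MB = 10**6 bytes, 1 MiB = 2**20 bytes
    print(f"{name:13s} {precision:8s} self={kv_self / 1e6:6.2f} MB "
          f"cross={kv_cross / 1e6:6.2f} MB total={total / 1e6:7.2f} MB "
          f"({total / 2**20:.0f} MiB)")
```

Calling `footprint_rows(t_enc=1024, t_dec=1024)` gives the long-context working point for the same three models.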
For sentence-level translation these numbers are small: they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows: at T_enc = T_dec = 1024 the 1.2B model's KV state crosses 200 MB at FP16. The cross-attention half is the larger share for almost the entire decode, because translating a long source document keeps the full encoder-side cache live from the first generated token while the self-attention cache is still filling.
The Full-MHA Tax — The Headline Insight
Here's the comparison that should be the takeaway from this chapter:
Per-token decoder self-attention KV bytes:
- M2M-100 1.2B at FP16: 2 · 24 · 16 · 64 · 2 = 98,304 bytes/token
- Phi-3-mini-3.8B with GQA-8 at FP16: 2 · 32 · 8 · 96 · 2 = 98,304 bytes/token
These are identical to the byte.
A 1.2-billion-parameter encoder-decoder translation model from 2020 has the same per-token decoder self-attention KV footprint as a modern 3.8-billion-parameter decoder-only LLM, because Phi-3 uses Grouped Query Attention with one-quarter the KV heads. And M2M-100 carries cross-attention KV at the same per-layer cost on top, which Phi-3 does not have at all.
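The per-token comparison is simple enough to check by hand or in code. In the sketch below, both shapes are the ones quoted in this section; in particular, the Phi-3-mini shape (32 layers, 8 KV heads, head_dim 96) is taken from the text as given and should be checked against the model's published configuration before relying on it. The function name is a hypothetical helper.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Self-attention KV bytes appended to the cache per generated token."""
    # Factor of 2 covers the K and V tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Shapes as quoted in the text; neither is read from a model config here.
m2m100_1p2b = kv_bytes_per_token(n_layers=24, n_kv_heads=16, head_dim=64)
phi3_mini_gqa8 = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=96)

print(m2m100_1p2b, phi3_mini_gqa8)  # 98304 98304

# Cached tokens that fit in 100 MiB of self-attention KV at this rate.
tokens_in_100_mib = (100 * 2**20) // m2m100_1p2b
print(tokens_in_100_mib)
```

Only `n_kv_heads` enters the formula, which is why sharing K and V across query heads shrinks the cache without changing the query-side width.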
The architectural conclusion is direct: the KV cache wall is set by attention design, not parameter count. Phi-3 deploys to NPU comfortably at 4K context. M2M-100 1.2B pays the same KV bytes per cached decoder token, then reads a cross-attention cache of the same per-token cost on every step, all over the same LPDDR5X bus.
This is what we mean when we say M2M-100 is "expensive per parameter" — not in FLOPs or weight memory, but in the bandwidth its decoder consumes per generated token. The fix is GQA. The fix requires retraining. Nobody has retrained M2M-100 with GQA. So we live with it.
Modern Attention Optimizations and Why M2M-100 Doesn't Get Them
The KV cache footprint has driven roughly five years of architectural innovation in decoder-only LLMs, and M2M-100 predates the widespread adoption of all of it:
Grouped Query Attention (GQA) shares K and V across groups of query heads, typically 4 or 8 query heads per KV head. Llama 2 70B, Llama 3, and Phi-3 use GQA. It shrinks the KV cache to n_kv / n_heads of its full-MHA size. M2M-100 has no GQA.
Multi-Query Attention (MQA) is GQA's extreme — one KV head shared by all queries. Falcon-7B uses MQA. M2M-100 has no MQA.
Multi-head Latent Attention (MLA) compresses K and V into a low-rank latent space, decompressing only at attention time. DeepSeek-V2 and V3 use MLA. M2M-100 has no MLA.
KV cache quantization drops the cache from FP16 to INT8 (or below). Halves bandwidth at modest quality cost. Works on any model. This is the lever you can pull for M2M-100.
The honest summary: of the four major KV optimizations, only the last one, cache quantization, is available to M2M-100. INT8 KV halves the cache's share of your bandwidth pressure and roughly doubles the context length you can hold in the same KV budget. Use it.
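To make the quantization lever concrete, here is a minimal sketch of symmetric per-vector INT8 quantization applied to a single key vector. It is an assumption-laden illustration, not the scheme any particular runtime uses: real implementations choose their own granularity (per token, per head, per channel or per group) and scale format, and the random vector is only a stand-in for real K or V data.

```python
import random


def quantize_int8(vec):
    """Symmetric per-vector INT8: returns (ints in [-127, 127], scale)."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q, scale):
    """Map the stored integers back to approximate float values."""
    return [v * scale for v in q]


random.seed(0)
head_dim = 64
# Stand-in for one head's key vector at one token position.
key_vec = [random.gauss(0.0, 1.0) for _ in range(head_dim)]

q, scale = quantize_int8(key_vec)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(key_vec, restored))

fp16_bytes = head_dim * 2
int8_bytes = head_dim * 1 + 2  # payload plus one FP16 scale per vector
print(f"max abs error: {max_err:.4f} (rounding bound: {scale / 2:.4f})")
print(f"bytes per vector: FP16={fp16_bytes}, INT8={int8_bytes}")
```

The stored scale is why the saving is slightly less than an exact halving in this sketch (66 bytes against 128); coarser scale granularity reduces that overhead at the cost of accuracy.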
The Bandwidth Wall, Quantified
Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB per token, which caps decode at about 34 tok/s (136.5 / 4).
The KV cache adds to this. For M2M-100 1.2B at FP16 generating a long output, the per-token weight read is at most ~2.4 GB (the full FP16 checkpoint; only the decoder's share is streamed on each step), the self-attention KV read grows from near-zero at token 1 to ~98 MB by token 1000 (98,304 bytes per cached token), and the cross-attention KV is read in full every step. The effective bandwidth-per-token is dominated by weights for moderate contexts and only crosses into a KV-dominated regime once the two caches together hold on the order of ten thousand tokens or more. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.
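The ceiling arithmetic can be sketched as bandwidth divided by bytes moved per decode step. The code below is a rough model under stated assumptions: every step streams the weights once plus the full cross-attention cache and the self-attention cache built so far, compute and caching effects are ignored, and 2.4 GB is used as an upper bound on the M2M-100 1.2B weight read. Function names are hypothetical.

```python
BANDWIDTH_BYTES_PER_S = 136.5e9  # LPDDR5X-8533 figure quoted in the text


def decode_ceiling_tok_per_s(weight_bytes, kv_bytes=0,
                             bandwidth=BANDWIDTH_BYTES_PER_S):
    """Upper bound on decode rate if each step streams weights plus KV once."""
    return bandwidth / (weight_bytes + kv_bytes)


def kv_read_bytes(per_token_bytes, t_enc, t_dec_so_far):
    """KV bytes read on one step: full cross cache plus self cache so far."""
    return per_token_bytes * (t_enc + t_dec_so_far)


# Weight-only ceiling for an 8B model at INT4 (4 GB of weights).
print(f"{decode_ceiling_tok_per_s(4e9):.1f} tok/s")  # 34.1 tok/s

# M2M-100 1.2B at FP16; 2.4 GB is an upper bound on the per-step weight read.
PER_TOKEN = 2 * 24 * 16 * 64 * 2  # 98,304 bytes per cached token
for step in (1, 1000, 4000):
    kv = kv_read_bytes(PER_TOKEN, t_enc=1024, t_dec_so_far=step)
    rate = decode_ceiling_tok_per_s(2.4e9, kv)
    print(f"step {step:5d}: KV read {kv / 1e6:7.1f} MB, ceiling {rate:5.1f} tok/s")
```

Because the bandwidth is shared with the rest of the system, real throughput will sit below these ceilings; the sketch is useful for seeing how the KV term erodes the weight-only bound as context grows.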
Does Any of This Matter for Short-Context Translation?
For a single English-to-French sentence (T_enc ≈ 32, T_dec ≈ 32), M2M-100 418M has about 3 MB of KV state in FP16, completely negligible against 840 MB of FP16 weights. The KV cache is not the bottleneck for M2M-100 418M on short inputs; weight memory is.
So why discuss it? Three reasons:
Longer documents matter. Paragraph-level translation at T = 512 puts about 100 MB of FP16 KV state on the 1.2B model's per-step read path, and document-level translation at T = 2048 puts about 400 MB there, which is no longer negligible next to the weights. Many real translation workloads are not single sentences.
The 12B variant matters. Total KV state reaches 96 MiB at T=128 on the 12B model (48 MiB each for self- and cross-attention), and the model already strains consumer NPU memory at INT4. KV can be the difference between fitting and not.
The principle generalizes. Every other encoder-decoder seq2seq model — NLLB-200, MarianMT, FLAN-T5, Whisper — has the same architectural problem and the same lack of GQA. If you take one thing from this chapter, it's that modern attention optimizations don't apply to 2020-era seq2seq, and you need to plan around it.
What This Section Bought You
You should now understand:
- The KV cache formula for encoder-decoder models: self-attention plus cross-attention, both scaling with layers, heads, head_dim, and sequence length
- M2M-100 1.2B has the same per-token decoder self-attention KV footprint as Phi-3-mini-3.8B despite having under a third of the parameters, because Phi-3 uses GQA
- The KV cache wall is set by attention design, not parameter count — and M2M-100's full MHA puts it permanently on the wrong side of the wall
- Only KV cache quantization is available as a lever for M2M-100; the modern optimizations (GQA, MQA, MLA) require retraining
- For short-context translation the KV cache is negligible vs weight memory; for long-context translation it dominates
- Cross-attention KV is the M2M-100-specific cost that adds to (not replaces) self-attention KV every decoder step
The next section moves from theory to engineering: how do you actually manage KV cache on Intel NPU through OpenVINO GenAI, and what does the prefix-caching / chunked-prefill / static-shape stack do for you (and to you)?