2.1 Context Windows and the Memory Wall

Chapter 1 established the constraints. This chapter is about working inside them. The most important number an agent designer keeps in their head is how much memory the KV cache eats per token, because that number multiplied by your context length is the wall you hit. On an Intel NPU with M2M-100, that wall has a specific shape, and it's set by an architectural choice Meta made in 2020 that no amount of clever serving can paper over: M2M-100 uses full multi-head attention with no GQA.

The Cost of Remembering

A transformer's KV cache holds the key and value tensors from every layer's attention computation, for every token in the context. The model uses this cache to avoid recomputing the keys and values of earlier tokens on every decode step.

For a single attention stack, its size in bytes is:

kv_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_value

The KV Cache Formula

For an encoder-decoder transformer like M2M-100, the per-step decoder state contains two KV caches: self-attention (the decoder attending to its own previous tokens) and cross-attention (the decoder attending to the encoder output). Both contribute, and for M2M-100 specifically both are the same per-layer size, because the model doesn't compress KV heads.

The formula is:

KV_self  = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
Total    = KV_self + KV_cross

The factor of 2 is for K and V tensors. L_dec is the number of decoder layers. The cross-attention KV is computed once over the full encoder output (length T_enc) and reused on every decoder step; the self-attention KV grows with T_dec as we generate.
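As a quick sanity check on the formula, here is a minimal Python sketch. The function name `encdec_kv_bytes` and its parameter names are illustrative rather than any library's API, and the example shape (12 decoder layers, 16 heads, head_dim 64, FP16) is an assumed configuration chosen only to exercise the arithmetic.

```python
def encdec_kv_bytes(l_dec, n_heads, head_dim, t_enc, t_dec, bytes_per_value=2):
    """Return (KV_self, KV_cross, Total) in bytes for an encoder-decoder model.

    Hypothetical helper: it mirrors the formula in the text and nothing more.
    """
    # Bytes held per cached position: K and V, for every decoder layer and head.
    per_position = 2 * l_dec * n_heads * head_dim * bytes_per_value
    kv_self = per_position * t_dec    # grows by one position per generated token
    kv_cross = per_position * t_enc   # fixed once the encoder has run
    return kv_self, kv_cross, kv_self + kv_cross


# Assumed example shape: 12 decoder layers, 16 heads, head_dim 64, FP16 values.
kv_self, kv_cross, total = encdec_kv_bytes(12, 16, 64, t_enc=128, t_dec=128)
print(f"self={kv_self / 1e6:.2f} MB  cross={kv_cross / 1e6:.2f} MB  total={total / 1e6:.2f} MB")
```

Passing `bytes_per_value=1` models an INT8 cache; the shape terms stay untouched, which is why quantization is the one knob that does not require a different architecture.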

M2M-100 KV Footprint, Specific Numbers

The configurations come straight from the Hugging Face model cards:

- M2M-100 418M: 12 encoder layers, 12 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
- M2M-100 1.2B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 64.
- M2M-100 12B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 256 (embed_dim 4096 / 16 heads).

At T_enc = T_dec = 128 (a sentence-level translation working point):

M2M-100 418M

Precision	Self-attn KV	Cross-attn KV	Total
FP16	6.29 MB	6.29 MB	12.58 MB
INT8 KV	3.15 MB	3.15 MB	6.29 MB

M2M-100 1.2B

Precision	Self-attn KV	Cross-attn KV	Total
FP16	12.58 MB	12.58 MB	25.17 MB
INT8 KV	6.29 MB	6.29 MB	12.58 MB

M2M-100 12B, same shape: roughly 101 MB (96 MiB) in total at FP16 (head_dim balloons to 256, which is the dominant scaling factor).

For sentence-level translation these numbers are small — they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows: at T_enc = T_dec = 1024 the 1.2B model's KV state crosses 200 MB at FP16, and the cross-attention component dominates because translating long source documents keeps that full encoder output live in memory the entire time.
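To see how the footprint moves as context grows, the following sketch sweeps T_enc = T_dec over a few lengths for the three decoder shapes listed in this section. The names `MODELS`, `total_kv_bytes`, and `kv_table` are hypothetical, and the chosen lengths are arbitrary sample points, not measured workloads.

```python
# Decoder shapes as listed in this section: (decoder_layers, n_heads, head_dim).
MODELS = {
    "M2M-100 418M": (12, 16, 64),
    "M2M-100 1.2B": (24, 16, 64),
    "M2M-100 12B": (24, 16, 256),
}


def total_kv_bytes(shape, t_enc, t_dec, bytes_per_value=2):
    """Self-attention plus cross-attention KV bytes for one sequence."""
    l_dec, n_heads, head_dim = shape
    per_position = 2 * l_dec * n_heads * head_dim * bytes_per_value
    return per_position * (t_enc + t_dec)


def kv_table(lengths=(32, 128, 512, 1024, 2048)):
    """Return (model, T, megabytes) rows for T_enc = T_dec = T at FP16."""
    rows = []
    for name, shape in MODELS.items():
        for t in lengths:
            rows.append((name, t, total_kv_bytes(shape, t, t) / 1e6))
    return rows


for name, t, mb in kv_table():
    print(f"{name:<13} T_enc=T_dec={t:<5} FP16 KV = {mb:9.2f} MB")
```

Because the total is linear in T_enc + T_dec, every doubling of context doubles the figure; only the per-position constant differs between models.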

The Full-MHA Tax — The Headline Insight

Here's the comparison that should be the takeaway from this chapter:

Per-token decoder self-attention KV bytes:

M2M-100 1.2B at FP16: 2 · 24 · 16 · 64 · 2 = 98,304 bytes/token
Phi-3-mini-3.8B with GQA-8 at FP16: 2 · 32 · 8 · 96 · 2 = 98,304 bytes/token

These are identical to the byte.

A 1.2-billion-parameter encoder-decoder translation model from 2020 has the same per-token decoder self-attention KV footprint as a modern 3.8-billion-parameter decoder-only LLM, because Phi-3 uses Grouped Query Attention with one-quarter the KV heads. And M2M-100 carries cross-attention KV at the same per-layer cost on top, which Phi-3 does not have at all.
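The per-token comparison is easy to reproduce. This sketch takes the layer counts, KV-head counts, and head dimensions exactly as quoted in the text (it does not re-verify them against published model configs), and `per_token_kv_bytes` is a hypothetical helper name. The third line is an illustration only: what the quoted 32-layer shape would cost if all 32 heads were cached, which is what "one-quarter the KV heads" implies.

```python
def per_token_kv_bytes(layers, kv_heads, head_dim, bytes_per_value=2):
    """Self-attention KV bytes appended to the cache per generated token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Shapes as quoted in the comparison; treated here as given.
m2m_1_2b = per_token_kv_bytes(layers=24, kv_heads=16, head_dim=64)   # full MHA
phi3_gqa8 = per_token_kv_bytes(layers=32, kv_heads=8, head_dim=96)   # 8 KV heads

# Illustration: the same 32-layer shape with all 32 heads cached (no GQA).
same_shape_full_mha = per_token_kv_bytes(layers=32, kv_heads=32, head_dim=96)

print(f"M2M-100 1.2B:        {m2m_1_2b:,} bytes/token")
print(f"Quoted GQA-8 shape:  {phi3_gqa8:,} bytes/token")
print(f"Same shape, no GQA:  {same_shape_full_mha:,} bytes/token "
      f"({same_shape_full_mha // phi3_gqa8}x)")
```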

The architectural conclusion is direct: the KV cache wall is set by attention design, not parameter count. Phi-3 deploys to NPU comfortably at 4K context. M2M-100 1.2B at 1K context exerts the same per-token KV bandwidth pressure on the LPDDR5X bus.

This is what we mean when we say M2M-100 is "expensive per parameter" — not in FLOPs or weight memory, but in the bandwidth its decoder consumes per generated token. The fix is GQA. The fix requires retraining. Nobody has retrained M2M-100 with GQA. So we live with it.

Modern Attention Optimizations and Why M2M-100 Doesn't Get Them

The KV cache footprint has driven roughly five years of architectural innovation in decoder-only LLMs, and M2M-100 predates all of it:

Grouped Query Attention (GQA) shares K and V across groups of query heads, typically 4 or 8 query heads per KV head. Llama 2 70B, Llama 3, and Phi-3 use GQA. It scales KV size by a factor of n_kv / n_heads. M2M-100 has no GQA.

Multi-Query Attention (MQA) is GQA's extreme — one KV head shared by all queries. Falcon-7B uses MQA. M2M-100 has no MQA.

Multi-head Latent Attention (MLA) compresses K and V into a low-rank latent space, decompressing only at attention time. DeepSeek-V2 and V3 use MLA. M2M-100 has no MLA.

KV cache quantization drops the cache from FP16 to INT8 (or below). Halves bandwidth at modest quality cost. Works on any model. This is the lever you can pull for M2M-100.

The honest summary: of the four major KV optimizations, only the last one — cache quantization — is available to M2M-100. INT8 KV halves your bandwidth pressure and roughly doubles your effective context length before hitting the bandwidth wall. Use it.
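To make "INT8 KV halves your bandwidth" concrete, here is a minimal sketch of symmetric per-vector INT8 quantization in plain Python. It is an assumption-laden illustration, not the scheme used by OpenVINO or any particular runtime: the function names are hypothetical, the sample vector is made up, and storing one FP16 scale per vector is an assumed layout.

```python
def quantize_int8(vector):
    """Symmetric INT8 quantization of one K or V vector: returns (codes, scale)."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate floating-point values from INT8 codes."""
    return [c * scale for c in codes]


# Made-up slice of a key vector, for illustration only.
key_slice = [0.12, -0.87, 0.45, 0.03, -0.31, 0.66, -0.02, 0.98]
codes, scale = quantize_int8(key_slice)
restored = dequantize_int8(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(key_slice, restored))

# Storage for one head_dim=64 vector: FP16 versus INT8 codes plus one FP16 scale.
head_dim = 64
fp16_bytes = head_dim * 2
int8_bytes = head_dim * 1 + 2

print(f"worst round-trip error: {worst_error:.5f} (bound: {scale / 2:.5f})")
print(f"bytes per vector: FP16={fp16_bytes}, INT8={int8_bytes}")
```

The rounding error per element is bounded by half the scale, and the per-vector scale adds a small overhead on top of the halved payload, so the saving is slightly under 2x in this layout.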

The Bandwidth Wall, Quantified

Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB, which puts the ceiling at 34 tok/s.

The KV cache adds to this. For M2M-100 1.2B at FP16 generating a long output, the per-token weight read is ~2.4 GB (the FP16 decoder weights), the per-token self-attention KV read grows by about 98 KB for every cached token, from near-zero at token 1 to roughly 100 MB by token 1000, and the cross-attention KV is read in full every step. The effective bandwidth-per-token is dominated by weights for moderate contexts and only crosses over into the KV-dominated regime above several thousand tokens of decoded output. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.
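The bandwidth arithmetic can be sketched in a few lines. The constants are the figures quoted in the text (136.5 GB/s, 4 GB for an 8B INT4 model, ~2.4 GB of FP16 weights and 98,304 KV bytes per cached position for M2M-100 1.2B); the source length of 1000 tokens is an assumed example, and the function names are hypothetical. The model is deliberately crude: it assumes every decode step is purely bandwidth-limited.

```python
BANDWIDTH_BYTES_PER_S = 136.5e9   # LPDDR5X-8533 figure quoted in the text


def decode_ceiling_tok_s(bytes_read_per_token):
    """Upper bound on decode rate if each step is limited only by memory reads."""
    return BANDWIDTH_BYTES_PER_S / bytes_read_per_token


# 8B parameters at INT4 (0.5 byte each) means 4 GB streamed per token.
print(f"8B INT4, weights only: {decode_ceiling_tok_s(8e9 * 0.5):.1f} tok/s")

# M2M-100 1.2B at FP16, using the text's figures.
WEIGHT_BYTES = 2.4e9
KV_BYTES_PER_POSITION = 98_304


def m2m_step_bytes(t_dec, t_enc):
    """Bytes read in one decode step: weights + self-attn KV + cross-attn KV."""
    return WEIGHT_BYTES + KV_BYTES_PER_POSITION * (t_dec + t_enc)


for t in (1, 100, 1000):
    step = m2m_step_bytes(t_dec=t, t_enc=1000)   # assumed 1000-token source
    kv_share = 1 - WEIGHT_BYTES / step
    print(f"token {t:>4}: {step / 1e9:.3f} GB/step, "
          f"KV share {kv_share:.1%}, ceiling {decode_ceiling_tok_s(step):.1f} tok/s")
```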


Does Any of This Matter for Short-Context Translation?

For a single English-to-French sentence (T_enc ≈ 32, T_dec ≈ 32), M2M-100 418M has about 3 MB of KV state in FP16 — completely negligible against 840 MB of FP16 weights. The KV cache is not the bottleneck for M2M-100_418M on short inputs; weight memory is.

So why discuss it? Three reasons:

Longer documents matter. Paragraph-level translation at T = 512 puts you in the regime where KV cache starts to compete with weight memory for bandwidth. Document-level translation at T = 2048 is firmly KV-dominated. Many real translation workloads are not single sentences.

The 12B variant matters. KV state reaches roughly 96 MiB at T = 128 on the 12B model, half of it cross-attention, and the model already strains consumer NPU memory at INT4. KV is the difference between fitting and not.

The principle generalizes. Every other encoder-decoder seq2seq model (NLLB-200, MarianMT, FLAN-T5, Whisper) has the same architectural problem and the same lack of GQA. If you take one thing from this chapter, it's that modern attention optimizations don't apply to 2020-era seq2seq, and you need to plan around it.

What This Section Bought You

You should now understand:

- The KV cache formula for encoder-decoder models: self-attention plus cross-attention, both scaling with layers, heads, head_dim, and sequence length
- M2M-100 1.2B has the same per-token KV bandwidth as Phi-3-mini-3.8B despite having a third of the parameters, because Phi-3 uses GQA
- The KV cache wall is set by attention design, not parameter count, and M2M-100's full MHA puts it permanently on the wrong side of the wall
- Only KV cache quantization is available as a lever for M2M-100; the modern optimizations (GQA, MQA, MLA) require retraining
- For short-context translation the KV cache is negligible vs weight memory; for long-context translation it dominates
- Cross-attention KV is the M2M-100-specific cost that adds to (not replaces) self-attention KV every decoder step

The next section moves from theory to engineering: how do you actually manage KV cache on Intel NPU through OpenVINO GenAI, and what does the prefix-caching / chunked-prefill / static-shape stack do for you (and to you)?

Next: 2.2 KV Cache Engineering