2.1 Context Windows and the Memory Wall
The agent's state (what it remembers from past steps and what it uses to make the next decision) is the bridge between hardware constraints and agent behavior. Chapter 1 established the constraints; this section is about working inside them, and specifically about the memory wall: why it exists, what it means in numbers, and how to budget for it in the agent loop.
The two key state mechanisms are the KV cache (the attention memory of the prefill and decode phases) and the context window (the prompt that feeds the next prefill). They're distinct costs with different scaling properties, and conflating them is a common design mistake. The most important number for an agent designer to keep in their head is how much memory the KV cache eats per token, because that number multiplied by your context length is the wall you hit. On Intel NPU with M2M-100, that wall has a specific shape, and it's set by an architectural choice Meta made in 2020 that no amount of clever serving can paper over: M2M-100 uses full multi-head attention with no GQA.
The KV Cache and Its Footprint
The KV (key-value) cache is the core optimization of autoregressive LLM inference: instead of recomputing the attention keys and values for every token position on every decoding step, you compute them once and keep them in memory. On the second token, you use the KV from token 1 plus the new KV for token 2. On the third token, you use KVs from tokens 1–2 plus the new one. This is why each decode step is so much cheaper than re-running prefill over the whole sequence: you're amortizing the work.
The KV cache lives in DRAM. Per layer, each of the K and V tensors is dimensioned by [batch_size, num_heads, seq_len, head_dim]. For a typical transformer:
batch_size: 1 on NPU (Chapter 1.3)
num_heads: 16 (common)
seq_len: grows from 1 to context_length as you decode
head_dim: 64 (common)
Per-token, per-layer KV cache footprint = batch × num_heads × head_dim × 2 (K + V) × dtype_bytes.
For M2M-100 1.2B (16 heads, 64 head_dim) at FP16 (2 bytes): 1 × 16 × 64 × 2 × 2 = 4,096 bytes, or 4 KB per token per layer.
But M2M-100 is encoder-decoder, so there's a second KV cache: the encoder output, which the decoder's cross-attention reads at every step. In an encoder-decoder transformer like M2M-100, both the decoder self-attention cache and the cross-attention cache contribute, and for M2M-100 specifically both are the same per-layer size, because the model doesn't compress KV heads.
The formula is:
KV_self = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
Total = KV_self + KV_cross
The factor of 2 is for K and V tensors. L_dec is the number of decoder layers. The cross-attention KV is computed once over the full encoder output (length T_enc) during prefill and reused on every decoder step, so it doesn't grow during decode; the self-attention KV grows with T_dec as we generate. At T_dec = T_enc the two are identical in size.
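The two formulas are easy to turn into a small calculator. The following is a minimal sketch, not code from any serving stack: the `DecoderConfig` class and `kv_bytes` function are illustrative names, and the example uses the M2M-100 1.2B shape (24 decoder layers, 16 heads, head_dim 64) at FP16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Shape parameters that determine KV cache size (illustrative names)."""
    n_dec_layers: int  # L_dec
    n_heads: int       # KV heads; equals the attention head count under full MHA
    head_dim: int


def kv_bytes(cfg: DecoderConfig, t_enc: int, t_dec: int, dtype_bytes: int = 2) -> dict:
    """Return self-attention, cross-attention, and total KV cache bytes."""
    # Bytes added per cached token across all decoder layers (factor 2 = K and V).
    per_token = 2 * cfg.n_dec_layers * cfg.n_heads * cfg.head_dim * dtype_bytes
    kv_self = per_token * t_dec   # grows as tokens are generated
    kv_cross = per_token * t_enc  # fixed once the encoder has run
    return {"self": kv_self, "cross": kv_cross, "total": kv_self + kv_cross}


MIB = 1024 * 1024
m2m_1_2b = DecoderConfig(n_dec_layers=24, n_heads=16, head_dim=64)
sizes = kv_bytes(m2m_1_2b, t_enc=128, t_dec=128)
print({name: value / MIB for name, value in sizes.items()})
# → {'self': 12.0, 'cross': 12.0, 'total': 24.0}
```

Passing `dtype_bytes=1` models an INT8 cache, and changing `t_enc` and `t_dec` independently shows how the fixed cross-attention term and the growing self-attention term diverge.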
M2M-100 KV Footprint, Specific Numbers
The configurations come straight from the HuggingFace model cards:
M2M-100 418M: 12 encoder layers, 12 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
M2M-100 1.2B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 64.
M2M-100 12B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 256 (embed_dim 4096 / 16 heads).
At T_enc = T_dec = 128 (a sentence-level translation working point):
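M2M-100 418M: 12 layers × 128 tokens × 4 KB per token per layer = 6 MiB of self-attention KV, plus 6 MiB of cross-attention KV, 12 MiB total at FP16.
M2M-100 1.2B: 24 layers × 128 tokens × 4 KB per token per layer = 12 MiB of self-attention KV, plus 12 MiB of cross-attention KV over 128 source tokens, 24 MiB total at FP16.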
M2M-100 12B, same shape: roughly 96 MiB FP16 (head_dim balloons to 256, which is the dominant scaling factor).
For sentence-level translation these numbers are small — they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows: at T_enc = T_dec = 1024 the 1.2B model's KV state crosses 200 MB at FP16, and the cross-attention component dominates because translating long source documents keeps that full encoder output live in memory the entire time.
The Full-MHA Tax — The Headline Insight
Here's the comparison that should be the takeaway from this chapter:
Per-token decoder self-attention KV bytes:
M2M-100 1.2B at FP16: 2 · 24 · 16 · 64 · 2 = 98,304 bytes/token
Phi-3-mini-3.8B with GQA-8 (grouped-query attention, 8 KV heads) at FP16: 2 · 32 · 8 · 96 · 2 = 98,304 bytes/token
These are identical to the byte.
A 1.2-billion-parameter encoder-decoder translation model from 2020 has the same per-token decoder self-attention KV footprint as a modern 3.8-billion-parameter decoder-only LLM, because Phi-3 uses Grouped Query Attention with one-quarter the KV heads. And M2M-100 carries cross-attention KV at the same per-layer size on top, which Phi-3 does not have at all. M2M-100 has full MHA and pays the bandwidth price.
The architectural conclusion is direct: the KV cache wall on Lunar Lake is set by attention design, not parameter count. Phi-3 deploys to NPU comfortably at 4K context. M2M-100 1.2B at 1K context exerts the same per-token KV bandwidth pressure on the LPDDR5X bus.
This is what we mean when we say M2M-100 is "expensive per parameter": not in FLOPs or weight memory, but in the bandwidth its decoder consumes per generated token. The fix is GQA. The fix requires retraining. Nobody has retrained M2M-100 with GQA. So we live with it.
The Attention Wall
The attention wall is simple to state: at some context length, the KV cache's bandwidth demand exceeds what the NPU can sustain. On Lunar Lake with 136.5 GB/s platform bandwidth, and given the 18% utilization we saw in Chapter 1.3, the effective bandwidth available to the NPU is roughly 136.5 × 0.18 ≈ 25 GB/s.
For the M2M-100 1.2B decoder at FP16, each decode step reads the full KV cache, about 25 MB at T_enc = T_dec = 128. Saturating 25 GB/s with KV traffic alone would take roughly 1,000 decode steps per second. Real decode rates are well below that, so the M2M-100 KV cache isn't the bottleneck yet. The wall appears at much larger context lengths or larger models.
This section's working hypothesis is that the KV cache wall appears somewhere between 2K and 8K tokens for typical 8B models on Lunar Lake, depending on model architecture. Intel's validated 8K context "preview" on Lunar Lake is right at that edge. The wall doesn't mean you can't have 8K; it means you're committing to recompute, sliding windows, or multi-GPU distribution to stay within a latency budget.
Context Window vs. KV Cache
A critical distinction: the context window is what the model can attend to; the KV cache is what you must keep in memory.
For a decoder-only model like Llama 2 70B, the KV cache at full context is 2 (K+V) × layers × KV heads × head_dim × tokens × bytes per element. With Llama 2 70B's configuration (80 layers, 8 KV heads under GQA, head_dim 128) at FP16 and 8K tokens, that is 2 × 80 × 8 × 128 × 8,192 × 2 bytes ≈ 2.5 GiB for a single sequence at full context.
On top of roughly 140 GB of FP16 weights, that doesn't fit on a single Lunar Lake. The roofline says: if you want 8K context with 70B, you compress the model (quantize), shard it (multi-GPU), or use a sliding window (throw away old context).
Modern Attention Optimizations and Why M2M-100 Doesn't Get Them
The KV cache footprint has driven roughly five years of architectural innovation in decoder-only LLMs, and M2M-100 predates all of it:
Grouped Query Attention (GQA) shares K and V across groups of query heads, typically 4 or 8 query heads per KV head. Llama 2 70B, Llama 3, Phi-3 use GQA. Scales KV size by n_kv / n_heads relative to full MHA. M2M-100 has no GQA.
Multi-Query Attention (MQA) shares a single K and V head across all query heads. M2M-100 has no MQA.
Multi-head Latent Attention (MLA) compresses K and V into a low-rank latent space, decompressing only at attention time. DeepSeek-V2 and V3 use MLA. M2M-100 has no MLA.
KV cache quantization drops the cache from FP16 to INT8 (or below). Halves bandwidth at modest quality cost. Works on any model. This is the lever you can pull for M2M-100.
The honest summary: of the four major KV optimizations, only the last one, cache quantization, is available to M2M-100. INT8 KV halves your bandwidth pressure and roughly doubles your effective context length before hitting the bandwidth wall. Use it.
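To see how much each lever moves the per-token number, the sketch below evaluates the per-token formula for M2M-100 1.2B as shipped, with an INT8 cache, and for a GQA-4 variant. The GQA variant is purely hypothetical (no such M2M-100 checkpoint exists); it is included only to show what retraining would buy. The 192 MiB budget is likewise an arbitrary illustrative figure.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of K and V appended to the cache for each new token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


# M2M-100 1.2B decoder as shipped: full MHA, FP16 cache.
m2m_fp16 = kv_bytes_per_token(24, 16, 64, dtype_bytes=2)
# Same model with an INT8 KV cache (the one lever that needs no retraining).
m2m_int8 = kv_bytes_per_token(24, 16, 64, dtype_bytes=1)
# Hypothetical GQA-4 retrain (4 KV heads instead of 16); not a real checkpoint.
m2m_gqa4_fp16 = kv_bytes_per_token(24, 4, 64, dtype_bytes=2)

# How many cached tokens fit in an arbitrary 192 MiB KV budget.
budget = 192 * 1024 * 1024
tokens_fp16 = budget // m2m_fp16
tokens_int8 = budget // m2m_int8

print(m2m_fp16, m2m_int8, m2m_gqa4_fp16)  # → 98304 49152 24576
print(tokens_fp16, tokens_int8)           # → 2048 4096
```

Halving the element size doubles the number of cached tokens that fit in the same budget, which is the "roughly doubles your effective context length" claim in concrete form.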
The Bandwidth Wall, Quantified
Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB, ceiling 34 tok/s.
The KV cache adds to this. For M2M-100 1.2B at FP16 generating a long output, the per-token weight read is ~2.4 GB (the FP16 weights), the per-token self-attention KV read grows from ~100 KB at token 1 to ~100 MB by token 1000 (about 96 KB per cached token), and the cross-attention KV is read in full every step. The effective bandwidth-per-token is dominated by weights for moderate contexts and only crosses over into the KV-dominated regime when the cached KV approaches the weight size, which at 96 KB per token against 2.4 GB of weights is on the order of 25,000 cached tokens. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.
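The weight-versus-KV trade-off can be explored with a rough bandwidth model. This is a sketch under simplifying assumptions: every decode step streams all weights plus the whole KV cache exactly once, and activations, on-chip caching, and compute time are ignored. The function names are illustrative, and the constants are the figures quoted in this section.

```python
def decode_bytes_per_step(weight_bytes: float, kv_bytes_per_token: int,
                          t_enc: int, t_dec: int) -> float:
    """Bytes streamed from DRAM for one decode step under this simple model."""
    self_kv = kv_bytes_per_token * t_dec   # grows as tokens are generated
    cross_kv = kv_bytes_per_token * t_enc  # fixed, read in full every step
    return weight_bytes + self_kv + cross_kv


def tok_per_s_ceiling(bandwidth_bytes_per_s: float, bytes_per_step: float) -> float:
    """Upper bound on decode rate if memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / bytes_per_step


BANDWIDTH = 136.5e9        # Lunar Lake LPDDR5X-8533, shared, bytes/s
M2M_WEIGHTS = 2.4e9        # M2M-100 1.2B at FP16
M2M_KV_PER_TOKEN = 98_304  # 2 * 24 layers * 16 heads * 64 head_dim * 2 bytes

# The 8B INT4 reference point: 4 GB of weights per token.
ceiling_8b_int4 = tok_per_s_ceiling(BANDWIDTH, 4e9)

# Share of per-step traffic that is KV, at sentence and document scale.
sentence = decode_bytes_per_step(M2M_WEIGHTS, M2M_KV_PER_TOKEN, 128, 128)
document = decode_bytes_per_step(M2M_WEIGHTS, M2M_KV_PER_TOKEN, 2048, 2048)
kv_share_sentence = (sentence - M2M_WEIGHTS) / sentence
kv_share_document = (document - M2M_WEIGHTS) / document

# Cached tokens (source plus generated) at which KV traffic equals weight traffic.
crossover_tokens = int(M2M_WEIGHTS // M2M_KV_PER_TOKEN)

print(round(ceiling_8b_int4), crossover_tokens)  # → 34 24414
print(f"{kv_share_sentence:.1%} {kv_share_document:.1%}")
```

Swapping in INT4 or INT8 weight sizes for `M2M_WEIGHTS` moves the crossover proportionally earlier, which is why weight quantization makes the KV cache matter sooner.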
Does Any of This Matter for Short-Context Translation?
For a single English-to-French sentence (T_enc ≈ 32, T_dec ≈ 32), M2M-100 418M has about 3 MB of KV state in FP16, completely negligible against 840 MB of FP16 weights. The KV cache is not the bottleneck for M2M-100 418M on short inputs; weight memory is.
So why discuss it? Three reasons:
Longer documents matter. Paragraph-level translation at T = 512 puts you in the regime where KV cache starts to compete with weight memory for bandwidth, and document-level translation at T = 2048 pushes further. Many real translation workloads are not single sentences. For M2M-100 1.2B at 128 tokens, the KV cache is about 25 MB, which fits easily. At 2K tokens, it's about 400 MB (2K ÷ 128 × 25 MB). At 8K, it's 1.6 GB: still under the 4–8 GB weight memory budget, but now you're committing real DRAM.
The 12B variant matters. KV state reaches 96 MiB at T=128 on the 12B model (self-attention plus cross-attention), and the model already strains consumer NPU memory at INT4. KV is the difference between fitting and not.
The principle generalizes. Every other encoder-decoder seq2seq model (NLLB-200, MarianMT, FLAN-T5, Whisper) has the same architectural problem and the same lack of GQA. If you take one thing from this chapter, it's that modern attention optimizations don't apply to 2020-era seq2seq, and you need to plan around that.
The practical implication: the agent's window (what it can see in a single prompt) is bounded by KV cache size, not by model capability. An 8B model trained on 8K context can't actually use that context on NPU if the KV cache doesn't fit.
Implications for Agent Design
Three consequences flow from this:
1. Bounded context is a feature, not a limitation. If your agent loops (agent thinks → acts → observes), and the context window is fixed at, say, 1K tokens, then the agent's working memory is fixed. Every observation older than 1K tokens falls off the window. This forces a design choice: either the agent uses only recent observations (myopic), or long-term memory lives outside the model in a vector store or database (Chapter 2.3).
2. KV cache reuse is precious. In the M2M-100 pattern (encoder-decoder), the encoder output is computed once and its cross-attention KV is reused throughout decode. In a chatbot where the user query is short but the response is long, this is efficient. In a long-conversation scenario where both sides grow, every new user message requires a re-encode. This is why copy-on-write KV cache techniques (keeping separate buffers for user messages that don't change) matter.
3. The sliding-window technique (Phi Silica's N=64 approach from Chapter 1.3) is a deliberate trade: throw away the oldest tokens' KVs to free DRAM, then recompute them if you need to backtrack. On NPU where compute is cheaper than bandwidth (relatively speaking), this is a valid trade. On GPU where compute is expensive relative to DRAM, it usually isn't.
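The first consequence, a fixed working memory that drops its oldest observations, can be sketched as a token-budgeted queue. Everything here is hypothetical scaffolding: `BoundedContext` is not an API from any agent framework, the whitespace token counter stands in for a real tokenizer, and the `evicted` list stands in for the external long-term store.

```python
from collections import deque


class BoundedContext:
    """Keeps the most recent observations that fit in a fixed token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self._items = deque()  # (text, token_count), oldest first
        self._used = 0
        self.evicted = []      # stand-in for a vector store or database

    @staticmethod
    def count_tokens(text: str) -> int:
        # Crude whitespace count; a real agent would use the model's tokenizer.
        return len(text.split())

    def add(self, text: str) -> None:
        n = self.count_tokens(text)
        self._items.append((text, n))
        self._used += n
        # Evict oldest entries until the budget holds. The newest entry is
        # always kept, even if it alone exceeds the budget.
        while self._used > self.max_tokens and len(self._items) > 1:
            old_text, old_n = self._items.popleft()
            self._used -= old_n
            self.evicted.append(old_text)

    def prompt(self) -> str:
        return "\n".join(text for text, _ in self._items)


ctx = BoundedContext(max_tokens=8)
ctx.add("observe door is locked")  # 4 tokens
ctx.add("act pick the lock")       # 4 tokens, budget exactly full
ctx.add("observe door opens")      # 3 tokens, forces one eviction
print(ctx.prompt())
print(ctx.evicted)  # → ['observe door is locked']
```

The design choice described in the text shows up in what happens to `evicted`: discard it and the agent is myopic, or index it externally and retrieve on demand.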
How Intel's "8K Validated Preview" Works
Intel's announcement that Lunar Lake supports "8K context" (Chapter 1.2's static-shape discussion) is narrowly true: the compiler can emit a static-shape graph for 8K, and it runs without crashing. What's not guaranteed is latency.
The 8K window likely uses chunked prefill (process 1K chunks at a time) and either sliding-window KV for decode or hybrid compute-cache layering (let the CPU assist with KV management). The "preview" designation means it's not validated for production; the team is still characterizing it.
For agent design, treat 8K as the ceiling, not the target. A 1K–2K working memory is reliable; 4K–8K requires careful modeling and testing; beyond 8K requires either multi-GPU or architectural workarounds.
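Chunked prefill, which the text above only suggests is likely in use, amounts to feeding a long prompt through a fixed-shape graph one slice at a time while the KV cache accumulates. The sketch below shows only the slicing step. `prefill_chunks` is an illustrative helper, not part of OpenVINO or any Intel API, and the 1K chunk size is taken from the text's example.

```python
from typing import Iterator, Sequence


def prefill_chunks(token_ids: Sequence[int], chunk_size: int = 1024) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of the prompt, each at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        yield token_ids[start:start + chunk_size]


prompt = list(range(8192))  # placeholder token ids for an 8K prompt
chunks = list(prefill_chunks(prompt))
print(len(chunks), len(chunks[0]))  # → 8 1024
```

In a real pipeline each chunk would be run against the KV cache produced by the earlier chunks, and a final short chunk would typically be padded to the static shape.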
What This Section Bought You
You should now understand:
The KV cache formula for encoder-decoder models: self-attention plus cross-attention, both scaling with layers, heads, head_dim, sequence length, and dtype size.
The next section moves from theory to engineering: how do you actually manage KV cache on Intel NPU through OpenVINO GenAI, what does the prefix-caching / chunked-prefill / static-shape stack do for you (and to you), and, given bounded context and bounded KV cache, what patterns work for multi-step agents?