2.1 Context Windows and the Memory Wall

The agent's state (what it remembers from past steps and what it uses to make the next decision) is the bridge between hardware constraints and agent behavior. This section is about the memory wall: why it exists, what it means in numbers, and how to budget for it in the agent loop.

The two key state mechanisms are the KV cache (the prefill and decode phases' attention memory) and the context window (the prompt that feeds the next prefill). They're distinct costs with different scaling properties, and conflating them is a common design mistake.

The KV Cache and Its Footprint

The KV (key-value) cache is the core optimization of autoregressive LLM inference: instead of recomputing the attention keys and values for every token position on every decoding step, you compute them once and keep them in memory. On the second token, you use the KV from token 1 plus the new KV for token 2. On the third token, you use KVs from tokens 1–2 plus the new one. This is why each decode step is so much cheaper than reprocessing the whole sequence: you're amortizing the work.

The KV cache lives in DRAM and is dimensioned by [batch_size, num_heads, seq_len, head_dim]. For a typical transformer:

batch_size: 1 on NPU (Chapter 1.3)
num_heads: 16 (common)
seq_len: grows from 1 to context_length as you decode
head_dim: 64 (common)

Per-token, per-layer KV cache footprint = batch × num_heads × head_dim × 2 (K + V) × dtype_bytes.

For M2M-100 1.2B (16 heads, 64 head_dim) at FP16 (2 bytes):

Per token per layer: 1 × 16 × 64 × 2 × 2 = 4,096 bytes = 4 KB
M2M-100 1.2B has 24 encoder + 24 decoder layers. Only the 24 decoder layers hold KV state, but each holds two caches (self-attention and cross-attention), so with equal source and target lengths the decoder keeps 48 × 4 KB = 192 KB per token position
At 128-token context: 128 tokens × 192 KB = 24,576 KB ≈ 25 MB per sequence
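The arithmetic above is easy to get wrong by a factor of two, so here is a minimal sketch of it. The function name is hypothetical, and the layer count assumes that each of the 24 decoder layers holds one self-attention and one cross-attention cache of equal length.

```python
def kv_bytes_per_token_per_layer(num_heads: int, head_dim: int,
                                 dtype_bytes: int = 2, batch: int = 1) -> int:
    """Bytes of K and V stored for one token position in one attention layer."""
    return batch * num_heads * head_dim * 2 * dtype_bytes  # 2 = K and V


# Figures quoted in the text for M2M-100 1.2B: 16 heads, head_dim 64, FP16.
per_layer = kv_bytes_per_token_per_layer(num_heads=16, head_dim=64)

# Assumption: 24 decoder layers, each with a self- and a cross-attention cache.
per_token = per_layer * 24 * 2

# 128-token working point, source and target lengths taken as equal.
per_sequence = per_token * 128

print(per_layer)                    # 4096
print(per_token // 1024)            # 192 (KB)
print(round(per_sequence / 1e6, 1)) # 25.2 (decimal MB)
```

Switching `dtype_bytes` to 1 models an INT8 cache and halves every figure.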

Because M2M-100 is encoder-decoder, the decoder carries a second KV cache: the encoder output, which the decoder's cross-attention reads at every step. The encoder KV is computed once (during prefill) and reused throughout decode, so it doesn't grow with seq_len, but it matches the size the self-attention KV reaches when the decoded length equals the encoder context length.

KV_self  = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
Total    = KV_self + KV_cross

Here L_dec is the number of decoder layers, T_enc is the encoder (source) length, T_dec is the number of tokens decoded so far, and the factor of 2 covers the K and V tensors.

M2M-100 KV Footprint, Specific Numbers

Full decoder KV footprint for M2M-100 1.2B at a T=128 token decoder context and encoder length L=128:

Self-attention KV: 24 layers × 128 tokens × 4 KB ≈ 12.6 MB
Cross-attention KV (encoder output): 24 layers × 128 source tokens × 4 KB ≈ 12.6 MB

Total: ~25 MB per sequence (FP16)

For sentence-level translation these numbers are small: they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows.

The Full-MHA Tax: The Headline Insight

Now compare to Phi-3-mini-3.8B, which uses GQA (grouped-query attention) with 8 KV heads instead of 16:

Per token per layer: 1 × 8 × 64 × 2 × 2 = 2,048 bytes = 2 KB
32 layers × 2 KB = 64 KB per token
At 128-token context: 128 × 64 KB = 8,192 KB ≈ 8.4 MB (decoder-only, so there is no cross-attention cache to add)

So Phi-3-mini comes out about 3× smaller on KV footprint per token (64 KB against 192 KB): it halves the KV head count and, being decoder-only, carries no cross-attention cache on top. M2M-100 has full MHA and pays the bandwidth price.
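The comparison can be checked with a short sketch of the self-attention plus cross-attention formula. `AttnConfig` and `kv_total_bytes` are hypothetical names, and the second configuration is simply the GQA shape this section attributes to Phi-3-mini (32 layers, 8 KV heads, head_dim 64), not a value read from a model file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    layers: int            # layers that hold a KV cache (decoder layers)
    kv_heads: int          # KV heads per layer (equals query heads for full MHA)
    head_dim: int
    cross_attention: bool  # True for encoder-decoder models
    dtype_bytes: int = 2   # FP16


def kv_total_bytes(cfg: AttnConfig, t_dec: int, t_enc: int = 0) -> int:
    """KV_self + KV_cross for t_dec decoded tokens and t_enc source tokens."""
    per_position = 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * cfg.dtype_bytes
    kv_self = per_position * t_dec
    kv_cross = per_position * t_enc if cfg.cross_attention else 0
    return kv_self + kv_cross


m2m = AttnConfig(layers=24, kv_heads=16, head_dim=64, cross_attention=True)
gqa = AttnConfig(layers=32, kv_heads=8, head_dim=64, cross_attention=False)

m2m_total = kv_total_bytes(m2m, t_dec=128, t_enc=128)
gqa_total = kv_total_bytes(gqa, t_dec=128)

print(m2m_total, gqa_total, m2m_total / gqa_total)  # 25165824 8388608 3.0
```

Setting `t_enc=0` for the first configuration isolates the head-count effect from the cross-attention effect.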

The Attention Wall

The attention wall is simple to state: at some context length, the KV cache's bandwidth demand exceeds what the NPU can sustain. On Lunar Lake with 136.5 GB/s platform bandwidth, and given the 18% utilization we saw in Chapter 1.3, the effective bandwidth available to the NPU is roughly 136.5 × 0.18 ≈ 25 GB/s.

For M2M-100 decoder at FP16:

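192 KB per token (self-attention plus cross-attention across the 24 decoder layers)
At 6.10 tok/s: 192 KB × 6.10 ≈ 1.17 MB/s of new KV cache entries
Rereading the full cache on every step at a 128-token context: ≈ 25 MB × 6.10 ≈ 150 MB/s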

This is well below the 25 GB/s ceiling, so the M2M-100 KV cache isn't the bottleneck yet. The wall appears at much larger context lengths or larger models.
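To see how much headroom is left, the same estimate can be run at longer contexts. This is a rough sketch under stated assumptions: throughput is held at the 6.10 tok/s quoted above, every decode step rereads the whole cache, and source and target lengths grow together so the 192 KB per position figure still applies. The function name is hypothetical.

```python
PLATFORM_GB_S = 136.5            # platform bandwidth quoted in the text
UTILIZATION = 0.18               # utilization figure cited from Chapter 1.3
TOK_PER_S = 6.10                 # decode throughput quoted in the text
KV_BYTES_PER_POS = 192 * 1024    # self + cross attention, FP16


def kv_read_bytes_per_s(context_tokens: int, tok_per_s: float = TOK_PER_S) -> float:
    """KV traffic per second if each decode step rereads the whole cache."""
    return KV_BYTES_PER_POS * context_tokens * tok_per_s


effective = PLATFORM_GB_S * 1e9 * UTILIZATION  # bytes per second

for ctx in (128, 1024, 8192):
    share = kv_read_bytes_per_s(ctx) / effective
    print(f"{ctx:>5} tokens: {share:.2%} of effective bandwidth")
```

Real throughput falls as context grows, so the longer-context rows overstate traffic. The point is only the order of magnitude relative to the ceiling.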

The working hypothesis of this section is that the KV cache wall appears somewhere between 2K and 8K tokens for typical 8B models on Lunar Lake, depending on model architecture. Intel's validated 8K context "preview" on Lunar Lake is right at that edge. The wall doesn't mean you can't have 8K; it means you're committing to recompute, sliding windows, or multi-GPU distribution to keep latency acceptable.

Context Window vs. KV Cache

A critical distinction: context window is what the model can attend to; KV cache is what you must keep in memory.

For a decoder-only model like Llama 3 70B:

Context window: 8K tokens

KV cache for full context: 80 layers × 8 KV heads × 128 head_dim × 2 (K+V) × 2 bytes × 8K tokens ≈ 2.7 GB for a single sequence at FP16, on top of roughly 35 GB of weights even at INT4.

That doesn't fit on a single Lunar Lake. The roofline says: if you want 8K context with 70B, you compress the model (quantize), shard it (multi-GPU), or use a sliding window (throw away old context).

The honest summary for M2M-100: of the major KV optimizations (GQA, MQA, MLA, and KV cache quantization), only cache quantization is available without retraining. INT8 KV halves your bandwidth pressure and roughly doubles your effective context length before hitting the bandwidth wall. Use it.

The Bandwidth Wall, Quantified

Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB, ceiling 34 tok/s.
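The ceiling quoted here is bandwidth divided by bytes streamed per token. A minimal sketch follows; the function name and the optional KV term are illustrative assumptions, not part of any toolkit.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, weight_gb: float,
                             kv_read_gb: float = 0.0) -> float:
    """Upper bound on decode rate when each token streams weights plus KV once."""
    return bandwidth_gb_s / (weight_gb + kv_read_gb)


# 8B model at INT4 is about 4 GB of weights, on 136.5 GB/s shared bandwidth.
print(round(decode_ceiling_tok_per_s(136.5, 4.0), 1))  # 34.1
```

Passing a non-zero `kv_read_gb` shows how the ceiling drops once the cache is read alongside the weights on every step.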

For M2M-100 1.2B generating a long output, the per-token weight read is ~2.4 GB (the FP16 decoder weights), the per-token self-attention KV read grows from near-zero at token 1 to ~100 MB by token 1000, and the cross-attention KV is read in full every step. The effective bandwidth per token is dominated by weights for moderate contexts and only crosses over into a KV-dominated regime above several thousand tokens of decoded output. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.

Does Any of This Matter for Short-Context Translation?

For M2M-100 1.2B at 128 tokens, the KV cache is about 25 MB, which fits easily. At 2K tokens, it's about 400 MB (2K ÷ 128 × 25 MB). At 8K, it's 1.6 GB, still under the 4–8 GB weight budget, but now you're committing real DRAM.
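The scaling above runs in both directions: from context length to memory, and from a memory budget back to the longest context it allows. A minimal sketch, assuming the 192 KB per position figure from earlier in this section; the 512 MiB budget is a made-up example value.

```python
KV_BYTES_PER_POS = 192 * 1024  # self + cross attention, M2M-100 1.2B at FP16


def kv_cache_mb(context_tokens: int) -> float:
    """KV cache size in decimal MB for a given context length."""
    return KV_BYTES_PER_POS * context_tokens / 1e6


def max_context_for_budget(budget_bytes: int) -> int:
    """Longest context whose KV cache fits inside the given DRAM budget."""
    return budget_bytes // KV_BYTES_PER_POS


for ctx in (128, 2048, 8192):
    print(ctx, round(kv_cache_mb(ctx)))  # 25, 403, 1611

# Hypothetical budget: 512 MiB of DRAM set aside for KV state.
print(max_context_for_budget(512 * 1024**2))  # 2730
```

Budgeting from the DRAM side is usually the more useful direction for an agent loop, since the context limit then falls out of a number you control.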

The practical implication: the agent's working-memory window (what it can see in a single prompt) is bounded by KV cache size, not by model capability. An 8B model trained on 8K context can't actually use that context on NPU if the KV cache doesn't fit.

Implications for Agent Design

Three consequences flow from this:

1. Bounded context is a feature, not a limitation. If your agent loops (agent thinks → acts → observes), and the context window is fixed at, say, 1K tokens, then the agent's working memory is fixed. Every observation older than 1K tokens falls off the window. This forces a design choice: either the agent uses only recent observations (myopic), or long-term memory lives outside the model in a vector store or database (Chapter 2.3).

2. KV cache reuse is precious. In the M2M-100 pattern (encoder-decoder), the encoder is computed once; the KV cache is reused throughout decode. In a chatbot where the user query is short but the response is long, this is efficient. In a long-conversation scenario where both sides grow, every new user message requires a re-encode. This is why copy-on-write KV cache techniques (keeping separate buffers for user messages that don't change) matter.

3. The sliding-window technique (Phi Silica's N=64 approach from Chapter 1.3) is a deliberate trade: throw away the oldest tokens' KVs to free DRAM, then recompute them if you need to backtrack. On NPU where compute is cheaper than bandwidth (relatively speaking), this is a valid trade. On GPU where compute is expensive relative to DRAM, it usually isn't.
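The eviction half of the sliding-window trade can be sketched with a bounded buffer. This is an illustration only, not how Phi Silica or any runtime implements it. `SlidingWindowKV` is a hypothetical class, the window of 64 mirrors the N=64 mentioned above, and the stored entries are placeholders standing in for per-position K/V tensors.

```python
from collections import deque


class SlidingWindowKV:
    """Keeps KV entries for only the most recent `window` token positions."""

    def __init__(self, window: int):
        self.window = window
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self._entries = deque(maxlen=window)
        self.evicted = 0  # count of dropped positions (would need recompute)

    def append(self, position: int, kv) -> None:
        if len(self._entries) == self.window:
            self.evicted += 1
        self._entries.append((position, kv))

    def positions(self) -> list[int]:
        return [pos for pos, _ in self._entries]


cache = SlidingWindowKV(window=64)
for pos in range(100):
    cache.append(pos, kv=None)  # placeholder for the real K/V tensors

print(len(cache.positions()), cache.positions()[0], cache.evicted)  # 64 36 36
```

Memory stays constant however long the loop runs. The cost shows up in `evicted`: each of those positions has to be recomputed if the agent ever needs to look back that far.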

How Intel's "8K Validated Preview" Works

Intel's announcement that Lunar Lake supports "8K context" (Chapter 1.2's static-shape discussion) is narrowly true: the compiler can emit a static-shape graph for 8K, and it runs without crashing. What's not guaranteed is latency.

The 8K window likely uses chunked prefill (process 1K chunks at a time) and either sliding-window KV for decode or hybrid compute-cache layering (let the CPU assist with KV management). The "preview" designation means it's not validated for production; the team is still characterizing it.

For agent design, treat 8K as the ceiling, not the target. A 1K–2K working memory is reliable; 4K–8K requires careful modeling and testing; beyond 8K requires either multi-GPU or architectural workarounds.

What This Section Bought You

You should now understand:

KV cache footprint scales with [seq_len, num_heads, head_dim, layers, dtype]; M2M-100 1.2B at 128 tokens is ~25 MB
Full MHA (M2M-100) vs. GQA (Phi-3-mini) creates a 3× KV bandwidth difference: attention architecture is destiny
The attention wall appears at 2K–8K tokens on Lunar Lake depending on model size
KV cache growth is the per-token latency problem; context window is the per-prompt problem
Encoder KV reuse (encoder-decoder models) is a structural advantage
Sliding-window KV trades compute for bandwidth, a valid move on NPU
8K context on Lunar Lake is validated-preview, not production; design for 1K–2K working memory
Long-term memory for the agent lives outside the model: in SQLite, vector stores, or filesystems

The next section turns to the agent's reasoning loop: given bounded context and bounded KV cache, what patterns actually work for multi-step agents?
