
2.1 Context Windows and the Memory Wall

Chapter 1 established the constraints. This chapter is about working inside them. The agent's state, meaning what it remembers from past steps and what it uses to make the next decision, is the bridge between hardware constraints and agent behavior. This section is about the memory wall: why it exists, what it means in numbers, and how to budget for it in the agent loop.

The two key state mechanisms are the KV cache (the prefill and decode phases' attention memory) and the context window (the prompt that feeds the next prefill). They're distinct costs with different scaling properties, and conflating them is a common design mistake.

The most important number an agent designer keeps in their head is how much memory the KV cache eats per token, because that number multiplied by your context length is the wall you hit. On Intel NPU with M2M-100, that wall has a specific shape, and it's set by an architectural choice Meta made in 2020 that no amount of clever serving can paper over: M2M-100 uses full multi-head attention with no GQA.

The KV Cache Formula and Its Footprint

The KV (key-value) cache is the core optimization of autoregressive LLM inference: instead of recomputing the attention keys and values for every token position on every decoding step, you compute them once and keep them in memory. On the second token, you use the KV from token 1 plus the new KV for token 2. On the third token, you use KVs from tokens 1–2 plus the new one. This is why a decode step is so much cheaper than re-running the whole prefix: you're amortizing the work.

The KV cache lives in DRAM and is dimensioned by [batch_size, num_heads, seq_len, head_dim]. For a typical transformer:

• batch_size: 1 on NPU (Chapter 1.3)
• num_heads: 16 (common)
• seq_len: grows from 1 to context_length as you decode
• head_dim: 64 (common)

Per-token, per-layer KV cache footprint = batch × num_heads × head_dim × 2 (K + V) × dtype_bytes.

For M2M-100 1.2B (16 heads, 64 head_dim) at FP16 (2 bytes):

• Per token per layer: 1 × 16 × 64 × 2 × 2 = 4,096 bytes = 4 KB
• M2M-100 1.2B has 24 decoder layers, each keeping a self-attention cache and a cross-attention cache: 48 × 4 KB = 192 KB per token
• At 128-token context: 128 tokens × 192 KB = 24,576 KB ≈ 25.2 MB per inference batch

The 48 is there because M2M-100 is encoder-decoder, so there's a second KV cache built from the encoder output, which the decoder's cross-attention reads at every step. For an encoder-decoder transformer like M2M-100, the per-step decoder state contains two KV caches: self-attention (the decoder attending to its own previous tokens) and cross-attention (the decoder attending to the encoder output). Both contribute, and for M2M-100 specifically both are the same per-layer size, because the model doesn't compress KV heads.

The formula is:

    KV_self  = 2 · L_dec · n_heads · head_dim · T_dec · sizeof(dtype)
    KV_cross = 2 · L_dec · n_heads · head_dim · T_enc · sizeof(dtype)
    Total    = KV_self + KV_cross

The factor of 2 is for K and V tensors. L_dec is the number of decoder layers. The cross-attention KV is computed once over the full encoder output (length T_enc), during prefill, and reused on every decoder step, so it doesn't grow with seq_len; at T_dec = T_enc it is identical in size to the decoder's self-attention KV. Self-attention KV grows with T_dec as we generate.

M2M-100 KV Footprint, Specific Numbers

The configurations come straight from the HuggingFace model cards:

• M2M-100 418M: 12 encoder layers, 12 decoder layers, 16 attention heads, head_dim 64 (embed_dim 1024 / 16 heads).
• M2M-100 1.2B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 64.
• M2M-100 12B: 24 encoder layers, 24 decoder layers, 16 attention heads, head_dim 256 (embed_dim 4096 / 16 heads).

      At T_enc = T_dec = 128 (a sentence-level translation working point):

M2M-100 418M

    Precision   Self-attn KV   Cross-attn KV   Total
    FP16        6.29 MB        6.29 MB         12.58 MB
    INT8 KV     3.15 MB        3.15 MB         6.29 MB

M2M-100 1.2B

    Precision   Self-attn KV   Cross-attn KV   Total
    FP16        12.58 MB       12.58 MB        25.17 MB
    INT8 KV     6.29 MB        6.29 MB         12.58 MB

      M2M-100 12B, same shape: roughly 96 MiB FP16 (head_dim balloons to 256, which is the dominant scaling factor).

      For sentence-level translation these numbers are small — they sit comfortably in DRAM next to ~840 MB of FP16 weights for the 418M model. The KV cache is not the bottleneck for short translation. Where it bites is when context grows: at T_enc = T_dec = 1024 the 1.2B model's KV state crosses 200 MB at FP16, and the cross-attention component dominates because translating long source documents keeps that full encoder output live in memory the entire time.
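The table values can be reproduced with a few lines of Python. This is a minimal sketch of the KV_self / KV_cross formula from this section; the function name `kv_footprint_bytes` and the `CONFIGS` dictionary are illustrative names, not part of any library, and the layer and head counts are the ones quoted from the model cards.

```python
# Sketch of the encoder-decoder KV footprint formula from this section.
# Names here are illustrative; nothing below calls a real inference library.

def kv_footprint_bytes(l_dec, n_heads, head_dim, t_dec, t_enc, dtype_bytes=2):
    """Return (self_attn, cross_attn, total) KV cache sizes in bytes."""
    per_position = 2 * l_dec * n_heads * head_dim * dtype_bytes  # K and V
    kv_self = per_position * t_dec    # grows as the decoder generates
    kv_cross = per_position * t_enc   # fixed once the encoder has run
    return kv_self, kv_cross, kv_self + kv_cross

# (decoder layers, attention heads, head_dim) as quoted in the text.
CONFIGS = {
    "M2M-100 418M": (12, 16, 64),
    "M2M-100 1.2B": (24, 16, 64),
    "M2M-100 12B": (24, 16, 256),
}

if __name__ == "__main__":
    for name, (l_dec, n_heads, head_dim) in CONFIGS.items():
        for label, dtype_bytes in (("FP16", 2), ("INT8 KV", 1)):
            kv_self, kv_cross, total = kv_footprint_bytes(
                l_dec, n_heads, head_dim, t_dec=128, t_enc=128,
                dtype_bytes=dtype_bytes)
            print(f"{name:13s} {label:8s} self={kv_self / 1e6:6.2f} MB "
                  f"cross={kv_cross / 1e6:6.2f} MB total={total / 1e6:6.2f} MB")
```

Calling the same function with `t_dec=1024, t_enc=1024` for the 1.2B configuration gives about 201 MB at FP16, which is the "crosses 200 MB" figure quoted for longer contexts.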

      The Full-MHA Tax — The Headline Insight

      Here's the comparison that should be the takeaway from this chapter:

Full M2M-100 1.2B decoder KV footprint at T = 128 token context and encoder L = 128:

• Self-attention KV: 24 layers × 128 tokens × 4 KB = 12,288 KB ≈ 12.6 MB
• Cross-attention KV (encoder output): 24 layers × 128 source tokens × 4 KB = 12,288 KB ≈ 12.6 MB
• Total: ~25 MB per sequence (FP16)

Now compare to Phi-3-mini-3.8B, which uses GQA (grouped-query attention) with 8 KV heads where M2M-100 has 16:

• Per token per layer: 1 × 8 × 96 × 2 × 2 = 3,072 bytes = 3 KB
• 32 layers × 3 KB = 96 KB per token
• At 128-token context: 128 × 96 KB = 12,288 KB ≈ 12.6 MB (with no encoder cache on top)

Reduced to per-token decoder self-attention KV bytes:

• M2M-100 1.2B at FP16: 2 · 24 · 16 · 64 · 2 = 98,304 bytes/token
• Phi-3-mini-3.8B with GQA-8 at FP16: 2 · 32 · 8 · 96 · 2 = 98,304 bytes/token
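The per-token comparison can be checked in a few lines of Python. This sketch assumes the attention shapes used in this section (16 KV heads of dimension 64 over 24 decoder layers for M2M-100 1.2B; 8 KV heads of dimension 96 over 32 layers for Phi-3-mini); the helper `per_token_kv_bytes` is a hypothetical name introduced only for illustration.

```python
def per_token_kv_bytes(n_layers, n_kv_heads, head_dim, dtype_bytes=2,
                       caches_per_layer=1):
    """Bytes of K and V stored per token position across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * caches_per_layer

# Decoder self-attention only.
m2m_self = per_token_kv_bytes(24, 16, 64)
phi3 = per_token_kv_bytes(32, 8, 96)

# M2M-100 with its cross-attention cache counted as well (T_enc = T_dec).
m2m_self_and_cross = per_token_kv_bytes(24, 16, 64, caches_per_layer=2)

print(m2m_self, phi3, m2m_self_and_cross)
```

The first two values match; the third is M2M-100's per-position cost once the cross-attention cache is included, which a decoder-only model never pays.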

These are identical to the byte.

A 1.2-billion-parameter encoder-decoder translation model from 2020 has the same per-token decoder self-attention KV footprint as a modern 3.8-billion-parameter decoder-only LLM, because Phi-3 uses Grouped Query Attention with one-quarter the KV heads. And M2M-100 carries cross-attention KV at the same per-layer cost on top, which Phi-3 does not have at all. Phi-3-mini saves on KV per token because it runs 8 KV heads where M2M-100 runs 16; M2M-100 has full MHA and pays the bandwidth price.

The Attention Wall

The attention wall is simple to state: at some context length, the KV cache's bandwidth demand exceeds what the NPU can sustain. On Lunar Lake with 136.5 GB/s platform bandwidth, and given the 18% utilization we saw in Chapter 1.3, the per-NPU effective bandwidth is roughly 136.5 × 0.18 ≈ 25 GB/s.

For M2M-100 1.2B decoder at FP16:

• 192 KB per token (self + cross attention across 24 decoder layers)
• At 6.10 tok/s: 192 KB × 6.10 ≈ 1.17 MB/s of new KV written; re-reading the full ~25 MB cache at 128 tokens on every step is about 25 MB × 6.10 ≈ 154 MB/s

This is well below the 25 GB/s ceiling, so the M2M-100 KV cache isn't the bottleneck yet. The wall appears at much larger context lengths or larger models.

The working hypothesis in this section is that the KV cache wall appears somewhere between 2K and 8K tokens for typical 8B models on Lunar Lake, depending on model architecture. Intel's validated 8K context "preview" on Lunar Lake is right at that edge. The wall doesn't mean you can't have 8K; it means you're committing to recompute, sliding windows, or multi-GPU distribution to stay above a latency floor.

The architectural conclusion is direct: the KV cache wall is set by attention design, not parameter count. Phi-3 deploys to NPU comfortably at 4K context. M2M-100 1.2B at 1K context exerts the same per-token KV bandwidth pressure on the LPDDR5X bus.

This is what we mean when we say M2M-100 is "expensive per parameter": not in FLOPs or weight memory, but in the bandwidth its decoder consumes per generated token. The fix is GQA. The fix requires retraining. Nobody has retrained M2M-100 with GQA. So we live with it.

Context Window vs. KV Cache

A critical distinction: context window is what the model can attend to; KV cache is what you must keep in memory.

For a decoder-only model like Llama 2 70B (80 layers, 8 KV heads under GQA, head_dim 128):

• Target context window: 8K tokens
• KV cache for full context: 2 (K+V) × 80 layers × 8 KV heads × 128 head_dim × 2 bytes × 8,192 tokens ≈ 2.7 GB for a single sequence at FP16 (about 21 GB if all 64 query heads kept their own K and V)

Add that to roughly 35 GB of weights even at INT4, and it doesn't fit on a single Lunar Lake. The roofline says: if you want 8K context with 70B, you compress the model (quantize), shard it (multi-GPU), or use a sliding window (throw away old context).

Modern Attention Optimizations and Why M2M-100 Doesn't Get Them

The KV cache footprint has driven roughly five years of architectural innovation in decoder-only LLMs, and M2M-100 predates all of it:

• Grouped Query Attention (GQA) shares K and V across groups of query heads, typically 4 or 8 query heads per KV head. Llama 2 70B, Llama 3, Phi-3 use GQA. Scales KV size by a factor of n_kv / n_heads. M2M-100 has no GQA.
• Multi-Query Attention (MQA) is GQA's extreme: one KV head shared by all queries. Falcon-7B uses MQA. M2M-100 has no MQA.
• Multi-head Latent Attention (MLA) compresses K and V into a low-rank latent space, decompressing only at attention time. DeepSeek-V2 and V3 use MLA. M2M-100 has no MLA.
• KV cache quantization drops the cache from FP16 to INT8 (or below). Halves bandwidth at modest quality cost. Works on any model. This is the lever you can pull for M2M-100.

          The honest summary: of the four major KV optimizations, only the last one — cache quantization — is available to M2M-100. INT8 KV halves your bandwidth pressure and roughly doubles your effective context length before hitting the bandwidth wall. Use it.
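To make the "roughly doubles your effective context length" claim concrete, here is a small sketch that solves the footprint formula for the longest context that fits a given KV budget. The 200 MB budget is an arbitrary illustrative figure rather than a measured limit, `max_context` is a hypothetical helper, and the sketch assumes T_enc = T_dec so both caches grow together.

```python
def max_context(budget_bytes, l_dec, n_kv_heads, head_dim, dtype_bytes,
                caches_per_layer=2):
    """Largest T (with T_enc = T_dec = T) whose KV state fits in budget_bytes."""
    per_position = (2 * l_dec * n_kv_heads * head_dim
                    * dtype_bytes * caches_per_layer)
    return budget_bytes // per_position

BUDGET = 200_000_000  # illustrative KV budget in bytes, not a measured limit

# M2M-100 1.2B shape: 24 decoder layers, 16 KV heads, head_dim 64.
fp16 = max_context(BUDGET, l_dec=24, n_kv_heads=16, head_dim=64, dtype_bytes=2)
int8 = max_context(BUDGET, l_dec=24, n_kv_heads=16, head_dim=64, dtype_bytes=1)

# Hypothetical GQA retrain with 4 KV heads, shown only for comparison.
gqa4 = max_context(BUDGET, l_dec=24, n_kv_heads=4, head_dim=64, dtype_bytes=2)

print(fp16, int8, gqa4)
```

INT8 KV fits twice the context of FP16 in the same budget; the hypothetical GQA variant would fit four times as much, which is the headroom M2M-100 cannot reach without retraining.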

          The Bandwidth Wall, Quantified

          Combine this section with Chapter 1.3's ceiling. Lunar Lake's LPDDR5X-8533 delivers 136.5 GB/s shared. For decode at sustained throughput, every weight has to be streamed every token. For an 8B INT4 model that's 4 GB, ceiling 34 tok/s.
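The ceiling arithmetic is just bandwidth divided by bytes moved per token. A minimal sketch, using the 136.5 GB/s figure quoted above; `decode_ceiling_tok_s` is an illustrative name, and the optional KV term (with its 1 GB example value) is an assumption about how to extend the same bound, not a measured model.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weight_gb, kv_read_gb=0.0):
    """Upper bound on decode tokens/s when every token streams the weights
    (and optionally the live KV cache) across the memory bus once."""
    return bandwidth_gb_s / (weight_gb + kv_read_gb)

LUNAR_LAKE_GB_S = 136.5  # shared LPDDR5X-8533 bandwidth quoted in the text

weights_only = decode_ceiling_tok_s(LUNAR_LAKE_GB_S, weight_gb=4.0)
# Illustrative only: the same model with 1 GB of live KV read per token.
with_kv = decode_ceiling_tok_s(LUNAR_LAKE_GB_S, weight_gb=4.0, kv_read_gb=1.0)

print(round(weights_only, 1), round(with_kv, 1))
```

The bound is optimistic by construction: it assumes the full shared bandwidth is available to decode, which the utilization figures from Chapter 1.3 suggest is not the case in practice.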

The KV cache adds to this. For M2M-100 1.2B at FP16 generating a long output, the per-token weight read is ~2.4 GB (the FP16 decoder weights), the per-token KV read grows from near-zero at token 1 to ~100 MB by token 1000, and the cross-attention KV is read in full every step. The effective bandwidth-per-token is dominated by weights for moderate contexts and only crosses over into a KV-dominated regime above several thousand tokens of decoded output. For sentence-level translation this never matters. For document-level translation it sets the upper bound on practical context.

For M2M-100 1.2B at 128 tokens, KV cache is 25 MB, which fits easily. At 2K tokens, it's about 400 MB (2K ÷ 128 × 25 MB). At 8K, it's 1.6 GB, still under the 4–8 GB weight budget, but now you're committing real DRAM.

The practical implication: the agent's working-memory window (what it can see in a single prompt) is bounded by KV cache size, not by model capability. An 8B model trained on 8K context can't actually use that context on NPU if the KV cache doesn't fit.

Does Any of This Matter for Short-Context Translation?

For a single English-to-French sentence (T_enc ≈ 32, T_dec ≈ 32), M2M-100 418M has about 3 MB of KV state in FP16, completely negligible against 840 MB of FP16 weights. The KV cache is not the bottleneck for M2M-100_418M on short inputs; weight memory is.

So why discuss it? Three reasons:

• Longer documents matter. Paragraph-level translation at T = 512 puts you in the regime where KV cache starts to compete with weight memory for bandwidth. Document-level translation at T = 2048 is firmly KV-dominated. Many real translation workloads are not single sentences.
• The 12B variant matters. Total KV reaches roughly 96 MiB at T=128 on the 12B model, and the model already strains consumer NPU memory at INT4. KV is the difference between fitting and not.
• The principle generalizes. Every other encoder-decoder seq2seq model (NLLB-200, MarianMT, FLAN-T5, Whisper) has the same architectural problem and the same lack of GQA. If you take one thing from this chapter, it's that modern attention optimizations don't apply to 2020-era seq2seq, and you need to plan around it.

Implications for Agent Design

Three consequences flow from the memory wall:

1. Bounded context is a feature, not a limitation. If your agent loops (agent thinks, acts, observes), and the context window is fixed at, say, 1K tokens, then the agent's working memory is fixed. Every observation older than 1K tokens falls off the window. This forces a design choice: either the agent uses only recent observations (myopic), or long-term memory lives outside the model in a vector store or database (Chapter 2.3).

2. KV cache reuse is precious. In the M2M-100 pattern (encoder-decoder), the encoder is computed once; the KV cache is reused throughout decode. In a chatbot where the user query is short but the response is long, this is efficient. In a long-conversation scenario where both sides grow, every new user message requires a re-encode. This is why copy-on-write KV cache techniques (keeping separate buffers for user messages that don't change) matter.

3. The sliding-window technique (Phi Silica's N=64 approach from Chapter 1.3) is a deliberate trade: throw away the oldest tokens' KVs to free DRAM, then recompute them if you need to backtrack. On NPU where compute is cheaper than bandwidth (relatively speaking), this is a valid trade. On GPU where compute is expensive relative to DRAM, it usually isn't.
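A minimal sketch of the first and third consequences at the prompt level: an agent-side context buffer with a fixed token budget that evicts the oldest observations and hands them to external storage. Everything here is hypothetical scaffolding (`BoundedContext`, the whitespace token count, the `archive` list standing in for a vector store or database); a real agent would use the model's tokenizer and a persistent store, and this does not touch the runtime's KV cache itself.

```python
from collections import deque


class BoundedContext:
    """Keeps the most recent observations within a fixed token budget."""

    def __init__(self, max_tokens=1024):
        self.max_tokens = max_tokens
        self.window = deque()   # (text, n_tokens) pairs, oldest first
        self.used = 0
        self.archive = []       # stand-in for long-term memory outside the model

    @staticmethod
    def count_tokens(text):
        # Crude whitespace count; an assumption, not a real tokenizer.
        return len(text.split())

    def add(self, text):
        n = self.count_tokens(text)
        self.window.append((text, n))
        self.used += n
        # Evict oldest entries until the window fits the budget again.
        while self.used > self.max_tokens and len(self.window) > 1:
            old_text, old_n = self.window.popleft()
            self.used -= old_n
            self.archive.append(old_text)

    def prompt(self):
        return "\n".join(text for text, _ in self.window)


ctx = BoundedContext(max_tokens=8)
for step in ["observe door is locked", "act try the key", "observe key fits"]:
    ctx.add(step)

print(ctx.prompt())
print(ctx.archive)
```

With an 8-token budget the first observation is pushed into `archive` once the third arrives, so the prompt only ever contains what fits the window while nothing is lost outright.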

          How Intel's "8K Validated Preview" Works

          Intel's announcement that Lunar Lake supports "8K context" (Chapter 1.2's static-shape discussion) is narrowly true: the compiler can emit a static-shape graph for 8K, and it runs without crashing. What's not guaranteed is latency.

          The 8K window likely uses chunked prefill (process 1K chunks at a time) and either sliding-window KV for decode or hybrid compute-cache layering (let the CPU assist with KV management). The "preview" designation means it's not validated for production; the team is still characterizing it.
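Chunked prefill, as guessed at above, amounts to feeding the prompt through a fixed-shape graph one slice at a time. The sketch below only shows the slicing; `chunk_prompt` and the 1,024-token chunk size are assumptions for illustration and do not describe Intel's actual implementation.

```python
def chunk_prompt(token_ids, chunk_size=1024):
    """Yield (start_offset, chunk) slices of at most chunk_size tokens."""
    for start in range(0, len(token_ids), chunk_size):
        yield start, token_ids[start:start + chunk_size]


prompt = list(range(2500))  # stand-in for real token ids
for start, chunk in chunk_prompt(prompt):
    # A real runtime would run prefill on `chunk` here and append its
    # keys and values to the cache at positions start..start+len(chunk)-1.
    print(start, len(chunk))
```

With a static-shape graph the final short chunk would have to be padded up to the chunk size, which is one reason the latency of the last slice does not shrink with its length.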

          For agent design, treat 8K as the ceiling, not the target. A 1K–2K working memory is reliable; 4K–8K requires careful modeling and testing; beyond 8K requires either multi-GPU or architectural workarounds.

          What This Section Bought You

          You should now understand:

• The KV cache formula for encoder-decoder models: self-attention plus cross-attention, both scaling with layers, heads, head_dim, sequence length, and dtype
• M2M-100 1.2B at 128 tokens is ~25 MB
• Full MHA (M2M-100) vs. GQA (Phi-3-mini): attention architecture is destiny. M2M-100 1.2B has the same per-token decoder self-attention KV as Phi-3-mini-3.8B despite being a third the parameter count, because Phi-3 uses GQA
• The attention wall appears at 2K–8K tokens on Lunar Lake depending on model size
• KV cache wall is set by attention design, not parameter count, and M2M-100's full MHA puts it permanently on the wrong side of the wall
• Only KV cache quantization is available as a lever for M2M-100; the modern optimizations (GQA, MQA, MLA) require retraining
• For short-context translation the KV cache is negligible vs weight memory; for long-context translation it dominates
• Cross-attention KV is the M2M-100-specific cost that adds to (not replaces) self-attention KV every step
• KV growth is the per-token latency problem; context window is the per-prompt problem
• Encoder KV reuse (encoder-decoder models) is a structural advantage
• Sliding-window KV trades compute for bandwidth, a valid move on NPU
• 8K context on Lunar Lake is validated-preview, not production; design for 1K–2K working memory
• Long-term memory for the agent lives outside the model: in SQLite, vector stores, or filesystems

The next section moves from theory to engineering: how do you actually manage KV cache on Intel NPU through OpenVINO GenAI, what does the prefix-caching / chunked-prefill / static-shape stack do for you (and to you), and, given bounded context and bounded KV cache, what patterns actually work for multi-step agents?

