2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

The theory in 2.1 said the KV cache wall is real but binds at long context. This section is about the engineering: how OpenVINO GenAI actually manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly-named caching mechanisms applies where. The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that a lot of production code is misconfigured because the author conflated two layers of the cache stack.

The Three Caching Mechanisms (and Why They're Confused)

Before going further, untangle these three things, which all live in OpenVINO and all sound similar:

Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.

KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat().

Prefix caching (cross-request reuse of the prefill phase). When two requests start with the same prompt prefix, the KV state from that prefix is reused. The --enable_prefix_caching flag in OVMS controls this — but it means different things on different devices, which we'll get to.

These are three independent layers. You can have any combination on or off. Production NPU agents typically want all three on.

What `LLMPipeline.start_chat()` Does

From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables — state tensors hidden from the IR I/O surface — and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")
pipe.start_chat()
print(pipe.generate("Hello!", max_new_tokens=50))    # full prefill
print(pipe.generate("And what is your name?",        # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()
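The suffix-only prefill described above can be pictured with a toy model. The sketch below is an illustration under stated assumptions, not the OpenVINO GenAI implementation: the helper name divergent_suffix and the plain integer token lists are hypothetical, and the real align_kv_cache_and_history() works on the pipeline's internal state. It shows only the core idea of comparing the cached token history with the new tokenized sequence and resubmitting what differs.

```python
def divergent_suffix(cached_tokens, new_tokens):
    """Return (keep, suffix): how many cached positions stay valid, and
    the tokens that still have to go through the model.

    Hypothetical helper for illustration; not part of openvino_genai.
    """
    keep = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        keep += 1
    return keep, new_tokens[keep:]


# Turn 1 left six tokens in the cache; turn 2 re-sends them plus three new ones.
history = [101, 7, 8, 9, 42, 102]
turn_2 = history + [55, 56, 57]
keep, suffix = divergent_suffix(history, turn_2)
print(keep, suffix)  # 6 [55, 56, 57]
```

If the new sequence diverges earlier than the end of the cached history, keep drops and more tokens are resubmitted, which is why a stable chat template matters for this kind of reuse.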

On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.

The NPU LLM constraint nobody documents loudly enough: greedy decoding only on the classic static NPU pipeline. No beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT — going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.

Prefix Caching — The Version-Stamped Reality

The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.

On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.

On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.

The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer — but the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.
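One way to keep the two layers apart is to write the mapping down as a lookup. The sketch below is a mental-model aid only: prefix_cache_mechanism is a hypothetical helper that encodes the statements in this section, it does not query OVMS or the NPU plugin, and the behavior it reports for NPU before 2025.4 is an assumption drawn from the fact that the plugin property was added in that release.

```python
def prefix_cache_mechanism(target_device, flag_enabled, ov_version=(2025, 4)):
    """Describe which layer acts on --enable_prefix_caching.

    Hypothetical helper that restates the prose; it inspects nothing real.
    """
    if not flag_enabled:
        return "none"
    device = target_device.upper()
    if device in ("CPU", "GPU"):
        # Handled by the OVMS scheduler, sized by cache_size.
        return "OVMS continuous-batching prefix cache (PagedAttention)"
    if device == "NPU":
        if ov_version >= (2025, 4):
            # The scheduler ignores the flag; it is forwarded to the plugin.
            return "plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES"
        # Assumption: no plugin property to forward to before 2025.4.
        return "none (flag ignored on Stateful servables)"
    raise ValueError(f"unknown target device: {target_device}")


for device in ("CPU", "GPU", "NPU"):
    print(device, "->", prefix_cache_mechanism(device, True))
```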

Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.

Chunked Prefill and the 8K "Ceiling"

OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
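The alignment constraint is simple arithmetic, sketched below. This is not the code from the PR: align_to_chunk is a hypothetical helper, and rounding up (rather than down) is an assumption, since the text above only says the two values must end up aligned.

```python
def align_to_chunk(max_prompt_len, chunk_size=1024):
    """Round max_prompt_len up to the nearest multiple of chunk_size.

    Returns (aligned_length, number_of_prefill_chunks).
    Hypothetical helper; rounding up is an assumption.
    """
    if max_prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("lengths must be positive")
    chunks = -(-max_prompt_len // chunk_size)  # ceiling division
    return chunks * chunk_size, chunks


for requested in (1024, 3000, 8192):
    aligned, chunks = align_to_chunk(requested)
    print(requested, "->", aligned, f"({chunks} prefill chunks)")
# 1024 -> 1024 (1 prefill chunks)
# 3000 -> 3072 (3 prefill chunks)
# 8192 -> 8192 (8 prefill chunks)
```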

The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.

The properties to remember, with defaults:

Property	Default	Effect
MAX_PROMPT_LEN	1024	Max input prompt tokens on static-shape NPU pipeline
MIN_RESPONSE_LEN	128 (was 150 pre-2025.3)	Min new tokens reserved
NPUW_LLM_PREFILL_CHUNK_SIZE	1024	Granularity of chunked prefill
PREFILL_HINT	DYNAMIC (since 2025.3)	STATIC to revert to old behavior
GENERATE_HINT	FAST_COMPILE	BEST_PERF for runtime perf at compile cost
NPUW_LLM_ENABLE_PREFIX_CACHING	NO	Enables NPU prefix cache (2025.4+)
CACHE_DIR	unset	Strongly recommended on NPU
PERFORMANCE_HINT	LATENCY	THROUGHPUT allows up to 4 concurrent infer requests
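The properties above can be collected into one place so that overrides are explicit and typos fail early. The sketch below is a convenience pattern, not an OpenVINO API: NPU_LLM_DEFAULTS and build_npu_config are hypothetical names, the values are copied from the defaults listed in this section, and the commented-out constructor call assumes the same keyword style as the start_chat() example earlier.

```python
# Defaults as listed in this section; CACHE_DIR has no default and is added below.
NPU_LLM_DEFAULTS = {
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 128,
    "NPUW_LLM_PREFILL_CHUNK_SIZE": 1024,
    "PREFILL_HINT": "DYNAMIC",
    "GENERATE_HINT": "FAST_COMPILE",
    "NPUW_LLM_ENABLE_PREFIX_CACHING": "NO",
    "PERFORMANCE_HINT": "LATENCY",
}


def build_npu_config(cache_dir=".ovcache", **overrides):
    """Merge overrides into the defaults and reject unknown property names.

    Hypothetical helper; it only builds a dict and never touches the runtime.
    """
    unknown = set(overrides) - set(NPU_LLM_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    config = dict(NPU_LLM_DEFAULTS, **overrides)
    config["CACHE_DIR"] = cache_dir  # unset by default, recommended on NPU
    return config


config = build_npu_config(MAX_PROMPT_LEN=4096,
                          NPUW_LLM_ENABLE_PREFIX_CACHING="YES")
for name, value in config.items():
    print(f"{name} = {value}")

# Assumed usage, mirroring the earlier example:
# pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU", **config)
```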

The "Sequential Execution" Claim — What It Actually Means

OVMS docs state: "OpenVINO Model Server with NPU acceleration process the requests sequentially. For that reason, benchmarking should be performed in max_concurrency set to 1." This is not an NPU hardware limit. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. PR #27875 even added an opt-in property NPU_RUN_INFERENCES_SEQUENTIALLY defaulting to false; the existence of an opt-in to force sequential execution shows that the default is parallel-capable.

The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC latency workloads. For direct OpenVINO Runtime use (not OVMS), multiple InferRequest objects can be submitted async, and tile-level parallelism (ov::intel_npu::tiles) is real. The book distinguishes these layers because conflating them produces wrong mental models, and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.

Eviction Policies: Mostly Not Your Problem on an NPU

The standard production LLM concerns about KV cache eviction (LRU, LFU, FIFO, importance-weighted retention) apply to PagedAttention-based serving on CPU/GPU. On NPU's StaticLLMPipeline, the cache is per-request and bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to evict from. Eviction happens when the session ends (finish_chat() or process exit).

This simplifies a lot. It also means you can't share KV cache across users on NPU the way you would on a server-side GPU deployment. Single-user AI-PC workloads are the design center.
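Because the cache is bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN, its worst-case size can be estimated before compiling anything. The sketch below uses the usual transformer KV arithmetic (keys plus values, per layer, per KV head); kv_cache_bytes is a hypothetical helper, the model dimensions are placeholders rather than any specific model's, and the result ignores padding, KV quantization, and runtime-specific layout.

```python
def kv_cache_bytes(max_prompt_len, min_response_len, num_layers,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Upper bound on one session's KV cache, in bytes.

    The factor 2 covers keys and values; the token budget is the static
    MAX_PROMPT_LEN + MIN_RESPONSE_LEN envelope. Hypothetical helper.
    """
    tokens = max_prompt_len + min_response_len
    return 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_value


# Placeholder dimensions for illustration, FP16 values (2 bytes each).
size = kv_cache_bytes(1024, 128, num_layers=32, num_kv_heads=32, head_dim=96)
print(f"{size / 2**20:.0f} MiB")  # 432 MiB
```

The same function with MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128 gives a quick sense of how small the envelope becomes for short-context workloads.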

Context Management for M2M-100 Translation

For sentence-level English-to-French translation, encoder input rarely exceeds 64 tokens and decoder output rarely exceeds 96. With MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128, the entire context budget fits comfortably under any NPU's static-shape envelope, on any generation. Chunked prefill, prefix caching, the 8K ceiling: none of it matters for sentence-level MT.

The discussion belongs in the chapter because:

M2M-100 generalizes to document-level translation at T = 1K–2K, where the constraints start to bite

The constraints transfer directly to other agentic seq2seq workloads (summarization, retrieval-augmented translation, ASR-translation pipelines) where context grows

The OpenVINO version compatibility and configuration story applies to every NPU-served LLM, not just translation. If you also run Phi-3.5-mini for an "explain this translation" tool (the worked example in Chapter 5.2), all of the above applies

For the worked example, the practical answer is "configure for short context, don't overthink it." For your real agent, the practical answer might be different, and now you have the levers.

When You Have Multiple Agents in One Process

If your application runs more than one agent, say a small classifier on the cascade pattern plus a larger reasoning model, they compete for the same NPU memory, and cache management becomes multi-tenant.

Most edge SoCs can only run one large model at a time on the NPU. If your architecture needs two, design the orchestration around model swaps from the start; don't discover this during integration testing.

What This Section Bought You

You should now understand:

Three independent caching layers: model caching (CACHE_DIR), KV cache (stateful variables), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING)

LLMPipeline.start_chat() is how you keep KV state across turns on NPU; the runtime auto-aligns to the new sequence

Greedy-only on NPU's classic static pipeline; beam search is on the iGPU or CPU, not NPU

Prefix caching means different things on CPU/GPU vs NPU through OVMS; the same flag, different mechanisms

The 8K context "limit" is a validated preview, not a hard ceiling; query MAX_PROMPT_LEN at runtime

The "sequential execution" claim is an OVMS scheduling policy, not an NPU hardware limit; direct Runtime use can submit multiple async InferRequests

For sentence-level M2M-100 translation, none of this matters; the constraints bind at document-level and on other agentic workloads

The next section closes Chapter 2 by moving from state to decision-making: given a working M2M-100 + NPU setup, what does a reasoning loop cost, and how do you bound it?
