2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

Section 2.1 established that the KV cache wall is real but binds only at long context. This section covers the engineering: how OpenVINO GenAI actually manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly named caching mechanisms applies where. The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that production code is often misconfigured because its author conflated two layers of the cache stack.

The Three Caching Mechanisms (and Why They're Confused)

Before going further, untangle these three things, which all live in OpenVINO and all sound similar:

Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.

KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat().

Prefix caching (cross-request reuse of the prefill phase). When two requests start with the same prompt prefix, the KV state from that prefix is reused. The --enable_prefix_caching flag in OVMS controls this — but it means different things on different devices, which we'll get to.

These are three independent layers. You can have any combination on or off. Production NPU agents typically want all three on.

What `LLMPipeline.start_chat()` Does

From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables — state tensors hidden from the IR I/O surface — and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.

```python
import openvino_genai as ov_genai

# CACHE_DIR turns on model caching, so later runs skip the NPU compile step.
pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")

pipe.start_chat()                                     # open the chat session
print(pipe.generate("Hello!", max_new_tokens=50))     # full prefill
print(pipe.generate("And what is your name?",         # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()                                    # reset the KV state
```
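
The alignment step described above can be pictured as a longest-common-prefix comparison between the cached token history and the newly tokenized sequence. The following is a minimal sketch of that idea, not the OpenVINO GenAI implementation: `split_divergent_suffix` and the plain list-of-token-IDs representation are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def split_divergent_suffix(cached: List[int], new: List[int]) -> Tuple[int, List[int]]:
    """Return (tokens_to_keep, suffix_to_prefill).

    tokens_to_keep is the length of the longest common prefix of the cached
    token history and the new sequence; everything in `new` past that point
    still has to be prefilled.
    """
    keep = 0
    for cached_tok, new_tok in zip(cached, new):
        if cached_tok != new_tok:
            break
        keep += 1
    return keep, new[keep:]


# Turn 1 left five tokens in the cache; turn 2 re-tokenizes the whole chat.
cached_history = [1, 15, 27, 9, 2]
new_sequence = [1, 15, 27, 9, 2, 44, 51, 3]
keep, suffix = split_divergent_suffix(cached_history, new_sequence)
print(keep, suffix)  # 5 [44, 51, 3]
```

When `keep` comes out smaller than the cached length (for example because re-tokenization changed an earlier token), a real runtime would also have to trim its cached state back to `keep` before prefilling the suffix; the sketch only reports the split point.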

On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.

The NPU LLM constraint nobody documents loudly enough: the classic static NPU pipeline offers greedy decoding and no beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT. Going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but on NPU there is no beam-search alternative.
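
One way to act on this constraint is to route requests by decoding mode. The sketch below is a hypothetical helper (`pick_device` is not an OpenVINO API) that keeps greedy requests on NPU and sends beam-search requests elsewhere; the fallback device name is an assumption to adapt to your machine.

```python
def pick_device(num_beams: int, fallback: str = "GPU") -> str:
    """Choose a target device from the requested decoding mode.

    Beam search (num_beams > 1) is not available on the NPU pipeline,
    so those requests go to `fallback`; greedy decoding stays on NPU.
    """
    if num_beams < 1:
        raise ValueError("num_beams must be at least 1")
    return "NPU" if num_beams == 1 else fallback


print(pick_device(1))  # NPU
print(pick_device(4))  # GPU
```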

Prefix Caching — The Version-Stamped Reality

The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.

On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.

On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.

The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer — but the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.
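
The two-layer behaviour can be written down as a small lookup. This is an illustrative model of the paragraph above, not OVMS code: `explain_options` and its return labels are hypothetical, and the option list is copied from the text.

```python
# Options the OVMS docs list as ignored by the scheduler on Stateful servables.
IGNORED_ON_STATEFUL = {
    "cache_size", "dynamic_split_fuse", "max_num_batched_tokens",
    "max_num_seq", "enable_prefix_caching", "cache_eviction_config",
    "sparse_attention_config",
}


def explain_options(options: dict, target_device: str) -> dict:
    """Label each option with the layer that acts on it."""
    report = {}
    for name in options:
        if name not in IGNORED_ON_STATEFUL:
            report[name] = "other"            # the text makes no claim here
        elif target_device != "NPU":
            report[name] = "ovms-scheduler"   # continuous-batching path
        elif name == "enable_prefix_caching":
            # Ignored by OVMS scheduling, but forwarded to the plugin as
            # NPUW_LLM_ENABLE_PREFIX_CACHING (2025.4+).
            report[name] = "npu-plugin"
        else:
            report[name] = "ignored"
    return report


print(explain_options({"enable_prefix_caching": True, "cache_size": 2}, "NPU"))
# {'enable_prefix_caching': 'npu-plugin', 'cache_size': 'ignored'}
```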

Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.

Chunked Prefill and the 8K "Ceiling"

OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
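
The alignment constraint is plain integer arithmetic. The sketch below assumes the alignment rounds up to the next multiple; the text above does not say which direction the plugin rounds, so treat that as an assumption. Both helper names are hypothetical.

```python
def align_max_prompt_len(max_prompt_len: int, chunk_size: int = 1024) -> int:
    """Round max_prompt_len up to the next multiple of chunk_size."""
    if max_prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("lengths must be positive")
    return ((max_prompt_len + chunk_size - 1) // chunk_size) * chunk_size


def prefill_chunks(prompt_tokens: int, chunk_size: int = 1024) -> int:
    """Number of fixed-size chunks a prompt occupies during chunked prefill."""
    return (prompt_tokens + chunk_size - 1) // chunk_size


print(align_max_prompt_len(3000))  # 3072
print(align_max_prompt_len(4096))  # 4096
print(prefill_chunks(2500))        # 3
```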

The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.

The properties to remember, with defaults:

| Property | Default | Effect |
|---|---|---|
| `MAX_PROMPT_LEN` | 1024 | Max input prompt tokens on static-shape NPU pipeline |
| `MIN_RESPONSE_LEN` | 128 (was 150 pre-2025.3) | Min new tokens reserved |
| `NPUW_LLM_PREFILL_CHUNK_SIZE` | 1024 | Granularity of chunked prefill |
| `PREFILL_HINT` | `DYNAMIC` (since 2025.3) | `STATIC` to revert to old behavior |
| `GENERATE_HINT` | `FAST_COMPILE` | `BEST_PERF` for runtime perf at compile cost |
| `NPUW_LLM_ENABLE_PREFIX_CACHING` | `NO` | Enables NPU prefix cache (2025.4+) |
| `CACHE_DIR` | unset | Strongly recommended on NPU |
| `PERFORMANCE_HINT` | `LATENCY` | `THROUGHPUT` allows up to 4 concurrent infer requests |
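
A simple way to avoid a hardcoded 8192 is to keep one resolved property set and check prompts against it. The sketch below copies the defaults from the table above; `resolve_properties` and `prompt_fits` are hypothetical helpers, not part of OpenVINO.

```python
# Defaults as listed in the table above (2025.3+ values).
NPU_LLM_DEFAULTS = {
    "MAX_PROMPT_LEN": 1024,
    "MIN_RESPONSE_LEN": 128,
    "NPUW_LLM_PREFILL_CHUNK_SIZE": 1024,
    "PREFILL_HINT": "DYNAMIC",
    "GENERATE_HINT": "FAST_COMPILE",
    "NPUW_LLM_ENABLE_PREFIX_CACHING": "NO",
    "PERFORMANCE_HINT": "LATENCY",
}


def resolve_properties(overrides: dict) -> dict:
    """Merge user overrides over the defaults, rejecting unknown keys."""
    known = set(NPU_LLM_DEFAULTS) | {"CACHE_DIR"}  # CACHE_DIR has no default
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown properties: {sorted(unknown)}")
    return {**NPU_LLM_DEFAULTS, **overrides}


def prompt_fits(num_prompt_tokens: int, properties: dict) -> bool:
    """Check a prompt against the configured limit instead of a literal 8192."""
    return num_prompt_tokens <= properties["MAX_PROMPT_LEN"]


props = resolve_properties({"MAX_PROMPT_LEN": 4096, "CACHE_DIR": ".ovcache"})
print(prompt_fits(3000, props))  # True
print(prompt_fits(5000, props))  # False
```

The resolved dictionary could then be handed to the pipeline as keyword properties, in the same way the earlier example passes `CACHE_DIR`, so the guard and the compiled model read the same value.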

The "Sequential Execution" Claim — What It Actually Means

OVMS docs state: "OpenVINO Model Server with NPU acceleration process the requests sequentially. For that reason, benchmarking should be performed in max_concurrency set to 1." This is not an NPU hardware limit. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. PR #27875 even added an opt-in property, NPU_RUN_INFERENCES_SEQUENTIALLY, that defaults to false; an opt-in to force sequential execution would be pointless unless the default were parallel-capable.

The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC latency workloads. For direct OpenVINO Runtime use (not OVMS), multiple InferRequest objects can be submitted async, and tile-level parallelism (ov::intel_npu::tiles) is real. The book distinguishes these layers because conflating them produces wrong mental models — and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.
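
For the direct-Runtime case, a minimal sketch with the OpenVINO Python API might look like the following. It assumes a plain stateless model with a single input (not a stateful LLM session), and `queue_depth` and `run_async` are hypothetical helper names. The `openvino` import is deferred so the sizing helper can be used without the package installed.

```python
def queue_depth(optimal_requests: int, pending_jobs: int) -> int:
    """Use no more parallel requests than the device suggests or the work needs."""
    return max(1, min(optimal_requests, pending_jobs))


def run_async(model_path: str, inputs: list, device: str = "NPU") -> list:
    """Submit inputs through an AsyncInferQueue (needs openvino and a model file)."""
    import openvino as ov  # deferred import; only this function requires it

    core = ov.Core()
    compiled = core.compile_model(model_path, device,
                                  {"PERFORMANCE_HINT": "THROUGHPUT"})
    optimal = compiled.get_property("OPTIMAL_NUMBER_OF_INFER_REQUESTS")
    queue = ov.AsyncInferQueue(compiled, queue_depth(optimal, len(inputs)))

    results = [None] * len(inputs)

    def on_done(request, index):
        # Copy the first output; the request's buffers are reused afterwards.
        results[index] = request.get_output_tensor(0).data.copy()

    queue.set_callback(on_done)
    for index, tensor in enumerate(inputs):
        queue.start_async(tensor, userdata=index)
    queue.wait_all()
    return results


print(queue_depth(4, 10))  # 4
print(queue_depth(4, 2))   # 2
```

This only illustrates that the Runtime layer accepts several in-flight requests; it says nothing about OVMS, which serializes Stateful servables as described above.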

Eviction Policies — Mostly Not Your Problem on NPU

The standard production LLM concerns about KV cache eviction — LRU, LFU, FIFO, importance-weighted retention — apply to PagedAttention-based serving on CPU/GPU. On NPU's StaticLLMPipeline, the cache is per-request and bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to evict from. Eviction happens when the session ends (finish_chat() or process exit).
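
Because the bound is MAX_PROMPT_LEN + MIN_RESPONSE_LEN tokens, the per-session KV footprint can be estimated up front with the usual transformer formula. The sketch below assumes 2-byte (FP16) cache entries and a made-up model shape; the actual NPU layout may pad or quantize differently, so read the result as an estimate only.

```python
def kv_cache_bytes(max_prompt_len: int, min_response_len: int,
                   num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Upper bound on KV state for one session.

    Each cached token stores one key and one value vector per layer,
    hence the leading factor of 2.
    """
    tokens = max_prompt_len + min_response_len
    return 2 * num_layers * num_kv_heads * head_dim * tokens * bytes_per_value


# Hypothetical model shape, chosen only to make the arithmetic concrete.
size = kv_cache_bytes(1024, 128, num_layers=32, num_kv_heads=8, head_dim=128)
print(size / 2**20)  # 144.0 (MiB)
```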

This simplifies a lot. It also means you can't share KV cache across users on NPU the way you would on a server-side GPU deployment. Single-user AI-PC workloads are the design center.

Context Management for M2M-100 Translation

For sentence-level English-to-French translation, encoder input rarely exceeds 64 tokens and decoder output rarely exceeds 96. With MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128, the entire context budget fits comfortably within the static-shape envelope of any NPU generation. Chunked prefill, prefix caching, the 8K ceiling: none of it matters for sentence MT.

The discussion belongs in the chapter because:

- M2M-100 generalizes to document-level translation at T = 1K–2K, where the constraints start to bite.
- The constraints transfer directly to other agentic seq2seq workloads (summarization, retrieval-augmented translation, ASR-translation pipelines) where context grows.
- The OpenVINO version compatibility and configuration story applies to every NPU-served LLM, not just translation. If you also run Phi-3.5-mini for an "explain this translation" tool (the worked example in Chapter 5.2), all of the above applies.

For the worked example, the practical answer is "configure for short context, don't overthink it." For your real agent, the practical answer might be different, and now you have the levers.

What This Section Bought You

You should now understand:

- Three independent caching layers: model caching (CACHE_DIR), KV cache (stateful variables), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING).
- LLMPipeline.start_chat() is how you keep KV state across turns on NPU; the runtime auto-aligns to the new sequence.
- NPU's classic static pipeline is greedy-only; beam search is available on the iGPU or CPU, not on NPU.
- Prefix caching means different things on CPU/GPU vs NPU through OVMS: the same flag, different mechanisms.
- The 8K context "limit" is a validated preview, not a hard ceiling; query MAX_PROMPT_LEN at runtime.
- The "sequential execution" claim is an OVMS scheduling policy, not an NPU hardware limit; direct Runtime use can submit multiple async InferRequests.
- For sentence-level M2M-100 translation, none of this matters; the constraints bind at document level and on other agentic workloads.

The next section closes Chapter 2 by moving from state to decision-making: given a working M2M-100 + NPU setup, what does a reasoning loop cost, and how do you bound it?
