2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing
Theory in 2.1 said the KV cache wall is real but binds at long context. This section is about the engineering: how OpenVINO GenAI actually manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly-named caching mechanisms applies where. The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that a lot of production code is misconfigured because the author conflated two layers of the cache stack.
The Three Caching Mechanisms (and Why They're Confused)
Before going further, untangle these three things, which all live in OpenVINO and all sound similar:
Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.
KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat().
Prefix caching (cross-request reuse of the prefill phase). When two requests start with the same prompt prefix, the KV state from that prefix is reused. The --enable_prefix_caching flag in OVMS controls this — but it means different things on different devices, which we'll get to.
These are three independent layers. You can have any combination on or off. Production NPU agents typically want all three on.
What LLMPipeline.start_chat() Does
From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables — state tensors hidden from the IR I/O surface — and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")
pipe.start_chat()
print(pipe.generate("Hello!", max_new_tokens=50))  # full prefill
print(pipe.generate("And what is your name?",      # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()
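To make the "only the divergent suffix" idea concrete, here is a minimal pure-Python sketch. It is not the GenAI implementation of align_kv_cache_and_history(); the function name divergent_suffix and the toy token IDs are hypothetical, and the sketch assumes a plain position-by-position comparison of token IDs.

```python
from typing import List, Tuple


def divergent_suffix(cached: List[int], new: List[int]) -> Tuple[int, List[int]]:
    """Return (reusable_prefix_len, tokens_still_to_prefill).

    Illustrative only: walks both token sequences in lockstep and stops at
    the first mismatch. Everything before that point could reuse cached KV
    state; everything after it has to be prefilled.
    """
    keep = 0
    for old_tok, new_tok in zip(cached, new):
        if old_tok != new_tok:
            break
        keep += 1
    return keep, new[keep:]


# Toy token IDs (hypothetical): turn 2 extends the cached history by three tokens.
history = [1, 15, 27, 42, 8]
turn_2 = history + [99, 100, 101]
keep, suffix = divergent_suffix(history, turn_2)
print(keep, suffix)
```

If the new sequence diverges early (for example because the chat template re-tokenizes the history differently), the reusable prefix shrinks and more of the prompt has to be prefilled again.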
On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.
The NPU LLM constraint nobody documents loudly enough: greedy decoding only on the classic static NPU pipeline. No beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT — going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.
Prefix Caching — The Version-Stamped Reality
The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.
On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.
On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.
The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer — but the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.
Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.
Chunked Prefill and the 8K "Ceiling"
OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
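The alignment constraint is simple arithmetic. The helper below is a sketch, not the plugin's code: the name align_prompt_len is hypothetical, and rounding up (rather than down) is an assumption made here so that the requested prompt length always fits.

```python
def align_prompt_len(max_prompt_len: int, chunk_size: int = 1024) -> int:
    """Round max_prompt_len up to the next multiple of chunk_size.

    Assumption: rounding up. The runtime's own alignment policy may differ.
    """
    if max_prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("lengths must be positive")
    # Ceiling division, then scale back to a token count.
    chunks = -(-max_prompt_len // chunk_size)
    return chunks * chunk_size


for requested in (1, 1024, 3000, 8192):
    print(requested, "->", align_prompt_len(requested))
```

Doing this rounding in your own configuration code keeps the value you log equal to the value the pipeline actually compiles for.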
The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.
The properties to remember, with defaults:
| Property | Default | Notes |
| --- | --- | --- |
| MAX_PROMPT_LEN | | |
| MIN_RESPONSE_LEN | | |
| NPUW_LLM_PREFILL_CHUNK_SIZE | 1024 | Granularity of chunked prefill |
| PREFILL_HINT | DYNAMIC | STATIC to revert to old behavior |
| GENERATE_HINT | FAST_COMPILE | BEST_PERF for runtime perf at compile cost |
| NPUW_LLM_ENABLE_PREFIX_CACHING | NO | Enables NPU prefix cache (2025.4+) |
| CACHE_DIR | unset | Strongly recommended on NPU |
| PERFORMANCE_HINT | LATENCY | THROUGHPUT allows up to 4 concurrent infer requests |
The Cache is Already There: Don't Throw It Away
The KV cache is a resource you can engineer. Done well, cache reuse across turns and prefix sharing across sessions can cut prefill latency by 5–10x. Done poorly, you end up recomputing the same thing over and over while telling the user "Thinking…"
Here's the most common waste in agent implementations: a multi-turn conversation where each turn re-tokenizes and re-prefills the entire history.
Turn 1 prompt: [system] [tools] [user_msg_1] → prefill → generate
Turn 2 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] → re-prefill the whole thing → generate
Turn 3 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] [assistant_2] [user_msg_3] → re-prefill again →...
By turn 5, you're spending most of your TTFT recomputing context whose KV values you had perfectly valid copies of moments ago. On an NPU where prefill is compute-bound and runs at perhaps 100–500 tokens/second, this is brutal.
The fix is cache persistence: keep the KV cache resident across turns, and only prefill the new tokens at the end. This is sometimes called session caching or conversational caching.
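The cost of re-prefilling the full history on every turn is easy to put in numbers. The sketch below is back-of-envelope arithmetic only: the function name prefill_tokens, the 1000-token fixed prefix, and the 200 new tokens per turn are all hypothetical, and it simplifies by counting every new token as prefill work (including the previous assistant reply).

```python
from typing import List, Tuple


def prefill_tokens(prefix_len: int, new_per_turn: List[int]) -> Tuple[int, int]:
    """Return (tokens prefilled with full re-prefill, tokens prefilled with a resident cache).

    prefix_len   -- fixed system prompt + tool definitions
    new_per_turn -- tokens appended to the context on each turn
    """
    naive = 0    # re-prefill the whole context every turn
    cached = 0   # prefill the prefix once, then only the new tokens
    context = prefix_len
    for turn, new_tokens in enumerate(new_per_turn):
        context += new_tokens
        naive += context
        cached += new_tokens + (prefix_len if turn == 0 else 0)
    return naive, cached


naive, cached = prefill_tokens(prefix_len=1000, new_per_turn=[200] * 5)
print(naive, cached, naive / cached)
```

Under these assumed numbers the naive loop prefills 8000 tokens over five turns against 2000 with a resident cache, and the gap widens with every additional turn because the naive cost grows quadratically in the number of turns.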
The "Sequential Execution" Claim — What It Actually Means
OVMS docs state: "OpenVINO Model Server with NPU acceleration process the requests sequentially. For that reason, benchmarking should be performed in max_concurrency set to 1." This is not an NPU hardware limit. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. PR #27875 even added an opt-in property NPU_RUN_INFERENCES_SEQUENTIALLY defaulting to false; the existence of an opt-in to force sequential execution proves the default is parallel-capable.
The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC latency workloads. For direct OpenVINO Runtime use (not OVMS), multiple InferRequest objects can be submitted async, and tile-level parallelism (ov::intel_npu::tiles) is real. The book distinguishes these layers because conflating them produces wrong mental models, and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.
Eviction Policies: Mostly Not Your Problem on an NPU
The standard production LLM concerns about KV cache eviction (LRU, LFU, FIFO, importance-weighted retention) are mostly not your problem on NPU. With StaticLLMPipeline, the cache is a fixed allocation of MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to manage, and a session's state lives until it is released (finish_chat() or process exit).
This simplifies a lot. It also means you do not manage the cache the way you would on a server-side GPU deployment. Single-user AI-PC workloads are the design center.
Context Management for M2M-100 Translation
For sentence-level English-to-French translation, encoder input rarely exceeds 64 tokens and decoder output rarely exceeds 96. With MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128, the entire context budget fits comfortably under any NPU's static-shape envelope, on any generation. Chunked prefill, prefix caching, the 8K ceiling: none of them matters for sentence MT.
The discussion belongs in this chapter because M2M-100 generalizes to document-level translation at T = 1K–2K, where the constraints start to bite.
For the worked example, the practical answer is "configure for short context, don't overthink it." For your agent, the practical answer might be different, and these are the levers.
What Makes This Tricky
Session caching is the single biggest latency win available in most NPU agent implementations, and most teams discover they need it after their first round of user testing. If it is so obviously good, why isn't it the default? Because NPU runtimes often make it hard. With static shapes, for example, the cache is allocated at its maximum size up front, which wastes memory for short sessions.
The state of the art on each platform varies. Core ML's stateful prediction APIs, ONNX Runtime's RunWithBinding and IO binding, and OpenVINO's State API are the kinds of mechanisms you'll need to use. None of them is plug-and-play; expect to read the runtime documentation carefully and test on your specific NPU before assuming it works.
Prefix Caching Across Sessions
Session caching reuses the cache within one conversation. Prefix caching reuses it across conversations, sharing the KV values of common prefixes.
The most common prefix in an agent is the system prompt plus tool definitions. These are often 500–2000 tokens and identical across every session. Prefilling them on every cold start is pure waste: they never change between users.
A prefix cache stores the precomputed KV tensors for that fixed prefix and reattaches them when a new session begins. The cost is one prefill at deployment time, paid back across every session that follows.
This is harder to implement than session caching, but when it works, prefix caching takes cold-start TTFT from "wait two seconds before the first character appears" to "respond instantly." For chat-style agents, that crosses a real perceptual threshold.
Eviction: When the Cache Won't Fit
In a long session, the cache eventually grows past your memory budget. You need an eviction policy. The naive choice (drop the oldest tokens) usually works worse than people expect, because of the attention sink phenomenon mentioned in 2.1: the first few tokens carry disproportionate weight, and dropping them degrades output quality.
Better policies in roughly increasing order of complexity:
Sliding window with sinks: keep the first N tokens (the "sink") and the most recent M tokens, drop everything in between. Simple, effective for chat, and easy to implement.
Importance-based eviction: track per-token attention scores during generation and preferentially keep tokens that other tokens attended to. More accurate, but adds bookkeeping overhead and is harder to vectorize on the NPU.
Hierarchical summarization: when older context overflows, summarize it into a few tokens and use that summary going forward. This blends caching with the agent's own memory system, which we'll come back to in 2.3.
Whatever you choose, make eviction explicit and logged. Silent eviction produces mysterious quality regressions: the agent suddenly forgets something it said three turns ago and the user blames the model.
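The simplest of these policies, the sliding window with sinks, fits in a few lines. This is a sketch that works on token positions only and touches no runtime state; the function name sink_window_keep, the logger name, and the window sizes are hypothetical. Whether a given NPU runtime lets you drop cache rows at all is a separate question.

```python
import logging
from typing import List

log = logging.getLogger("kv_cache")


def sink_window_keep(seq_len: int, n_sink: int, n_recent: int) -> List[int]:
    """Return the token positions to keep: the first n_sink plus the last n_recent."""
    if n_sink < 0 or n_recent < 0:
        raise ValueError("n_sink and n_recent must be non-negative")
    if seq_len <= n_sink + n_recent:
        return list(range(seq_len))  # everything still fits; nothing is evicted
    kept = list(range(n_sink)) + list(range(seq_len - n_recent, seq_len))
    # Make eviction visible rather than silent.
    log.info("evicted %d tokens (positions %d..%d)",
             seq_len - len(kept), n_sink, seq_len - n_recent - 1)
    return kept


print(sink_window_keep(seq_len=10, n_sink=2, n_recent=3))  # [0, 1, 7, 8, 9]
```

The log line is the important part for debugging: when the agent forgets something from three turns ago, the log shows whether those positions were evicted.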
When You Have Multiple Agents in One Process
If your application runs more than one agent — say, a small classifier on the cascade pattern plus a larger reasoning model — they compete for the same NPU memory. The cache management becomes multi-tenant:
Most edge SoCs can only run one large model at a time on the NPU. If your architecture needs two, design the orchestration around model swaps from the start — don't discover this when integration testing.
A Diagnostic for Cache Health
If you've been profiling well, you can answer these questions about your agent right now:
If you can't answer these, instrument the runtime until you can. The cache is invisible by default, which is exactly why it's where the easy wins hide.
What This Section Bought You
Cache engineering is one of those areas where the difference between "we set up an agent" and "we shipped an agent" is most visible. You should now understand:
- Three independent caching layers: model caching (CACHE_DIR), KV cache (stateful variables), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING)
- LLMPipeline.start_chat() is how you keep KV state across turns
- The 8K ceiling is not a hard limit: query MAX_PROMPT_LEN at runtime
- The "sequential execution" claim is an OVMS scheduling policy, not an NPU hardware limit
With memory and cache engineering in hand, the next section closes Chapter 2 by moving from state to decision-making: the reasoning loop, tool selection, and how decisions get made within these tight constraints. Given a working M2M-100 + NPU setup, what does a reasoning loop cost, and how do you bound it?