2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing
Theory in 2.1 said the KV cache wall is real but binds at long context. The distinction between the KV cache itself (what you keep in memory) and KV cache bandwidth (what you stream per token) is subtle and worth being precise about, because it sets the operational window for what an agent can do in real time. This section is about the engineering: how OpenVINO 2025.3 GenAI actually manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly-named mechanisms applies where.
Stateful KV Caching: In-Memory and On-Disk
OpenVINO's LLMPipeline (for decoder-only models) and the older GenAI interface expose KV caching through stateful models that hold KV state across multiple infer() calls.
A stateless forward pass recomputes the full context on every token:
outputs = model(prompt_tokens + [new_token]) # Expensive at each step
A stateful forward pass reuses KV from the previous step:
# Pseudocode: start_chat()/generate_next() stand in for the prefill and decode phases
# First call (prefill): processes the whole prompt, keeps the KV state internally
outputs = model.start_chat(prompt_tokens)
# Subsequent calls (decode): feed only the new token, read cached KV
for step in range(num_steps):
    outputs = model.generate_next(new_token)
    # The model's internal KV state grows by one position per step:
    # [1, num_heads, prompt_len + step + 1, head_dim] for self-attention
    # Each step projects K/V for one new token instead of re-encoding the
    # whole sequence; attention still reads every cached position
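The contrast between the two passes can be made concrete with a toy cost model. This is a minimal sketch, not OpenVINO code: it only counts how many token positions each strategy has to push through the model, and the function names are hypothetical.

```python
# Toy cost model: count token positions processed while decoding.
# "Processing" stands in for the per-token work of a real transformer.

def stateless_cost(prompt_len: int, num_steps: int) -> int:
    """Re-encode the whole sequence (prompt + all generated tokens) at every step."""
    total = 0
    for step in range(num_steps):
        total += prompt_len + step + 1  # sequence length at this step
    return total


def stateful_cost(prompt_len: int, num_steps: int) -> int:
    """Prefill the prompt once, then process one new token per step."""
    return prompt_len + num_steps


if __name__ == "__main__":
    print("stateless:", stateless_cost(prompt_len=128, num_steps=64))
    print("stateful: ", stateful_cost(prompt_len=128, num_steps=64))
```

The gap widens quadratically with the number of decode steps, which is why the stateful path is the default for generation.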
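The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that a lot of code is misconfigured because its author conflated two layers of the cache stack.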
The Three Caching Mechanisms (and Why They're Confused)
Before going further, untangle these three things, which all live in OpenVINO and all sound similar:
Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.
KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat() and LLMPipeline.finish_chat(), or via the lower-level stateful pipeline API that manages the KV variable allocation.
On-disk KV caching is a feature of OpenVINO 2025.4+: the prefix cache (cached KV shared across different prompts with a common prefix) can be memory-mapped to disk, reducing hot DRAM footprint. This is not the same as KV cache spilling; it's a deliberate optimization for scenarios with many similar prompts (e.g., RAG where the retrieval context is shared).
The Three Layers of Caching
OpenVINO has three distinct caching mechanisms that developers often confuse:
1. Model caching (CACHE_DIR). The compiled blob (the IR XML + weights compiled to NPU bytecode) is written to disk on first compilation, then loaded from disk on subsequent runs. This is enabled by setting the CACHE_DIR property, for example via core.set_property({"CACHE_DIR": path}) or by passing CACHE_DIR when the model is compiled. Runtime: saves 30–60 seconds on cold start, costs ~1–3 seconds on warm start (load from disk, validate, run). Scope: global per model, not per-session.
2. KV cache (stateful model state). The key-value cache for attention is held in memory as model variables. Managed via start_chat() and finish_chat() on LLMPipeline, or directly via InferRequest variable state for lower-level APIs. Runtime: O(seq_len × num_kv_heads × head_size) memory per layer; each decode step computes K/V for only the new token. Scope: per-session (one chat session = one KV state buffer).
3. Prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING). A newer feature (2025.4+) that caches the KV of common prompt prefixes across different requests (cross-request reuse of the prefill phase). If you make multiple requests that share a long context prefix (e.g., system prompt + retrieved documents), the KV for the prefix is computed once and reused. The --enable_prefix_caching flag in OVMS controls this, but the mechanism differs per device: on CPU/GPU it uses copy-on-write; on NPU it's a different path through the compiler. Runtime: saves recompute on shared prefixes, costs extra memory for the cache table. Scope: global per model (shared across all sessions).
These three layers are independent. You can have model caching (bytecode on disk) + KV caching (current session's attention memory) + prefix caching (shared prompt prefixes across sessions) in any combination, or all at once. Production NPU agents typically want all three on. The confusion arises because they all have "cache" in the name and all improve performance, but at different scopes.
KV Cache Precision and Quantization
The KV cache is almost always kept in FP16 or higher precision on NPU, even if weights are INT4 or INT8. Why? Because the attention mechanism (the softmax in particular) is sensitive to numerical precision; quantizing the KV to INT8 often causes noticeable degradation in output quality, particularly on longer contexts where accumulated rounding error matters.
The most aggressive documented combination is NF4 weights + FP16 KV (Lunar Lake NPU 4 only, 2025.3+): the weights drop to NF4 while the KV is still held at FP16. Going further (e.g., INT4 KV) is not validated and likely to cause accuracy loss.
For M2M-100 1.2B at 128 tokens:
For an 8B model at 2K context:
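A back-of-envelope estimate for both cases can be sketched as follows. The architecture numbers are assumptions chosen for illustration, not figures from this chapter: 24 decoder layers with 16 heads of dimension 64 for M2M-100 1.2B (decoder self-attention only, cross-attention KV excluded), and 32 layers with 8 KV heads of dimension 128 for a grouped-query 8B model. FP16 (2 bytes per element) is assumed throughout.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Size of a self-attention KV cache: keys + values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


MIB = 1024 * 1024

# Assumed M2M-100 1.2B decoder shape, 128 generated tokens
m2m_bytes = kv_cache_bytes(num_layers=24, num_kv_heads=16, head_dim=64, seq_len=128)

# Assumed 8B grouped-query model, 2K context
llm_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=2048)

if __name__ == "__main__":
    print(f"M2M-100 1.2B @ 128 tokens: {m2m_bytes / MIB:.1f} MiB")
    print(f"8B model @ 2K context:     {llm_bytes / MIB:.1f} MiB")
```

Under these assumptions the translation model's decoder cache is tens of times smaller than the 8B model's, which is the practical reason the short-context case rarely needs tuning.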
What LLMPipeline.start_chat() Does
From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables (state tensors hidden from the IR I/O surface) and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")
pipe.start_chat()
print(pipe.generate("Hello!", max_new_tokens=50))  # full prefill
print(pipe.generate("And what is your name?",      # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()
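The "submit only the divergent suffix" idea can be illustrated in a few lines. This is a sketch of the general technique, not the actual implementation of align_kv_cache_and_history(); the function name divergent_suffix is hypothetical.

```python
from typing import List, Tuple


def divergent_suffix(cached_tokens: List[int],
                     new_tokens: List[int]) -> Tuple[int, List[int]]:
    """Return (number of cached positions that can be reused, tokens still to prefill)."""
    reused = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break  # first mismatch: everything after this must be recomputed
        reused += 1
    return reused, new_tokens[reused:]


if __name__ == "__main__":
    history = [1, 2, 3, 4]            # tokens already in the KV state
    next_turn = [1, 2, 3, 4, 5, 6]    # same history plus a new user turn
    print(divergent_suffix(history, next_turn))
```

When the new sequence simply extends the history, only the new turn is prefilled; if the history was edited, reuse stops at the first token that differs.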
On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.
The NPU LLM constraint nobody documents loudly enough: greedy decoding only on the classic static NPU pipeline. No beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT — going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.
Prefix Caching — The Version-Stamped Reality
The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.
On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.
On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.
The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer. Yet the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU, because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.
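Whatever the device-specific mechanism, the underlying idea of prefix caching is the same: key cached KV by the tokens that produced it, and reuse the longest matching prefix. The sketch below shows a generic block-hash scheme; it is an illustration only and not a description of how OVMS or the NPU plugin implement it. The block size, class name, and placeholder KV payload are all assumptions.

```python
import hashlib
from typing import Dict, List

BLOCK = 4  # tokens per cache block (illustrative value)


def block_keys(tokens: List[int]) -> List[str]:
    """One key per full block; each key depends on every token before it."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % BLOCK
    for i in range(0, full, BLOCK):
        running.update(repr(tokens[i:i + BLOCK]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class PrefixCache:
    def __init__(self) -> None:
        self.blocks: Dict[str, str] = {}

    def insert(self, tokens: List[int]) -> None:
        for key in block_keys(tokens):
            self.blocks.setdefault(key, "kv-placeholder")  # real KV tensors would go here

    def lookup(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV is already cached."""
        reused = 0
        for key in block_keys(tokens):
            if key not in self.blocks:
                break
            reused += BLOCK
        return reused


if __name__ == "__main__":
    cache = PrefixCache()
    cache.insert(list(range(10)))                      # e.g. a shared system prompt
    print(cache.lookup(list(range(8)) + [99, 100, 101, 102]))
```

Because each key chains over all preceding blocks, a block is only reused when the entire prefix before it matches, which is the property that makes cross-request sharing safe.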
Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.
Chunked Prefill and the 8K "Ceiling"
OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
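The alignment constraint is simple arithmetic. The sketch below assumes the adjustment rounds up to the next multiple of the chunk size; the rounding direction and the helper names are assumptions for illustration, not taken from the PR.

```python
def align_max_prompt_len(max_prompt_len: int, chunk_size: int = 1024) -> int:
    """Round max_prompt_len up to the nearest multiple of chunk_size (assumed direction)."""
    return -(-max_prompt_len // chunk_size) * chunk_size


def prefill_chunks(prompt_len: int, chunk_size: int = 1024) -> int:
    """How many fixed-size prefill chunks a prompt of this length needs."""
    return -(-prompt_len // chunk_size)


if __name__ == "__main__":
    print(align_max_prompt_len(3000))  # a non-aligned request
    print(align_max_prompt_len(4096))  # already aligned, unchanged
    print(prefill_chunks(3000))
```

A practical consequence is that a prompt slightly over a chunk boundary pays for a whole extra chunk of prefill.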
The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.
The properties to remember, with defaults:
- MAX_PROMPT_LEN
- MIN_RESPONSE_LEN
- NPUW_LLM_PREFILL_CHUNK_SIZE: 1024
- PREFILL_HINT: DYNAMIC (default since 2025.3) or STATIC
- GENERATE_HINT: FAST_COMPILE or BEST_PERF
- NPUW_LLM_ENABLE_PREFIX_CACHING: NO
- CACHE_DIR
- PERFORMANCE_HINT: LATENCY or THROUGHPUT
The "Sequential Execution" Claim: What It Actually Means
OpenVINO Model Server (OVMS) with NPU Stateful models has a "process the requests sequentially" policy, and the OVMS docs add: "For that reason, benchmarking should be performed in max_concurrency set to 1." Some readers interpret this as "the NPU hardware can only process one request at a time." That's misleading: this is not an NPU hardware limit.
What it actually means: the OVMS scheduler for NPU Stateful servables is currently single-threaded, so requests are queued and handled one at a time. The NPU hardware itself supports multiple concurrent inference requests (via async InferRequest in the native API), tile-level parallelism, and frequency scaling. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. PR #27875 even added an opt-in property NPU_RUN_INFERENCES_SEQUENTIALLY defaulting to false; the existence of an opt-in to force sequential execution proves the default is parallel-capable.
The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC latency workloads. The sequential behavior is a scheduler choice in OVMS, not a hardware limitation.
If you're using the native OpenVINO Runtime API directly (not OVMS), multiple InferRequest objects can be submitted async, and tile-level parallelism (ov::intel_npu::tiles) is real. OVMS is the higher-level serving layer; if you're building an agent system in-process (which is typical for edge/on-device agents), you're likely using the Runtime API and don't hit this constraint. The book distinguishes these layers because conflating them produces wrong mental models, and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.
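The difference between a serializing scheduler and a pool of concurrent request slots can be seen in a small queueing sketch. This is a pure-Python simulation under idealized assumptions (identical request durations, perfect scaling across slots); it makes no claim about real NPU throughput, and the four-slot figure is just an illustrative choice.

```python
import heapq
from typing import List


def makespan(durations: List[float], slots: int) -> float:
    """Finish time when requests are served FIFO on `slots` parallel workers."""
    free_at = [0.0] * slots          # time at which each slot becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + duration
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


if __name__ == "__main__":
    requests = [1.0] * 8
    print("serialized (1 slot):", makespan(requests, slots=1))
    print("concurrent (4 slots):", makespan(requests, slots=4))
```

The serialized case models the scheduling policy, not the hardware: the same request list finishes sooner as soon as the layer above stops queueing everything behind one slot.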
Eviction Policies: Mostly Not Your Problem on NPU
The standard production LLM concerns about KV cache eviction (LFU, FIFO, importance-weighted retention) apply to PagedAttention-based serving on CPU/GPU. On NPU's StaticLLMPipeline, the cache is per-request and bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to evict from. Eviction happens when the session ends (finish_chat() or process exit).
This simplifies a lot. It also means you can't share a KV cache across users on NPU.
KV Cache Memory Lifecycle
For a long-running agent that cycles through multiple requests (interact with user, call a tool, observe, reason, repeat), KV cache management matters:
# Sketch of an agent loop; model_dir, initial_prompt, num_steps,
# decode_tokens, and observe() are placeholders
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline(model_dir, "NPU")
accumulated_prompt = initial_prompt
for i in range(num_steps):
    pipe.start_chat()  # opens a session with fresh KV state
    # Prefill over the accumulated prompt, then decode against the cached KV
    reply = pipe.generate(accumulated_prompt, max_new_tokens=decode_tokens)
    pipe.finish_chat()  # resets the KV state
    # Between steps: the reply and new observations are appended, so the
    # prompt grows and the next step prefills it from scratch
    accumulated_prompt += str(reply) + observe(reply)
At each start_chat(), a fresh KV state is created. If your accumulated prompt has grown to 2K tokens, that step's KV state covers 2K tokens and you're committed to that footprint until finish_chat(). If the next step's prompt is 3K tokens, the next session prefills all 3K from scratch.
For long-running agents, this means you can't accumulate unbounded history within a single KV buffer; unbounded history requires external memory.
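One common way to keep an agent's history inside a fixed KV budget is to retain only the most recent turns that fit and move the rest to external storage. The sketch below shows that idea; the function name, the whitespace-based token counter, and the budget value are all hypothetical stand-ins, and a real system would count tokens with the model's tokenizer.

```python
from typing import Callable, List, Tuple


def fit_history(turns: List[str], budget_tokens: int,
                count_tokens: Callable[[str], int] = lambda s: len(s.split())
                ) -> Tuple[List[str], List[str]]:
    """Keep the newest turns that fit in the budget; return (kept, spilled)."""
    kept: List[str] = []
    used = 0
    for turn in reversed(turns):          # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break                         # older turns go to external memory
        kept.append(turn)
        used += cost
    kept.reverse()                        # restore chronological order
    spilled = turns[:len(turns) - len(kept)]
    return kept, spilled


if __name__ == "__main__":
    history = ["a b c", "d e", "f g h i", "j"]
    kept, spilled = fit_history(history, budget_tokens=6)
    print("in context:", kept)
    print("external:  ", spilled)
```

The spilled turns can be summarized or indexed for retrieval, so the prompt that reaches prefill stays within the configured context limit.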
Context Management for M2M-100 Translation
M2M-100 is an encoder-decoder, so its KV lifecycle differs from a decoder-only LLM. The encoder KV doesn't get reused across requests because each input brings its own encoder KV. This is why batching M2M-100 (or any seq2seq) is awkward on NPU: you can't trivially share encoder KV across different inputs.
For sentence-level English-to-French translation, the configuration is MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128. The discussion still belongs in the chapter: for the worked example, the practical answer is "configure for short context, don't overthink it." For your real agent, the practical answer might be different, and now you have the levers.
What This Section Bought You
You should now understand:
- Stateful KV caching via start_chat()/finish_chat() amortizes prefill cost across decode steps
- The three caching layers are independent: model caching (CACHE_DIR), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING), and the session KV cache (LLMPipeline.start_chat())
- KV cache is bounded by MAX_PROMPT_LEN and scoped to the session opened at start_chat() time; unbounded history requires external memory
- M2M-100's encoder KV is per-input and not shared across requests
The next section closes Chapter 2 by moving from state to decision-making in the agent's reasoning loop: given bounded context and a bounded KV cache on a working M2M-100 + NPU setup, what does a reasoning loop cost, which architectures actually work, and how do you bound it?