2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

The distinction between KV cache capacity (what you keep in memory) and KV cache bandwidth (what you stream per token) is subtle and worth being precise about, because it sets the operational window for what an agent can do in real time. This section descends into the implementation details: what does KV cache engineering look like in practice, and where do the OpenVINO APIs and caching layers fit?

Stateful KV Caching: In-Memory and On-Disk

OpenVINO's LLMPipeline (for decoder-only models) and the older OpenVINO 2025.3 GenAI interface expose KV caching through stateful models that hold KV state across multiple infer() calls.

A stateless forward pass recomputes the full context on every token:

outputs = model(prompt_tokens + [new_token])  # Expensive at each step

A stateful forward pass reuses KV from the previous step:

# Illustrative pseudocode: start_chat(prompt_tokens) and generate_next() are
# conceptual stand-ins, not OpenVINO GenAI method signatures.
# First call (prefill): starts the chat session; KV state is held internally
outputs = model.start_chat(prompt_tokens)

# Subsequent calls (decode): feed only the new token, read cached KV
for step in range(num_steps):
    outputs = model.generate_next(new_token)
    # The model's internal KV state grows by one position per step:
    # [batch, num_kv_heads, prompt_len + step + 1, head_dim] for self-attention
    # Each step projects K/V for the new token only; attention still reads the
    # whole cache, so a step costs O(seq_len) instead of a full recompute

This is exposed in OpenVINO via LLMPipeline.start_chat() and LLMPipeline.finish_chat(), or via the lower-level stateful pipeline API that manages the KV variable allocation.
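To make the stateless/stateful contrast concrete, here is a minimal sketch in plain Python. It is a toy single-head attention, not OpenVINO code: `project`, `stateless_step`, and `ToyKVCache` are hypothetical names, and the "projections" are fixed scale factors standing in for learned weights. It only shows that caching K/V gives the same output as recomputing them for the whole prefix.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def project(token_vec, scale):
    """Toy stand-in for a learned projection: a fixed elementwise scale."""
    return [scale * x for x in token_vec]

def stateless_step(tokens):
    """Re-project K and V for every token, then attend from the last one."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.5), keys, values)

class ToyKVCache:
    """Keeps K/V from earlier steps so each call projects only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, token):
        self.keys.append(project(token, 0.5))
        self.values.append(project(token, 2.0))
        return attend(project(token, 1.5), self.keys, self.values)

tokens = [[0.1, 0.3], [0.7, -0.2], [0.4, 0.9], [-0.5, 0.6]]
cache = ToyKVCache()
for i, tok in enumerate(tokens):
    cached_out = cache.step(tok)
    full_out = stateless_step(tokens[: i + 1])
    # Same numbers either way; only the amount of recomputation differs.
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out))
print(len(cache.keys))
```

Note that `attend` still loops over every cached position, so the saving is in the projections and in not re-running earlier tokens through the model, not in the attention read itself.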

On-disk KV caching is a feature of OpenVINO 2025.4+: the prefix cache (KV cached across different prompts that share a prefix, described later in this section) can be memory-mapped to disk, reducing hot DRAM footprint. This is not the same as KV cache spilling; it's a deliberate optimization for scenarios with many similar prompts (e.g., RAG where the retrieval context is shared).

The Three Layers of Caching

OpenVINO has three distinct caching mechanisms that developers often confuse:

1. Model caching (CACHE_DIR). The compiled blob (the IR XML + weights compiled to NPU bytecode) is written to disk on first compilation, then loaded from disk on subsequent runs. This is enabled by setting the CACHE_DIR property, for example via core.set_property({"CACHE_DIR": path}) or by passing CACHE_DIR when constructing the pipeline. Runtime: saves 30–60 seconds on cold start, costs ~1–3 seconds on warm start (load from disk, validate, run). Scope: global per model, not per-session.

2. KV cache (stateful model state). The key-value cache for attention is held in memory as model variables. Managed via start_chat() and finish_chat() on LLMPipeline, or directly via InferRequest variable state for lower-level APIs. Runtime: O(seq_len × num_kv_heads × head_size) memory per layer for each of K and V; each decode step appends one position instead of recomputing the whole prefix. Scope: per-session (one chat session = one KV state buffer).

3. Prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING). A newer feature (2025.4+) that caches the KV of common prompt prefixes across different requests. If you make multiple requests that share a long context prefix (e.g., system prompt + retrieved documents), the KV for the prefix is computed once and reused. Mechanism differs per device: on CPU/GPU it uses copy-on-write; on NPU it's a different path through the compiler. Runtime: saves recompute on shared prefixes, costs extra memory for the cache table. Scope: global per model (shared across all sessions).

These are orthogonal. You can have model caching (bytecode on disk) + KV caching (current session's attention memory) + prefix caching (shared prompt prefixes across sessions), all at once. The confusion arises because they all have "cache" in the name and all improve performance, but at different scopes.

KV Cache Precision and Quantization

The KV cache is almost always kept in FP16 or higher precision on NPU, even if weights are INT4 or INT8. Why? Because the attention mechanism (the softmax in particular) is sensitive to numerical precision; quantizing the KV to INT8 often causes noticeable degradation in output quality, particularly on longer contexts where accumulated rounding error matters.

The most aggressive documented combination is NF4 weights + FP16 KV (Lunar Lake NPU 4 only, 2025.3+): the weights drop to NF4 while the KV is still held at FP16. Going further (e.g., INT4 KV) is not validated and likely to cause accuracy loss.
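A small numeric sketch can show why dropping KV values to INT8 is riskier than keeping them at FP16. The quantizer below is an assumption chosen for simplicity (symmetric, one scale per tensor); it is not OpenVINO's scheme, and the sample values are hypothetical. Real KV quantization usually uses finer-grained scales, which narrows the gap.

```python
import struct

def fp16_roundtrip(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

def int8_roundtrip(values):
    """Symmetric per-tensor INT8: one scale shared by every element."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) * scale for v in values]

def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))

# Hypothetical key-vector entries: mostly small values plus one outlier.
keys = [0.01, -0.04, 0.3, -0.75, 1.9, 12.0]

fp16_err = max_abs_error(keys, [fp16_roundtrip(v) for v in keys])
int8_err = max_abs_error(keys, int8_roundtrip(keys))
print(f"fp16 max abs error: {fp16_err:.6f}")
print(f"int8 max abs error: {int8_err:.6f}")
```

With a single outlier setting the scale, the small entries collapse to zero under INT8, while FP16 keeps them to within a fraction of a percent. Errors of this kind feed directly into the attention scores before the softmax.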

For M2M-100 1.2B at 128 tokens:

- Weights at INT4: 600 MB
- KV cache at FP16: 25 MB
- Total hot memory: ~625 MB (fits comfortably)

For an 8B model at 2K context:

- Weights at INT4: 4 GB
- KV cache at FP16: ~400 MB (rough estimate for 8B with GQA)
- Total: ~4.4 GB (fits within Lunar Lake's 16 GB, but now memory bandwidth contention becomes real)
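The KV figures above follow from a simple product, sketched here. The model shapes are assumptions, not values taken from this section: 24 decoder layers with 16 heads of 64 dimensions for M2M-100 1.2B, and a Llama-3-8B-style layout (32 layers, 8 KV heads of 128 dimensions) for the 8B case. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, positions, bytes_per_value=2):
    """Bytes for keys plus values: 2 tensors x layers x heads x dim x positions."""
    return 2 * layers * kv_heads * head_dim * positions * bytes_per_value

MIB = 1024 * 1024

# Assumed M2M-100 1.2B decoder shape: 24 layers, 16 heads of 64 dims.
m2m_self = kv_cache_bytes(24, 16, 64, positions=128)   # decoder self-attention
m2m_cross = kv_cache_bytes(24, 16, 64, positions=128)  # cross-attention over the source
print((m2m_self + m2m_cross) / MIB)

# Assumed Llama-3-8B-style shape: 32 layers, 8 KV heads (GQA) of 128 dims.
llm_2k = kv_cache_bytes(32, 8, 128, positions=2048)
print(llm_2k / MIB)
```

Under these assumptions the M2M-100 total comes to 24 MiB, in line with the 25 MB figure, and the 8B case comes to 256 MiB at FP16. That is the same order of magnitude as the rough ~400 MB estimate; the exact number depends on the layer count and the number of KV heads of the specific model.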

What `LLMPipeline.start_chat()` Does

Inside a chat session, the KV state persists across generate() calls, so each new turn prefills only the suffix that is not already cached; finish_chat() releases the state.

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")
pipe.start_chat()
print(pipe.generate("Hello!", max_new_tokens=50))    # full prefill
print(pipe.generate("And what is your name?",        # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()


One NPU constraint to keep in mind: the classic static NPU pipeline supports greedy decoding only. There is no beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT; going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.

Prefix Caching — The Version-Stamped Reality


For encoder-decoder seq2seq models like M2M-100, prefix caching on OVMS is a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently: the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.

Chunked Prefill and the 8K "Ceiling"

The 2025.4 release notes promote 8K-token context to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet).


OVMS (OpenVINO Model Server) and Sequential Execution

A caveat from the documentation: OpenVINO Model Server (OVMS) with NPU Stateful models has a "process requests sequentially" policy. Some readers interpret this as "the NPU hardware can only process one request at a time." That's misleading.

What it actually means: the OVMS scheduler for NPU Stateful servables is currently single-threaded, so requests are queued and handled one at a time. The NPU hardware itself supports multiple concurrent inference requests (via async InferRequest in the native API), tile-level parallelism, and frequency scaling. The sequential policy is a scheduler choice in OVMS, not a hardware limitation.

If you're using the native OpenVINO Runtime API directly (not OVMS), you can use async requests and parallelize inference. OVMS is the higher-level serving layer; if you're building an agent system in-process (which is typical for edge/on-device agents), you're likely using the Runtime API and don't hit this constraint.
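The difference between a scheduler policy and a backend capability can be illustrated with a toy simulation. Nothing here touches OpenVINO or OVMS: `FakeDevice` is a hypothetical backend that accepts several in-flight requests, and the thread pool size plays the role of the serving layer's scheduling policy.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class FakeDevice:
    """Hypothetical backend that tolerates several in-flight requests."""
    def __init__(self):
        self._lock = threading.Lock()
        self._active = 0
        self.peak_in_flight = 0

    def infer(self, gate=None):
        with self._lock:
            self._active += 1
            self.peak_in_flight = max(self.peak_in_flight, self._active)
        if gate is not None:
            gate.wait(timeout=5)  # hold the request until its peers have arrived
        with self._lock:
            self._active -= 1

def run(workers, requests, gate=None):
    """Push `requests` calls through a scheduler with `workers` threads."""
    device = FakeDevice()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(device.infer, gate) for _ in range(requests)]
        for f in futures:
            f.result()
    return device.peak_in_flight

serial_peak = run(workers=1, requests=4)
parallel_peak = run(workers=4, requests=4, gate=threading.Barrier(4))
print(serial_peak, parallel_peak)
```

The same device sees one request at a time behind a single-worker scheduler and four at a time behind a four-worker one. The limit in the first case lives entirely in the layer that feeds the device.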

KV Cache Memory Lifecycle

For a long-running agent that cycles through multiple requests (interact with user, call a tool, observe, reason, repeat), KV cache management matters:

# Pseudocode for an agent loop (start_chat(prompt) and generate_next() are
# illustrative, not OpenVINO GenAI signatures)
model = ov.LLMPipeline(...)
for i in range(num_steps):
    # Prefill: prompt grows with accumulated observations
    outputs = model.start_chat(accumulated_prompt)  # Allocates KV state

    for j in range(decode_tokens):
        # Decode: uses cached KV
        outputs = model.generate_next()

    # Finish: release KV state
    model.finish_chat()  # Clears the KV buffer

    # Between steps: observations are appended to accumulated_prompt.
    # accumulated_prompt grows; the KV cache is discarded and recreated on the next prefill

At each start_chat(), a fresh KV allocation is made. If your accumulated prompt has grown to 2K tokens, the KV allocation is 2K-sized and you're committed to that footprint until finish_chat(). If the next step's prompt is 3K tokens, a new 3K allocation is made.

For long-running agents, this means you can't accumulate unbounded history within a single KV buffer; you have to either:

- Truncate the context window (recent-only history, myopic agent)
- Use external long-term memory (vector store) and retrieve into fresh prefill (stateless from KV perspective, but stateful in application logic)
- Use sliding-window KV (drop oldest tokens, recompute if needed)
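The first option, truncating to recent history, can be sketched as a small budget check applied before each prefill. This is an illustration under stated assumptions: `fit_history` is a hypothetical helper, and the default token counter splits on whitespace, which is only a stand-in for the model's real tokenizer.

```python
def fit_history(turns, max_prompt_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the newest turns whose combined token count fits the budget.

    count_tokens defaults to whitespace splitting, which is only a stand-in
    for a real tokenizer.
    """
    kept, used = [], 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > max_prompt_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

history = [
    "user: translate the first paragraph",
    "tool: paragraph one translated",
    "user: now summarize it briefly",
    "tool: summary ready",
]
window = fit_history(history, max_prompt_tokens=10)
print(window)
```

With a budget of 10 tokens, only the last two turns survive. In a real agent the budget would be derived from the prompt length the pipeline was compiled for, and dropped turns could be written to external memory for later retrieval rather than discarded.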

Implications for M2M-100 Deployment

M2M-100 is an encoder-decoder, so the KV lifecycle is:

- Encoder prefill: source text is encoded once; the encoder KV (the cross-attention keys and values projected from the encoder output) is computed and held for the entire decode phase
- Decoder decode: new target tokens are generated; decoder self-attention KV grows, cross-attention KV is reused from the encoder

The encoder KV doesn't get reused across multiple different source sentences; it's specific to that encode-decode pair. If you have a batch of translation requests, each one brings its own encoder KV. This is why batching M2M-100 (or any seq2seq) is awkward on NPU: you can't trivially share encoder KV across different inputs.
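The two KV lifetimes in a seq2seq request can be tracked with a toy bookkeeping class. No model is involved; `ToySeq2SeqSession` is a hypothetical name and it only counts cache entries, to show which part is fixed per request and which part grows.

```python
class ToySeq2SeqSession:
    """Tracks KV sizes for one translation request (no real model involved)."""
    def __init__(self, source_tokens):
        # Cross-attention K/V: one entry per source token, built once at encode time.
        self.cross_kv_len = len(source_tokens)
        self.encoder_runs = 1
        # Decoder self-attention K/V: starts empty, grows by one per target token.
        self.self_kv_len = 0

    def decode_step(self):
        self.self_kv_len += 1
        return self.cross_kv_len, self.self_kv_len

requests = [["Hello", "world"], ["How", "are", "you", "today"]]
sessions = [ToySeq2SeqSession(src) for src in requests]
for session in sessions:
    for _ in range(3):
        session.decode_step()

print([(s.cross_kv_len, s.self_kv_len, s.encoder_runs) for s in sessions])
```

Each request carries a cross-attention cache sized to its own source sentence, so two requests with different inputs have nothing in that cache to share.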

What This Section Bought You

You should now understand:

- Stateful KV caching via start_chat() / finish_chat() amortizes prefill cost across decode steps
- Three orthogonal caching layers: model cache (bytecode), KV cache (session state), prefix cache (shared prefix KV)
- KV cache is kept at FP16+, even when weights are INT4, for numerical stability
- OVMS sequential execution is a scheduler policy, not a hardware limit; native Runtime API supports async
- KV cache allocation commits to context length at start_chat() time; unbounded history requires external memory
- M2M-100's encoder KV is per-request, not shared across requests; this is why seq2seq batching is complex
- Long-term agent memory lives outside the model; KV cache is working memory only

The next section applies all of this to the agent's reasoning loop: given bounded context and bounded KV cache, what reasoning architectures actually work?
