2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

Section 2.1 treated the KV cache as a memory cost and argued that the KV wall is real but binds at long context. This section is about the engineering: how OpenVINO GenAI actually manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly named caching mechanisms applies where. The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that a lot of production code is misconfigured because the author conflated two layers of the cache stack.

The Three Caching Mechanisms (and Why They're Confused)

Before going further, untangle these three things, which all live in OpenVINO and all sound similar:

Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.

KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat().

Prefix caching (cross-request reuse of the prefill phase). When two requests start with the same prompt prefix, the KV state from that prefix is reused. The --enable_prefix_caching flag in OVMS controls this — but it means different things on different devices, which we'll get to.

These are three independent layers. You can have any combination on or off. Production NPU agents typically want all three on.
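As a minimal sketch of treating the three layers as independent switches, the helper below builds a property dictionary. The helper name `npu_cache_properties` is hypothetical. The property names `CACHE_DIR` and `NPUW_LLM_ENABLE_PREFIX_CACHING` come from this section, but whether your installed OpenVINO GenAI version accepts the prefix-caching property through `LLMPipeline` keyword arguments is an assumption to verify against its documentation.

```python
def npu_cache_properties(model_cache_dir=".ovcache", prefix_caching=True):
    """Build a property dict for the layers that are controlled by properties.

    Hypothetical helper for illustration; it only assembles a dict and does
    not call into OpenVINO.
    """
    props = {}

    # Layer 1: model caching. Leaving CACHE_DIR unset disables it.
    if model_cache_dir is not None:
        props["CACHE_DIR"] = model_cache_dir

    # Layer 2: the KV cache has no property here. It is exercised by
    # bracketing generate() calls with start_chat() / finish_chat().

    # Layer 3: NPU prefix caching (described in this section as 2025.4+).
    props["NPUW_LLM_ENABLE_PREFIX_CACHING"] = "YES" if prefix_caching else "NO"
    return props


all_on = npu_cache_properties()
no_model_cache = npu_cache_properties(model_cache_dir=None, prefix_caching=False)
print(all_on)
print(no_model_cache)
```

The resulting dictionary would be unpacked into the pipeline constructor in the same way `CACHE_DIR` is passed in the chat example in this section; that step is left out so the sketch runs without an NPU.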

What LLMPipeline.start_chat() Does

From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables — state tensors hidden from the IR I/O surface — and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.

import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                            CACHE_DIR=".ovcache")
pipe.start_chat()
print(pipe.generate("Hello!", max_new_tokens=50))    # full prefill
print(pipe.generate("And what is your name?",        # only the suffix prefills
                    max_new_tokens=50))
pipe.finish_chat()
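The comparison step described for align_kv_cache_and_history() can be illustrated with a plain-Python sketch. This is not the library's implementation: the function name `divergent_suffix` and the token-id lists are hypothetical, and the sketch only shows the idea of keeping the longest common prefix and submitting the rest.

```python
def divergent_suffix(cached_ids, new_ids):
    """Return (reusable_prefix_len, tokens_still_needing_prefill).

    cached_ids: token ids whose KV state is already held (hypothetical input).
    new_ids:    token ids of the full new sequence for this turn.
    """
    common = 0
    for cached_tok, new_tok in zip(cached_ids, new_ids):
        if cached_tok != new_tok:
            break
        common += 1
    # KV entries past `common` no longer match the new sequence and would
    # have to be dropped; only new_ids[common:] needs to be prefilled.
    return common, new_ids[common:]


# Pure append: the whole cached history is reusable.
print(divergent_suffix([1, 2, 3, 4], [1, 2, 3, 4, 5, 6]))
# History diverges at position 2 (for example, after re-tokenization).
print(divergent_suffix([1, 2, 9], [1, 2, 3, 4]))
```

The second call shows why alignment matters: if re-tokenizing the history changes an earlier token, only the part of the cache before the divergence point is still valid.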

On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.

The NPU LLM constraint nobody documents loudly enough: greedy decoding only on the classic static NPU pipeline. No beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT — going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.

Prefix Caching — The Version-Stamped Reality

The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.

On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.

On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.

The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer — but the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.

Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.

Chunked Prefill and the 8K "Ceiling"

OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
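The alignment constraint is simple arithmetic, sketched below. The helper `align_max_prompt_len` is hypothetical, and rounding up (rather than down) is an assumption of this sketch; the text above only states that the value must end up a multiple of the chunk size.

```python
def align_max_prompt_len(requested_len: int, chunk_size: int = 1024) -> int:
    """Round requested_len up to the nearest multiple of chunk_size.

    Rounding direction is an assumption for illustration; check the runtime's
    actual behavior before relying on it.
    """
    if requested_len <= 0 or chunk_size <= 0:
        raise ValueError("lengths must be positive")
    return ((requested_len + chunk_size - 1) // chunk_size) * chunk_size


for requested in (1, 2048, 3000):
    print(requested, "->", align_max_prompt_len(requested))
```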

The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.

The properties to remember, with defaults:

| Property | Default | Effect |
| --- | --- | --- |
| MAX_PROMPT_LEN | 1024 | Max input prompt tokens on static-shape NPU pipeline |
| MIN_RESPONSE_LEN | 128 (was 150 pre-2025.3) | Min new tokens reserved |
| NPUW_LLM_PREFILL_CHUNK_SIZE | 1024 | Granularity of chunked prefill |
| PREFILL_HINT | DYNAMIC (since 2025.3) | STATIC to revert to old behavior |
| GENERATE_HINT | FAST_COMPILE | BEST_PERF for runtime perf at compile cost |
| NPUW_LLM_ENABLE_PREFIX_CACHING | NO | Enables NPU prefix cache (2025.4+) |
| CACHE_DIR | unset | Strongly recommended on NPU |
| PERFORMANCE_HINT | LATENCY | THROUGHPUT allows up to 4 concurrent infer requests |

The KV cache is also a resource you can engineer. Done well, cache reuse across turns and prefix sharing across sessions can cut prefill latency by 5–10x. Done poorly, you end up recomputing the same thing over and over while telling the user "Thinking…"

The Cache is Already There: Don't Throw It Away

Here's the most common waste in agent implementations: a multi-turn conversation where each turn re-tokenizes and re-prefills the entire history.

Turn 1 prompt: [system] [tools] [user_msg_1] → prefill → generate
Turn 2 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] → re-prefill the whole thing → generate
Turn 3 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] [assistant_2] [user_msg_3] → re-prefill again →...

By turn 5, you're spending most of your TTFT recomputing context whose KV values you had perfectly valid copies of moments ago. On an NPU where prefill is compute-bound and runs at perhaps 100–500 tokens/second, this is brutal.

The fix is cache persistence: keep the KV cache resident across turns, and only prefill the new tokens at the end. This is sometimes called session caching or conversational caching.

The savings:

| Scenario | Without session cache | With session cache |
| --- | --- | --- |
| Turn 5, 4K tokens of history | Prefill ~4K + ~50 new = 4050 tokens | Prefill 50 new tokens |
| TTFT on 200 tok/s NPU | ~20 seconds | ~250 ms |

This is the single biggest latency win available in most NPU agent implementations, and most teams discover they need it after their first round of user testing.
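The session-cache savings quoted in this section (prefilling about 4,050 tokens versus 50 at 200 tokens/second) can be checked with a few lines of arithmetic. This is a sketch under a simplifying assumption that prefill throughput is constant across the prompt; the function name `prefill_seconds` is hypothetical, and real TTFT also includes tokenization and scheduling overhead.

```python
def prefill_seconds(tokens_to_prefill: int, prefill_tok_per_s: float) -> float:
    """Prefill time assuming constant throughput (a simplification)."""
    return tokens_to_prefill / prefill_tok_per_s


HISTORY_TOKENS = 4000       # "4K tokens of history" in the scenario
NEW_TOKENS = 50             # the new user message
PREFILL_TOK_PER_S = 200.0   # the scenario's example NPU prefill rate

# Without session caching the whole history is prefilled again.
without_cache_s = prefill_seconds(HISTORY_TOKENS + NEW_TOKENS, PREFILL_TOK_PER_S)
# With session caching only the new tokens are prefilled.
with_cache_s = prefill_seconds(NEW_TOKENS, PREFILL_TOK_PER_S)

print(f"without session cache: {without_cache_s:.2f} s")
print(f"with session cache:    {with_cache_s * 1000:.0f} ms")
```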

The "Sequential Execution" Claim — What It Actually Means

OVMS docs state: "OpenVINO Model Server with NPU acceleration process the requests sequentially. For that reason, benchmarking should be performed in max_concurrency set to 1." This is not an NPU hardware limit. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. PR #27875 even added an opt-in property NPU_RUN_INFERENCES_SEQUENTIALLY defaulting to false; the existence of an opt-in to force sequential execution shows the default is parallel-capable.

The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC workloads. For direct OpenVINO Runtime use (not OVMS), multiple InferRequest objects can be submitted async, and tile-level parallelism (ov::intel_npu::tiles) is real. The book distinguishes these layers because conflating them produces wrong mental models, and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.

Eviction Policies: Mostly Not Your Problem on an NPU

The standard production LLM concerns about KV cache eviction (LRU, LFU, FIFO, importance-weighted retention) apply to PagedAttention-based serving on CPU/GPU. On NPU's StaticLLMPipeline, the cache is per-request and bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to evict from. Eviction happens when the session ends (finish_chat() or process exit).

This simplifies a lot. It also means you can't share KV cache across users on NPU the way you would on a server-side GPU deployment. Single-user AI-PC workloads are the design center.

Context Management for M2M-100 Translation

For sentence-level English-to-French translation, encoder input rarely exceeds 64 tokens and decoder output rarely exceeds 96. With MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128, the entire context budget fits comfortably under any NPU's static-shape envelope, on any generation. Chunked prefill, prefix caching, the 8K ceiling: none of this matters for sentence-level MT.

The discussion belongs in this chapter because:

• M2M-100 generalizes to document-level translation at T = 1K–2K, where the constraints start to bite
• The constraints transfer directly to other agentic seq2seq workloads (summarization, retrieval-augmented translation, ASR-translation pipelines) where context grows
• The OpenVINO version compatibility and configuration story applies to every NPU-served LLM, not just translation. If you also run Phi-3.5-mini for an "explain this translation" tool (the worked example in Chapter 5.2), all of the above applies

For the worked example, the practical answer is "configure for short context, don't overthink it." For your real agent, the practical answer might be different, and now you have the levers.

What Makes Session Caching Tricky on an NPU

If session caching is so obviously good, why isn't it the default? Because NPU runtimes often make it hard:

• Fixed-shape graphs: many NPU compilers want to compile the model for a specific input shape (sequence length, batch size). A growing cache means the shape changes between calls, which can trigger recompilation or fall back to CPU.
• Static memory allocation: NPU memory is often pre-allocated in fixed blocks. A cache that grows arbitrarily doesn't fit that pattern.
• Pre-padded buffers: the workaround is to pre-allocate the cache at maximum size and use attention masks to ignore unused slots. This wastes memory for short sessions.
• No direct cache access: some NPU runtimes expose model inference but not the underlying cache tensors, making external persistence impossible without runtime patches.

The state of the art on each platform varies. Core ML's stateful prediction APIs, ONNX Runtime's RunWithBinding and IO binding, and OpenVINO's State API are the kinds of mechanisms you'll need to use. None of them is plug-and-play; expect to read the runtime documentation carefully and test on your specific NPU before assuming it works.

Prefix Caching Across Sessions

Session caching reuses the cache within one conversation. Prefix caching reuses it across conversations, sharing the KV values of common prefixes.

The most common prefix in an agent is the system prompt plus tool definitions. These are often 500–2000 tokens and identical across every session. Prefilling them on every cold start is pure waste; they never change between users.

A prefix cache stores the precomputed KV tensors for that fixed prefix and reattaches them when a new session begins. The cost is one prefill at deployment time, paid back across every session that follows.

This is harder to implement than session caching because:

• The cache tensors are large (hundreds of MB) and must persist to disk
• They're tied to the exact model and exact tokenization; a tokenizer change invalidates the cache
• Loading them must be faster than recomputing them, which is not automatic on slow storage

When it works, prefix caching takes cold-start TTFT from "wait two seconds before the first character appears" to "respond instantly." For chat-style agents, that crosses a real perceptual threshold.

Eviction: When the Cache Won't Fit

In a long session, the cache eventually grows past your memory budget. You need an eviction policy. The naive choice (drop the oldest tokens) usually works worse than people expect, because of the attention sink phenomenon mentioned in 2.1: the first few tokens carry disproportionate weight, and dropping them degrades output quality.

Better policies in roughly increasing order of complexity:

Sliding window with sinks: keep the first N tokens (the "sink") and the most recent M tokens, drop everything in between. Simple, effective for chat, and easy to implement.

Importance-based eviction: track per-token attention scores during generation and preferentially keep tokens that other tokens attended to. More accurate, but adds bookkeeping overhead and is harder to vectorize on the NPU.

Hierarchical summarization: when older context overflows, summarize it into a few tokens and use that summary going forward. This blends caching with the agent's own memory system, which we'll come back to in 2.3.

Whatever you choose, make eviction explicit and logged. Silent eviction produces mysterious quality regressions: the agent suddenly forgets something it said three turns ago and the user blames the model.
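A minimal sketch of the sliding-window-with-sinks policy, with the eviction logged rather than silent. The function name `sliding_window_with_sinks` and the logger name are hypothetical. The sketch only selects which cache positions to keep; applying that selection to real KV tensors is runtime-specific and assumes a runtime that exposes them, which is not shown here.

```python
import logging

logger = logging.getLogger("kv_eviction")  # hypothetical logger name


def sliding_window_with_sinks(num_cached: int, num_sinks: int, window: int):
    """Return the cache positions to keep.

    Keeps the first `num_sinks` positions (the attention sinks) and the most
    recent `window` positions, dropping everything in between.
    """
    if num_cached <= num_sinks + window:
        # Everything fits inside the budget: nothing to evict.
        return list(range(num_cached))

    keep = list(range(num_sinks)) + list(range(num_cached - window, num_cached))
    # Make the eviction explicit so quality regressions can be traced to it.
    logger.info(
        "evicted %d of %d cached positions (sinks=%d, window=%d)",
        num_cached - len(keep), num_cached, num_sinks, window,
    )
    return keep


print(sliding_window_with_sinks(num_cached=10, num_sinks=2, window=3))
print(sliding_window_with_sinks(num_cached=4, num_sinks=2, window=3))
```

Logging at the point of selection keeps the policy and its audit trail in one place, so a user report of "it forgot what it said" can be matched against an eviction event.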

When You Have Multiple Agents in One Process

If your application runs more than one agent (say, a small classifier on the cascade pattern plus a larger reasoning model), they compete for the same NPU memory. The cache management becomes multi-tenant:

• Cold-swap: only one model resident at a time, swap in the other on demand. Simple, but you pay model-load latency on every switch.
• Warm-coexist: both models resident, sharing memory budget. Faster switching but tighter constraints on each model's working set.
• Time-share with checkpointing: serialize the inactive model's state to host memory, restore on switch. Hybrid approach used when the NPU can't hold both models simultaneously.

Most edge SoCs can only run one large model at a time on the NPU. If your architecture needs two, design the orchestration around model swaps from the start; don't discover this during integration testing.

A Diagnostic for Cache Health

If you've been profiling well, you can answer these questions about your agent right now:

• What percentage of TTFT is prefill compute vs. tokenization vs. model load vs. cache management overhead?
• How does TTFT scale with turn number in a typical conversation? If turn 10 takes 3x as long as turn 1, you don't have effective session caching.
• What's the cache hit rate on prefix caching, if you have it? Below 80% means your prefix isn't actually fixed and you're paying for a cache that doesn't help.
• At what session length does memory pressure trigger eviction, and what does your agent's quality look like just after that boundary?

If you can't answer these, instrument the runtime until you can. The cache is invisible by default, which is exactly why it's where the easy wins hide.
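Two of the diagnostic questions reduce to simple checks once you have measurements. The sketch below assumes you already collect per-turn TTFT and prefix-cache hit counts from your own instrumentation; the function names and input shapes are hypothetical, and the 3x and 80% thresholds are the rules of thumb from the questions above.

```python
def ttft_growth_ratio(ttft_ms_by_turn):
    """Ratio of the last turn's TTFT to the first turn's TTFT."""
    if len(ttft_ms_by_turn) < 2:
        raise ValueError("need at least two turns to measure growth")
    return ttft_ms_by_turn[-1] / ttft_ms_by_turn[0]


def prefix_hit_rate(hits: int, lookups: int) -> float:
    """Fraction of prefix-cache lookups that hit; 0.0 when there were none."""
    return hits / lookups if lookups else 0.0


def cache_health(ttft_ms_by_turn, prefix_hits, prefix_lookups):
    """Return a list of warnings based on the two rule-of-thumb thresholds."""
    warnings = []
    if ttft_growth_ratio(ttft_ms_by_turn) >= 3.0:
        warnings.append("TTFT grew 3x or more: session caching looks ineffective")
    if prefix_lookups and prefix_hit_rate(prefix_hits, prefix_lookups) < 0.80:
        warnings.append("prefix hit rate below 80%: the prefix may not be fixed")
    return warnings


print(cache_health([250.0, 260.0, 900.0], prefix_hits=70, prefix_lookups=100))
print(cache_health([250.0, 255.0, 260.0], prefix_hits=95, prefix_lookups=100))
```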

What This Section Bought You

Cache engineering is one of those areas where the difference between "we set up an agent" and "we shipped an agent" is most visible. You should now understand:

• Three independent caching layers: model caching (CACHE_DIR), KV cache (stateful variables), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING)
• LLMPipeline.start_chat() is how you keep KV state across turns on NPU; the runtime auto-aligns to the new sequence
• Greedy-only on NPU's classic static pipeline; beam search is on the iGPU or CPU, not NPU
• Prefix caching means different things on CPU/GPU vs NPU through OVMS; the same flag, different mechanisms
• The 8K context "limit" is a validated preview, not a hard ceiling; query MAX_PROMPT_LEN at runtime
• The "sequential execution" claim is an OVMS scheduling policy, not an NPU hardware limit; direct Runtime use can submit multiple async InferRequests
• For sentence-level M2M-100 translation, none of this matters; the constraints bind at document-level and on other agentic workloads
• Session caching across turns is the single highest-leverage latency optimization available
• Prefix caching for fixed system prompts and tool definitions eliminates cold-start prefill waste
• NPU runtime constraints make cache reuse harder than it sounds: pre-allocated buffers, fixed shapes, and limited cache access are common obstacles
• Eviction needs to be deliberate: keep sinks, log evictions, and design eviction policy alongside the rest of the agent

With memory and cache engineering in hand, the next section closes Chapter 2 by moving from state to decision-making: the reasoning loop, tool selection, and how decisions get made within these tight constraints. Given a working M2M-100 + NPU setup, what does a reasoning loop cost, and how do you bound it?

