
2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

The distinction between KV cache capacity (what you keep in memory) and KV cache bandwidth (what you stream per token) is subtle and worth being precise about, because it sets the operational window for what an agent can do in real time. Section 2.1 covered the theory. This section descends into the implementation details: how OpenVINO GenAI manages KV state on Intel NPU, what the configuration knobs do, and which of the three confusingly named caching mechanisms applies where.

Stateful KV Caching: In-Memory and On-Disk

OpenVINO's LLMPipeline (for decoder-only models) and the older interface expose KV caching through stateful models that hold KV state across multiple infer() calls.

A stateless forward pass recomputes the full context on every token:

    # Pseudocode: every step re-runs the whole sequence
    outputs = model(prompt_tokens + [new_token])  # expensive at each step

A stateful forward pass reuses KV from the previous step:

    # Pseudocode: generate_next() is schematic, not a real OpenVINO GenAI call
    # First call (prefill): starts the chat session and keeps KV state internally
    outputs = model.start_chat(prompt_tokens)

    # Subsequent calls (decode): feed only the new token, read cached KV
    for step in range(num_steps):
        outputs = model.generate_next(new_token)
        # The model's internal self-attention KV state grows by one position per step
        # Each step computes K/V for one new token, not for the whole sequence

The OpenVINO stack has matured at roughly one major capability per quarterly release since 2024.4, and the surface area is now large enough that a lot of production code is misconfigured because the author conflated two layers of the cache stack.
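To make the stateless-versus-stateful difference concrete, here is a minimal sketch in plain Python. It is a toy single-head attention decoder, not OpenVINO code: the `project` function stands in for learned K/V projections, and all names are hypothetical. It counts how many projections each strategy performs and checks that both produce the same output.

```python
import math


def project(x, scale):
    # Stand-in for a learned K or V projection: scale each feature.
    return [scale * v for v in x]


def attend(query, keys, values):
    # Scaled dot-product attention for one query over all keys/values.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_stateless(tokens):
    """Re-project every token at every step. Returns (last output, projection count)."""
    projections = 0
    out = None
    for t in range(1, len(tokens) + 1):
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        out = attend(tokens[t - 1], keys, values)
    return out, projections


def decode_cached(tokens):
    """Project only the newest token and append it to the cache."""
    keys, values = [], []
    projections = 0
    out = None
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        out = attend(x, keys, values)
    return out, projections


# Six toy "token embeddings" of width 4.
tokens = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
out_a, cost_a = decode_stateless(tokens)
out_b, cost_b = decode_cached(tokens)
print(cost_a, cost_b)  # 42 projections without a cache, 12 with one
```

Note that the cached variant still attends over every cached key at each step, so the per-step attention cost grows with sequence length; what the cache removes is the repeated projection of tokens that were already seen.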

The Three Caching Mechanisms (and Why They're Confused)

Before going further, untangle these three things, which all live in OpenVINO and all sound similar:

Model caching (CACHE_DIR). The compiled NPU blob is cached to disk so that subsequent runs skip the multi-second to multi-minute compile step. This is what Chapter 1.3's cold-start table was about. Universal across devices.

KV cache (the per-request decoder state). The standard transformer KV cache, held in OpenVINO stateful-model variables on NPU. Reused across generate() calls within a chat session via LLMPipeline.start_chat() and LLMPipeline.finish_chat(), or via the lower-level stateful pipeline API that manages the KV variable allocation.

On-disk KV caching. A feature of OpenVINO 2025.4+: the prefix cache (KV cached across different prompts that share a prefix, covered later in this section) can be memory-mapped to disk, reducing hot DRAM footprint. This is not the same as KV cache spilling; it's a deliberate optimization for scenarios with many similar prompts (e.g., RAG where the retrieval context is shared).

Each of the three layers has its own mechanism, cost profile, and scope:

1. Model caching (CACHE_DIR). The compiled blob (the IR XML + weights compiled to NPU bytecode) is written to disk on first compilation, then loaded from disk on subsequent runs. Enable it by setting the CACHE_DIR property, either via core.set_property({"CACHE_DIR": path}) or by passing CACHE_DIR when constructing the pipeline. Runtime: saves 30–60 seconds of compilation; a warm start still costs ~1–3 seconds (load from disk, validate, run). Scope: global per model, not per-session.

2. KV cache (stateful model state). The key-value cache for attention is held in memory as model variables. Managed via start_chat() and finish_chat() on LLMPipeline, or directly via InferRequest variable state for lower-level APIs. Runtime: O(seq_len × num_kv_heads × head_size) memory per layer for keys and values; each decode step computes K/V only for the new token. Scope: per-session (one chat session = one KV state buffer).

3. Prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING). A newer feature (2025.4+) for cross-request reuse of the prefill phase: it caches the KV of common prompt prefixes across different requests. If you make multiple requests that share a long context prefix (e.g., system prompt + retrieved documents), the KV for the prefix is computed once and reused. The --enable_prefix_caching flag in OVMS controls this, but the mechanism differs per device: on CPU/GPU it uses copy-on-write; on NPU it's a different path through the compiler. Runtime: saves recompute on shared prefixes, costs extra memory for the cache table. Scope: global per model (shared across all sessions).

These three layers are independent. You can turn any combination on or off: model caching (bytecode on disk) + KV caching (current session's attention memory) + prefix caching (shared prompt prefixes across sessions), all at once if you like. Production NPU agents typically want all three on. The confusion arises because they all have "cache" in the name and all improve performance, but at different scopes.

KV Cache Precision and Quantization

The KV cache is almost always kept in FP16 or higher precision on NPU, even if weights are INT4 or INT8. Why? Because the attention mechanism (the softmax in particular) is sensitive to numerical precision; quantizing the KV to INT8 often causes noticeable degradation in output quality, particularly on longer contexts where accumulated rounding error matters.

The lowest-precision documented combination is NF4 weights + FP16 KV (Lunar Lake NPU 4 only, 2025.3+), and even there the KV stays at FP16. Going further (e.g., INT4 KV) is not validated and likely to cause accuracy loss.

For M2M-100 1.2B at 128 tokens:

• Weights at INT4: 600 MB
• KV cache at FP16: 25 MB
• Total hot memory: ~625 MB (fits comfortably)

For an 8B model at 2K context:

• Weights at INT4: 4 GB
• KV cache at FP16: ~400 MB (rough estimate for 8B with GQA)
• Total: ~4.4 GB (fits within Lunar Lake's 16 GB, but now memory bandwidth contention becomes real)
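The KV figures above can be sanity-checked with a few lines of arithmetic. The sketch below assumes an M2M-100 1.2B decoder shape of 24 layers with 16 heads of width 64, counts both self-attention and cross-attention KV, and uses a hypothetical 8B shape (32 layers, 8 KV heads of width 128) for the GQA case. Those shapes are assumptions for illustration, not values read from a compiled model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, positions, bytes_per_value=2):
    """Bytes for keys + values across all layers (FP16 by default)."""
    return 2 * layers * kv_heads * head_dim * positions * bytes_per_value


# Assumed M2M-100 1.2B decoder shape: 24 layers, 16 heads of width 64.
M2M_LAYERS, M2M_HEADS, M2M_HEAD_DIM = 24, 16, 64

# Decoder self-attention KV for 128 target tokens.
self_attn = kv_cache_bytes(M2M_LAYERS, M2M_HEADS, M2M_HEAD_DIM, positions=128)
# Cross-attention KV derived from 128 encoder positions.
cross_attn = kv_cache_bytes(M2M_LAYERS, M2M_HEADS, M2M_HEAD_DIM, positions=128)
m2m_total_mb = (self_attn + cross_attn) / 1e6

# Hypothetical 8B decoder with GQA: 32 layers, 8 KV heads of width 128.
llm_total_mb = kv_cache_bytes(32, 8, 128, positions=2048) / 1e6

print(f"M2M-100 1.2B, 128 source + 128 target tokens: {m2m_total_mb:.1f} MB")
print(f"8B GQA model, 2K context: {llm_total_mb:.1f} MB")
```

Under these assumptions the M2M-100 number lands at about 25 MB, and the 8B number lands in the low hundreds of MB, the same order as the rough estimate above. The exact 8B value depends on layer count and the number of KV heads, which vary between model families.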

What LLMPipeline.start_chat() Does

From the OpenVINO GenAI source, start_chat() opens a chat session bound to a single stateful compiled model. The KV state lives inside OpenVINO stateful-model variables and state tensors, hidden from the IR I/O surface, and is reused across generate() calls. On each call the runtime invokes align_kv_cache_and_history(), which compares the new tokenized sequence to the cached state and submits only the divergent suffix. finish_chat() resets the state.

      import openvino_genai as ov_genai
      
      pipe = ov_genai.LLMPipeline("ov_phi35_mini_int4", device="NPU",
                                  CACHE_DIR=".ovcache")
      pipe.start_chat()
      print(pipe.generate("Hello!", max_new_tokens=50))    # full prefill
      print(pipe.generate("And what is your name?",        # only the suffix prefills
                          max_new_tokens=50))
      pipe.finish_chat()
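The "submit only the divergent suffix" behavior can be illustrated with a small standalone function. This is a sketch of the idea, not the actual align_kv_cache_and_history() implementation; the function name and token IDs below are hypothetical.

```python
def divergent_suffix(cached_tokens, new_tokens):
    """Return (positions_to_keep, suffix_to_submit) for a new tokenized sequence."""
    keep = 0
    limit = min(len(cached_tokens), len(new_tokens))
    # Walk the common prefix; everything after it must be (re)computed.
    while keep < limit and cached_tokens[keep] == new_tokens[keep]:
        keep += 1
    return keep, new_tokens[keep:]


cached = [1, 15, 27, 9, 4]                 # tokens already in the KV state
turn_two = [1, 15, 27, 9, 4, 88, 31, 7]    # same history plus a new user turn
keep, suffix = divergent_suffix(cached, turn_two)
print(keep, suffix)      # 5 [88, 31, 7]

edited = [1, 15, 99, 9, 4, 88]             # history diverges at position 2
keep_e, suffix_e = divergent_suffix(cached, edited)
print(keep_e, suffix_e)  # 2 [99, 9, 4, 88]
```

The second case shows why editing earlier turns is expensive: once the sequences diverge, every later position has to be prefilled again even if the tokens look the same.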
      

      On NPU specifically, the backend is StaticLLMPipeline (selected internally via is_npu_requested()), which uses fixed input shapes derived from MAX_PROMPT_LEN and MIN_RESPONSE_LEN, compiles through compile_decoder_for_npu(), and supports blob export/import via EXPORT_BLOB and BLOB_PATH for binary distribution.

      The NPU LLM constraint nobody documents loudly enough: greedy decoding only on the classic static NPU pipeline. No beam search. OVMS 2025.4 added multinomial sampling, but beam search remains unsupported on NPU. For M2M-100 translation this matters because beam-4 is the standard high-quality decode setting for NMT — going greedy on NPU costs you roughly 0.3–0.8 BLEU on FLORES devtest, but it's the only option.

      Prefix Caching — The Version-Stamped Reality

      The single most-confused area in OpenVINO docs is prefix caching, because the same --enable_prefix_caching CLI flag drives two completely different mechanisms depending on target device.

      On CPU/GPU through OVMS, the flag enables PagedAttention/Continuous-Batching prefix cache with configurable cache_size. Standard production LLM serving stuff. Works well, well-validated.

      On NPU through OVMS, the flag drives the plugin-level NPUW_LLM_ENABLE_PREFIX_CACHING:YES (added in OpenVINO 2025.4), which reduces TTFT in long-chat scenarios on the static-shape Stateful pipeline.

The OVMS docs are explicit that on Stateful (NPU) servables, cache_size, dynamic_split_fuse, max_num_batched_tokens, max_num_seq, enable_prefix_caching, cache_eviction_config, and sparse_attention_config are ignored at the OVMS scheduling layer. Yet the 2025.4 demo command still passes --enable_prefix_caching true with --target_device NPU, because OVMS now plumbs that flag through to NPUW_LLM_ENABLE_PREFIX_CACHING:YES. Both statements are simultaneously true and pertain to different layers of the stack. If your prefix-caching mental model breaks, this is usually why.

      Prefix caching for encoder-decoder seq2seq models like M2M-100 is not documented on OVMS at all. It's a gap in the public record. The OpenVINO encoder for M2M-100 is single-pass static prefill anyway, so the question of "reuse encoder state across requests" maps differently — the encoder is cheap enough that caching it sequence-by-sequence is a smaller win than for an autoregressive LLM.
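As a mental model for cross-request prefix reuse in general, here is a toy block-based prefix cache. It is not how PagedAttention or the NPUW plugin implement the feature; the class, the block size, and the string payloads standing in for KV tensors are all hypothetical.

```python
class PrefixCache:
    """Toy cross-request prefix cache keyed by fixed-size token blocks."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}  # tuple of all tokens up to a block end -> stand-in KV payload

    def lookup(self, tokens):
        """Return how many leading tokens already have cached KV."""
        hit = 0
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            if tuple(tokens[:end]) not in self.blocks:
                break
            hit = end
        return hit

    def insert(self, tokens):
        """Record KV for every full block of this prompt."""
        for end in range(self.block_size, len(tokens) + 1, self.block_size):
            self.blocks.setdefault(tuple(tokens[:end]), f"kv[0:{end}]")

    def prefill(self, tokens):
        """Return the number of tokens that still need prefill compute."""
        hit = self.lookup(tokens)
        self.insert(tokens)
        return len(tokens) - hit


cache = PrefixCache(block_size=4)
system_prompt = list(range(100, 112))       # 12 shared tokens
request_a = system_prompt + [1, 2, 3]
request_b = system_prompt + [7, 8, 9, 10, 11]

print(cache.prefill(request_a))  # 15: nothing cached yet
print(cache.prefill(request_b))  # 5: the 12-token shared prefix is reused
```

Keying each block by the full token prefix up to its end keeps the lookup correct: a block is reusable only if everything before it matches too, because attention KV at a position depends on all earlier tokens.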

      Chunked Prefill and the 8K "Ceiling"

      OpenVINO 2025.3 introduced dynamic prompts on NPU by default through PREFILL_HINT=DYNAMIC with NPUW_LLM_PREFILL_CHUNK_SIZE=1024. Setting PREFILL_HINT=STATIC reverts to the 2025.2 fixed-shape behavior. PR #31687 ("NPUW: Automatically align MAX_PROMPT_LENGTH to CHUNK_SIZE") enforces the alignment constraint that MAX_PROMPT_LEN must be a multiple of NPUW_LLM_PREFILL_CHUNK_SIZE.
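The alignment constraint and the chunking arithmetic are easy to check by hand. The helpers below are a sketch with hypothetical names; whether the runtime pads the final partial chunk is not stated here, so the example only reports the token ranges.

```python
def align_up(value, multiple):
    """Round value up to the nearest multiple (ceiling division, then scale)."""
    return -(-value // multiple) * multiple


def prefill_chunks(prompt_len, chunk_size=1024):
    """Token ranges [start, end) that a chunked prefill would walk through."""
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


print(align_up(3000, 1024))    # 3072: a MAX_PROMPT_LEN that satisfies the constraint
print(align_up(2048, 1024))    # 2048: already aligned
print(prefill_chunks(2500))    # [(0, 1024), (1024, 2048), (2048, 2500)]
```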

      The 8K context limit is not a hard architectural ceiling. The 2025.3 release notes describe it as a validated preview on specific hardware: "Longer contexts are available as preview feature on 32GB Intel Core Ultra Series 2 (with prompt size up to 8..12K tokens)." The 2025.4 notes promote it to general availability on Lunar Lake. The cap is set by where chunked-prefill activation buffers fit in DDR; smaller-RAM SKUs cap lower, and Panther Lake is positioned to extend further (no public number yet). Production code should query MAX_PROMPT_LEN at runtime rather than hardcode 8192.

      The properties to remember, with defaults:

| Property | Default | Effect |
|---|---|---|
| MAX_PROMPT_LEN | 1024 | Max input prompt tokens on static-shape NPU pipeline |
| MIN_RESPONSE_LEN | 128 (was 150 pre-2025.3) | Min new tokens reserved |
| NPUW_LLM_PREFILL_CHUNK_SIZE | 1024 | Granularity of chunked prefill |
| PREFILL_HINT | DYNAMIC (since 2025.3) | STATIC to revert to old behavior |
| GENERATE_HINT | FAST_COMPILE | BEST_PERF for runtime perf at compile cost |
| NPUW_LLM_ENABLE_PREFIX_CACHING | NO | Enables NPU prefix cache (2025.4+) |
| CACHE_DIR | unset | Strongly recommended on NPU |
| PERFORMANCE_HINT | LATENCY | THROUGHPUT allows up to 4 concurrent infer requests |

The "Sequential Execution" Claim: What It Actually Means

The OVMS docs state that Stateful models with NPU acceleration process requests sequentially and that, for that reason, benchmarking should be performed with max_concurrency set to 1. Some readers interpret this as "the NPU hardware can only process one request at a time." That's misleading; this is not an NPU hardware limit.

What it actually means: the OVMS scheduler for NPU Stateful servables is currently single-threaded, so requests are queued and handled one at a time. The NPU plugin advertises optimal_number_of_infer_requests = 4 in THROUGHPUT mode and exposes ov::range_for_async_infer_requests. The hardware itself supports multiple concurrent inference requests (via async InferRequest in the native API), tile-level parallelism, and frequency scaling. PR #27875 even added an opt-in property NPU_RUN_INFERENCES_SEQUENTIALLY defaulting to false; the existence of an opt-in to force sequential execution proves the default is parallel-capable.

The truth: for OVMS on NPU, Stateful servables intentionally serialize at the request level because each NPU LLM session owns a state-variable instance, and the scheduling policy is designed for single-user AI-PC latency workloads. It is a scheduler choice in OVMS, not a hardware limitation.

If you're using the native OpenVINO Runtime API directly (not OVMS), multiple InferRequest objects can be submitted asynchronously, and tile-level parallelism (ov::intel_npu::tiles) is real. OVMS is the higher-level serving layer; if you're building an agent system in-process (which is typical for edge/on-device agents), you're likely using the Runtime API and don't hit this constraint. The book distinguishes these layers because conflating them produces wrong mental models, and wrong mental models produce designs that try to parallelize something that won't parallelize, or refuse to parallelize something that would.
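The practical effect of a serializing scheduler versus a parallel-capable one can be seen in a small queueing simulation. This is an idealized sketch: the service times are hypothetical, all requests arrive at time zero, and real concurrent NPU requests share compute and memory bandwidth, so they would not scale this cleanly.

```python
import heapq


def completion_times(durations, workers):
    """Finish time of each request when served FIFO by `workers` parallel slots."""
    free_at = [0.0] * workers  # min-heap of times at which each slot becomes free
    heapq.heapify(free_at)
    finished = []
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + duration
        finished.append(end)
        heapq.heappush(free_at, end)
    return finished


requests = [2.0, 2.0, 2.0, 2.0]  # hypothetical per-request service times in seconds
serial = completion_times(requests, workers=1)
parallel = completion_times(requests, workers=4)
print(serial)    # [2.0, 4.0, 6.0, 8.0]
print(parallel)  # [2.0, 2.0, 2.0, 2.0]
```

With one slot, the last request waits behind the other three. That queueing delay is what a benchmark with concurrency above 1 would measure against a serializing scheduler, which is why the recommendation to benchmark at max_concurrency 1 says something about the scheduler rather than the silicon.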

Eviction Policies: Mostly Not Your Problem on NPU

The standard production LLM concerns about KV cache eviction (LRU, LFU, FIFO, importance-weighted retention) apply to PagedAttention-based serving on CPU/GPU. On NPU's StaticLLMPipeline, the cache is per-request and bounded by MAX_PROMPT_LEN + MIN_RESPONSE_LEN. There's no global cache pool to evict from. Eviction happens when the session ends (finish_chat() or process exit).

This simplifies a lot. It also means you can't share a KV cache across users on NPU the way you would on a server-side GPU deployment. Single-user AI-PC workloads are the design center.

KV Cache Memory Lifecycle

For a long-running agent that cycles through multiple requests (interact with user, call a tool, observe, reason, repeat), KV cache management matters:

    # Pseudocode for agent loop (start_chat(prompt) and generate_next() are schematic)
    model = ov.LLMPipeline(...)
    for i in range(num_steps):
        # Prefill: prompt grows with accumulated observations
        outputs = model.start_chat(accumulated_prompt)  # Allocates KV state

        for j in range(decode_tokens):
            # Decode: uses cached KV
            outputs = model.generate_next()

        # Finish: release KV state
        model.finish_chat()  # Clears the KV buffer

        # Between steps: observations are appended to accumulated_prompt
        # accumulated_prompt grows; KV cache is discarded and recreated on next prefill

At each start_chat(), a fresh KV allocation is made. If your accumulated prompt has grown to 2K tokens, the KV allocation is 2K-sized and you're committed to that footprint until finish_chat(). If the next step's prompt is 3K tokens, a new 3K allocation is made.

For long-running agents, this means you can't accumulate unbounded history within a single KV buffer; you have to either:

• Truncate the context window (recent-only history, myopic agent)
• Use external long-term memory (vector store) and retrieve into fresh prefill (stateless from KV perspective, but stateful in application logic)
• Use sliding-window KV (drop oldest tokens, recompute if needed)
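The first option, truncating to recent history, is the simplest to sketch. The helper below keeps the newest turns that fit a token budget; the function name is hypothetical, and the whitespace-based token count is a stand-in for a real tokenizer.

```python
def fit_history(turns, budget, count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns whose combined token count fits the budget."""
    kept = []
    used = 0
    for turn in reversed(turns):       # walk from newest to oldest
        cost = count_tokens(turn)
        if used + cost > budget:
            break                      # the next-oldest turn would overflow
        kept.append(turn)
        used += cost
    kept.reverse()                     # restore chronological order
    return kept, used


turns = [
    "system: translate politely",
    "user: bonjour",
    "tool: lookup ok",
    "user: merci beaucoup",
]
kept, used = fit_history(turns, budget=8)
print(kept)  # the three most recent turns
print(used)  # 8
```

In this sketch the system prompt is the first thing to fall out of the window. A real agent would likely pin it and truncate only the turns after it, which also keeps the prompt prefix stable for any prefix cache.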

Context Management for M2M-100 Translation

M2M-100 is an encoder-decoder, so the KV lifecycle is:

• Encoder prefill: source text is encoded once; encoder KV is computed and held for the entire decode phase
• Decoder decode: new target tokens are generated; decoder self-attention KV grows, cross-attention KV is reused from the encoder

The encoder KV doesn't get reused across multiple different source sentences; it's specific to that encode-decode pair. If you have a batch of translation requests, each one brings its own encoder KV. This is why batching M2M-100 (or any seq2seq) is awkward on NPU: you can't trivially share encoder KV across different inputs.

For sentence-level English-to-French translation, encoder input rarely exceeds 64 tokens and decoder output rarely exceeds 96. With MAX_PROMPT_LEN=128 and MIN_RESPONSE_LEN=128, the entire context budget fits comfortably under any NPU's static-shape envelope, on any generation. Chunked prefill, prefix caching, the 8K ceiling: none of it matters for sentence MT.

The discussion belongs in the chapter because:

• M2M-100 generalizes to document-level translation at T = 1K–2K, where the constraints start to bite
• The constraints transfer directly to other agentic seq2seq workloads (summarization, retrieval-augmented translation, ASR-translation pipelines) where context grows
• The OpenVINO version compatibility and configuration story applies to every NPU-served LLM, not just translation. If you also run Phi-3.5-mini for an "explain this translation" tool (the worked example in Chapter 5.2), all of the above applies

For the worked example, the practical answer is "configure for short context, don't overthink it." For your real agent, the practical answer might be different, and now you have the levers.
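The encoder-decoder KV lifecycle for a single translation request can be sketched as a small bookkeeping class. It counts cached positions only and holds no tensors; the class name is hypothetical, and the 64-token source length is an illustrative value in line with the sentence-level figures in this section.

```python
from dataclasses import dataclass


@dataclass
class Seq2SeqKVState:
    """Counts cached positions for one translation request (no real tensors)."""
    cross_positions: int = 0   # encoder-derived K/V, fixed after prefill
    self_positions: int = 0    # decoder self-attention K/V, grows per token

    def encoder_prefill(self, source_len):
        # A new source sentence replaces the cross-attention KV entirely.
        self.cross_positions = source_len
        self.self_positions = 0

    def decode_step(self):
        # Each generated token adds one self-attention position.
        self.self_positions += 1
        return self.cross_positions + self.self_positions


state = Seq2SeqKVState()
state.encoder_prefill(source_len=64)
totals = [state.decode_step() for _ in range(3)]
print(totals)  # [65, 66, 67]

state.encoder_prefill(source_len=40)  # next sentence: nothing carries over
print(state.cross_positions, state.self_positions)  # 40 0
```

The reset in encoder_prefill is the point: each request brings its own cross-attention KV, so there is no shared state to reuse between different source sentences.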

            What This Section Bought You

            You should now understand:

• Stateful KV caching via start_chat() / finish_chat() amortizes prefill cost across decode steps; on NPU the runtime auto-aligns the cached state to the new sequence on each turn
• Three independent caching layers: model caching (CACHE_DIR, compiled bytecode), KV cache (stateful variables, session state), prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING, shared prefix KV)
• KV cache is kept at FP16+, even when weights are INT4, for numerical stability
• Greedy-only on NPU's classic static pipeline; beam search is on the iGPU or CPU, not NPU
• Prefix caching means different things on CPU/GPU vs NPU through OVMS; the same flag, different mechanisms
• The 8K context "limit" is a validated preview, not a hard ceiling; query MAX_PROMPT_LEN at runtime
• The "sequential execution" claim is an OVMS scheduling policy, not an NPU hardware limit; direct Runtime use can submit multiple async InferRequests
• KV cache allocation commits to context length at start_chat() time; unbounded history requires external memory
• M2M-100's encoder KV is per-request, not shared across requests; this is why seq2seq batching is complex
• Long-term agent memory lives outside the model; KV cache is working memory only
• For sentence-level M2M-100 translation, none of this matters; the constraints bind at document-level and on other agentic workloads

The next section closes Chapter 2 by moving from state to decision-making in the agent's reasoning loop: given a working M2M-100 + NPU setup with bounded context and bounded KV cache, what does a reasoning loop cost, how do you bound it, and what architectures actually work?

