2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing
The previous section treated the KV cache as a memory cost. This section treats it as a resource you can engineer. Done well, cache reuse across turns and prefix sharing across sessions can cut prefill latency by 5–10x or more. Done poorly, you end up recomputing the same thing over and over while telling the user "Thinking…".
The Cache is Already There — Don't Throw It Away
Here's the most common waste in agent implementations: a multi-turn conversation where each turn re-tokenizes and re-prefills the entire history.
Turn 1 prompt: [system] [tools] [user_msg_1] → prefill → generate
Turn 2 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] → re-prefill the whole thing → generate
Turn 3 prompt: [system] [tools] [user_msg_1] [assistant_1] [user_msg_2] [assistant_2] [user_msg_3] → re-prefill again →...
By turn 5, you're spending most of your TTFT recomputing context whose KV values you had perfectly valid copies of moments ago. On an NPU where prefill is compute-bound and runs at perhaps 100–500 tokens/second, this is brutal.
The fix is cache persistence: keep the KV cache resident across turns, and only prefill the new tokens at the end. This is sometimes called session caching or conversational caching.
The savings:
| Scenario | Without session cache | With session cache |
|---|---|---|
| Turn 5, 4K tokens of history | Prefill ~4,000 history + ~50 new ≈ 4,050 tokens | Prefill ~50 new tokens |
| TTFT at 200 tok/s prefill | ~20 seconds | ~250 ms |
This is the single biggest latency win available in most NPU agent implementations, and most teams discover they need it after their first round of user testing.
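The bookkeeping behind session caching can be sketched in a few lines. This is a minimal illustration, not any runtime's API: `SessionCache`, `plan_prefill`, `commit`, and `ttft_seconds` are hypothetical names, and the real KV tensors would live inside your NPU runtime. The sketch only tracks which token ids already have cache entries, so the caller knows which suffix still needs prefill.

```python
from dataclasses import dataclass, field


@dataclass
class SessionCache:
    """Tracks which token ids already have KV entries resident for one conversation."""

    cached_tokens: list[int] = field(default_factory=list)

    def plan_prefill(self, prompt_tokens: list[int]) -> list[int]:
        """Return only the suffix of the prompt that still needs prefill."""
        # Length of the longest common prefix between the cache and the new prompt.
        shared = 0
        for old, new in zip(self.cached_tokens, prompt_tokens):
            if old != new:
                break
            shared += 1
        # Entries past the shared prefix are stale (for example, history was edited).
        self.cached_tokens = self.cached_tokens[:shared]
        return prompt_tokens[shared:]

    def commit(self, new_tokens: list[int]) -> None:
        """Record tokens whose KV entries were just written (prefill or decode)."""
        self.cached_tokens.extend(new_tokens)


def ttft_seconds(tokens_to_prefill: int, prefill_tok_per_s: float) -> float:
    """Prefill-only TTFT estimate; ignores tokenization and cache overhead."""
    return tokens_to_prefill / prefill_tok_per_s


# Usage: 4,000 tokens of history already cached, 50 new tokens arrive at turn 5.
cache = SessionCache()
history = list(range(4000))
cache.commit(cache.plan_prefill(history))
turn5_prompt = history + list(range(9000, 9050))
todo = cache.plan_prefill(turn5_prompt)
print(len(todo), ttft_seconds(len(todo), 200.0))  # 50 tokens, 0.25 s
print(ttft_seconds(len(turn5_prompt), 200.0))     # 20.25 s without the cache
```

Truncating on the first mismatch is a deliberate choice: if any earlier token changes, every KV entry after it is invalid, so the safe default is to re-prefill from that point.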
What Makes This Tricky on an NPU
If session caching is so obviously good, why isn't it the default? Because NPU runtimes often make it hard:
- Fixed-shape graphs: many NPU compilers want to compile the model for a specific input shape (sequence length, batch size). A growing cache means the shape changes between calls, which can trigger recompilation or fall back to CPU.
- Static memory allocation: NPU memory is often pre-allocated in fixed blocks. A cache that grows arbitrarily doesn't fit that pattern.
- Pre-padded buffers: the workaround is to pre-allocate the cache at maximum size and use attention masks to ignore unused slots. This wastes memory for short sessions.
- No direct cache access: some NPU runtimes expose model inference but not the underlying cache tensors, making external persistence impossible without runtime patches.
The state of the art varies by platform. Core ML's stateful prediction APIs, ONNX Runtime's IO binding (RunWithBinding), and OpenVINO's state API for stateful models are the kinds of mechanisms you'll need to use. None of them is plug-and-play; expect to read the runtime documentation carefully and test on your specific NPU before assuming it works.
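The pre-padded buffer workaround described above can be sketched without any runtime at all. The class below is a hypothetical illustration (`PaddedKVBuffer` is not part of Core ML, ONNX Runtime, or OpenVINO): it keeps a fixed number of slots so shapes never change between calls, and produces a mask that tells attention which slots are valid. Plain Python objects stand in for device tensors.

```python
class PaddedKVBuffer:
    """Fixed-capacity KV slots plus a mask, so shapes never change between calls."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        # Placeholder slots; a real runtime would hold pre-allocated device tensors.
        self.slots = [None] * max_tokens
        self.length = 0  # number of valid slots

    def append(self, kv_entries: list) -> None:
        """Write new entries into the next free slots."""
        if self.length + len(kv_entries) > self.max_tokens:
            raise OverflowError("cache full: evict before appending")
        for entry in kv_entries:
            self.slots[self.length] = entry
            self.length += 1

    def attention_mask(self) -> list[int]:
        """1 for valid slots, 0 for padding; always max_tokens long."""
        return [1] * self.length + [0] * (self.max_tokens - self.length)

    def wasted_fraction(self) -> float:
        """Share of the pre-allocated buffer that currently holds padding."""
        return 1.0 - self.length / self.max_tokens


buf = PaddedKVBuffer(max_tokens=8)
buf.append(["kv0", "kv1", "kv2"])
print(buf.attention_mask())   # [1, 1, 1, 0, 0, 0, 0, 0]
print(buf.wasted_fraction())  # 0.625
```

The `wasted_fraction` figure is the cost of this approach for short sessions: the memory is reserved whether or not the conversation ever fills it.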
Prefix Caching Across Sessions
Session caching reuses the cache within one conversation. Prefix caching reuses it across conversations, sharing the KV values of common prefixes.
The most common prefix in an agent is the system prompt plus tool definitions. These are often 500–2000 tokens and identical across every session. Prefilling them on every cold start is pure waste — they never change between users.
A prefix cache stores the precomputed KV tensors for that fixed prefix and reattaches them when a new session begins. The cost is one prefill at deployment time, paid back across every session that follows.
This is harder to implement than session caching because:
- The cache tensors are large (often hundreds of MB, depending on model size and prefix length) and must persist to disk
- They're tied to the exact model and exact tokenization — a tokenizer change invalidates the cache
- Loading them must be faster than recomputing them, which is not automatic on slow storage
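One way to handle the invalidation problem in the list above is to derive the cache key from everything the KV tensors depend on. The sketch below is an assumption about how you might do that, not a prescribed scheme: `prefix_cache_key`, `model_id`, and `tokenizer_id` are hypothetical names, and the identifiers should be whatever uniquely pins your model weights and tokenizer build.

```python
import hashlib
import struct


def prefix_cache_key(model_id: str, tokenizer_id: str, prefix_tokens: list[int]) -> str:
    """Hash the model, tokenizer, and exact token ids into one cache key.

    Any change to the weights identifier, the tokenizer identifier, or a
    single token id yields a different key, so a stale cache is never reused.
    Token ids are assumed to be non-negative and to fit in 32 bits.
    """
    h = hashlib.sha256()
    h.update(model_id.encode("utf-8") + b"\0")
    h.update(tokenizer_id.encode("utf-8") + b"\0")
    h.update(struct.pack(f"<{len(prefix_tokens)}I", *prefix_tokens))
    return h.hexdigest()


# Usage: the key doubles as a file name for the persisted tensors.
key = prefix_cache_key("my-model-q4-v1", "my-tokenizer-v1", [1, 523, 88, 4096])
print(f"prefix_kv_{key[:16]}.bin")
```

Hashing token ids rather than the prompt string means a tokenizer change that alters the ids invalidates the entry even if the tokenizer identifier was not bumped.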
When it works, prefix caching takes cold-start TTFT from "wait two seconds before the first character appears" to "respond instantly." For chat-style agents, that crosses a real perceptual threshold.
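Whether loading actually beats recomputing is simple arithmetic worth doing before you build anything. The sketch below uses the standard KV size estimate (keys plus values, per layer, per KV head). The model dimensions and storage speeds in the usage lines are illustrative assumptions, not measurements from any particular device.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of the KV cache for `tokens` positions; the factor 2 is keys plus values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def load_beats_recompute(tokens: int, cache_bytes: int,
                         storage_bytes_per_s: float, prefill_tok_per_s: float) -> bool:
    """True if reading the cache from storage is faster than prefilling again."""
    load_s = cache_bytes / storage_bytes_per_s
    recompute_s = tokens / prefill_tok_per_s
    return load_s < recompute_s


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16, 2,000-token prefix.
size = kv_cache_bytes(tokens=2000, layers=32, kv_heads=8, head_dim=128)
print(size)  # 262144000 bytes, roughly 250 MiB

# At 200 tok/s prefill, recomputing takes 10 s.
print(load_beats_recompute(2000, size, 500e6, 200.0))  # True: about 0.5 s to load
print(load_beats_recompute(2000, size, 20e6, 200.0))   # False: about 13 s to load
```

The second case is the slow-storage trap: on a sufficiently slow flash part, the cache file loses to simply prefilling again.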
Eviction: When the Cache Won't Fit
In a long session, the cache eventually grows past your memory budget. You need an eviction policy. The naive choice — drop the oldest tokens — usually works worse than people expect, because of the attention sink phenomenon mentioned in 2.1: the first few tokens carry disproportionate weight, and dropping them degrades output quality.
Better policies in roughly increasing order of complexity:
Sliding window with sinks: keep the first N tokens (the "sink") and the most recent M tokens, drop everything in between. Simple, effective for chat, and easy to implement.
Importance-based eviction: track per-token attention scores during generation and preferentially keep tokens that other tokens attended to. More accurate, but adds bookkeeping overhead and is harder to vectorize on the NPU.
Hierarchical summarization: when older context overflows, summarize it into a few tokens and use that summary going forward. This blends caching with the agent's own memory system, which we'll come back to in 2.3.
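The first two policies reduce to choosing which token positions to keep. The sketch below shows that selection step only, assuming positions are plain integer indices into the cache. `sink_window_keep` and `importance_keep` are hypothetical helper names, and the importance scores are assumed to be accumulated attention weights that you collect yourself during generation.

```python
def sink_window_keep(num_tokens: int, sink: int, window: int) -> list[int]:
    """Keep the first `sink` positions and the most recent `window` positions."""
    if num_tokens <= sink + window:
        return list(range(num_tokens))  # everything fits, nothing to drop
    return list(range(sink)) + list(range(num_tokens - window, num_tokens))


def importance_keep(scores: list[float], sink: int, budget: int) -> list[int]:
    """Keep the sink plus the highest-scoring remaining positions, up to `budget`."""
    n = len(scores)
    if n <= budget:
        return list(range(n))
    keep = set(range(min(sink, budget)))
    # Rank non-sink positions by accumulated attention, highest first.
    ranked = sorted(range(sink, n), key=lambda i: scores[i], reverse=True)
    keep.update(ranked[: budget - len(keep)])
    return sorted(keep)


print(sink_window_keep(num_tokens=10, sink=2, window=3))
# [0, 1, 7, 8, 9]
print(importance_keep([0.1, 0.1, 0.9, 0.0, 0.5, 0.2], sink=1, budget=3))
# [0, 2, 4]
```

Both functions return sorted positions so the surviving entries stay in their original order when you compact the cache.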
Whatever you choose, make eviction explicit and logged. Silent eviction produces mysterious quality regressions: the agent suddenly forgets something it said three turns ago and the user blames the model.
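Making eviction explicit can be as small as routing every eviction through one function that logs what was dropped. This is a minimal sketch using Python's standard `logging` module. The function name, the logger name `kv_cache`, and the log fields are illustrative choices, not a convention from any runtime.

```python
import logging

logger = logging.getLogger("kv_cache")


def evict_with_log(positions: list[int], keep: list[int],
                   session_id: str, policy: str) -> tuple[list[int], list[int]]:
    """Split positions into (kept, dropped) and log a record whenever anything is dropped."""
    keep_set = set(keep)
    kept = [p for p in positions if p in keep_set]
    dropped = [p for p in positions if p not in keep_set]
    if dropped:
        logger.warning(
            "kv_evict session=%s policy=%s dropped=%d kept=%d first=%d last=%d",
            session_id, policy, len(dropped), len(kept), dropped[0], dropped[-1],
        )
    return kept, dropped


kept, dropped = evict_with_log(list(range(10)), [0, 1, 7, 8, 9], "s1", "sink_window")
print(kept, dropped)  # [0, 1, 7, 8, 9] [2, 3, 4, 5, 6]
```

With a record like this in your logs, a report that the agent forgot something from three turns ago can be matched against the exact turn where those positions were dropped.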
When You Have Multiple Agents in One Process
If your application runs more than one agent — say, a small classifier on the cascade pattern plus a larger reasoning model — they compete for the same NPU memory. The cache management becomes multi-tenant:
- Cold-swap: only one model resident at a time, swap in the other on demand. Simple, but you pay model-load latency on every switch.
- Warm-coexist: both models resident, sharing memory budget. Faster switching but tighter constraints on each model's working set.
- Time-share with checkpointing: serialize the inactive model's state to host memory, restore on switch. Hybrid approach used when the NPU can't hold both models simultaneously.
Most edge SoCs can only run one large model at a time on the NPU. If your architecture needs two, design the orchestration around model swaps from the start rather than discovering the limit during integration testing.
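The cold-swap option is the simplest of the three to sketch. The manager below is a hypothetical illustration: `ColdSwapManager` is an invented name, and the `loaders` and `unload` callables stand in for whatever load and release calls your runtime provides. It guarantees at most one resident model and counts swaps so you can see how often you pay the load latency.

```python
from typing import Callable


class ColdSwapManager:
    """Keeps at most one model resident; swaps on demand and counts the swaps."""

    def __init__(self, loaders: dict[str, Callable[[], object]],
                 unload: Callable[[object], None]):
        self._loaders = loaders
        self._unload = unload
        self._resident_name = None
        self._resident = None
        self.swap_count = 0

    def acquire(self, name: str) -> object:
        """Return the named model, loading it (and evicting the other) if needed."""
        if name == self._resident_name:
            return self._resident
        loader = self._loaders[name]  # fail before unloading if the name is unknown
        if self._resident is not None:
            self._unload(self._resident)
            self._resident, self._resident_name = None, None
        self._resident = loader()
        self._resident_name = name
        self.swap_count += 1
        return self._resident


# Usage with stand-in loaders; real ones would call into the NPU runtime.
mgr = ColdSwapManager(
    loaders={"classifier": lambda: "clf-model", "reasoner": lambda: "llm-model"},
    unload=lambda model: None,
)
mgr.acquire("classifier")
mgr.acquire("classifier")  # already resident, no swap
mgr.acquire("reasoner")    # unloads the classifier first
print(mgr.swap_count)      # 2
```

Tracking `swap_count` per session is a cheap way to find out whether your cascade is thrashing between models.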
A Diagnostic for Cache Health
If you've been profiling well, you can answer these questions about your agent right now:
- What percentage of TTFT goes to prefill compute vs. tokenization vs. model load vs. cache management overhead?
- How does TTFT scale with turn number in a typical conversation? If turn 10 takes 3x as long as turn 1, you don't have effective session caching.
- If you use prefix caching, what's its hit rate? Below 80% means your prefix isn't actually fixed and you're paying for a cache that doesn't help.
- At what session length does memory pressure trigger eviction, and what does your agent's quality look like just after that boundary?
If you can't answer these, instrument the runtime until you can. The cache is invisible by default, which is exactly why it's where the easy wins hide.
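Two of the questions above turn directly into checks you can run over logged measurements. The sketch below assumes you already record per-turn TTFT and prefix cache hits and misses. The function names are hypothetical, and the default thresholds (3x growth by turn 10, 80% hit rate) come from the rules of thumb in the list.

```python
def ttft_growth(ttft_by_turn: list[float], early: int = 1, late: int = 10) -> float:
    """Ratio of TTFT at turn `late` to TTFT at turn `early` (turns are 1-indexed)."""
    return ttft_by_turn[late - 1] / ttft_by_turn[early - 1]


def prefix_hit_rate(hits: int, misses: int) -> float:
    """Fraction of session starts that reused the stored prefix cache."""
    total = hits + misses
    return hits / total if total else 0.0


def cache_health(ttft_by_turn: list[float], hits: int, misses: int,
                 growth_limit: float = 3.0, min_hit_rate: float = 0.8) -> list[str]:
    """Return human-readable warnings; an empty list means both checks passed."""
    warnings = []
    if len(ttft_by_turn) >= 10 and ttft_growth(ttft_by_turn) >= growth_limit:
        warnings.append("TTFT grows with turn number: session caching is not effective")
    if hits + misses > 0 and prefix_hit_rate(hits, misses) < min_hit_rate:
        warnings.append("prefix cache hit rate is low: the prefix is not actually fixed")
    return warnings


# Usage with made-up measurements (seconds per turn, then hit and miss counts).
print(cache_health([0.5] * 9 + [1.5], hits=70, misses=30))
print(cache_health([0.4] * 10, hits=95, misses=5))  # []
```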
What This Section Bought You
Cache engineering is one of those areas where the difference between "we set up an agent" and "we shipped an agent" is most visible.
- Session caching across turns is the single highest-leverage latency optimization available
- Prefix caching for fixed system prompts and tool definitions eliminates cold-start prefill waste
- NPU runtime constraints make this harder than it sounds — pre-allocated buffers, fixed shapes, and limited cache access are common obstacles
- Eviction needs to be deliberate — keep sinks, log evictions, and design eviction policy alongside the rest of the agent
With memory and cache engineering in hand, we can finally turn to the part most people think of as "the agent": the reasoning loop, tool selection, and how decisions get made within these tight constraints.