1.3 Latency, Throughput, and Hardware-Aware Patterns
Architecture and constraints set the rules. Performance is what your users actually feel. This section is about the latency profile of Intel NPU specifically — what the published numbers say, what the structure of those numbers implies for agent design, and where the gaps in the public record sit.
TTFT and ITL on Intel Core NPU
The two numbers that matter for an interactive agent are Time To First Token (TTFT), how long the user waits before anything appears, and Inter-Token Latency (ITL), also called per-token decode latency, which sets how fast text streams once generation starts. They sit in different regimes: TTFT is compute-bound (matmul-heavy prefill over the full prompt); ITL is memory-bandwidth-bound (one forward pass per token, with every weight streamed from DRAM each time).
Two Intel-published anchor benchmarks, both worth memorizing:
DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU, from the OpenVINO Model Hub (Feb 2025): 6.10 tok/s decode, 163.10 ms per-token latency. The same model on the same SoC's iGPU reaches 12.80 tok/s. The iGPU is 2.1× faster than the NPU for 8B INT4 decode. Intel's CES 2026 marketing claim that Panther Lake NPU beats Jetson Orin AGX on DeepSeek-Llama-8B first-token latency is comparative-only; absolute milliseconds are not published.
Llama 2 7B on Core Ultra Series 2 NPU, from MLPerf Client v0.6: TTFT 1.09 s, throughput 18.55 tok/s. The roughly 3× gap between this figure and DeepSeek's 6.10 tok/s on the same hardware class likely reflects model- and stack-specific differences (a 7B Llama 2 rather than an 8B distill, a more recent driver, possibly a different KV quantization configuration). The conservative 6.10 tok/s figure is the better anchor for reasoning-model workloads.
Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.
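Measuring both numbers on your own hardware takes only a timestamp per token. The sketch below is one minimal way to do it, assuming your runtime exposes generation as a lazy Python iterable of tokens (the helper name `measure_stream` and the injectable `clock` parameter are illustrative, not part of any library).

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a token stream and report TTFT, mean ITL, and decode tok/s."""
    start = clock()
    # One timestamp per token, taken the moment the token arrives.
    stamps = [clock() for _ in tokens]
    if not stamps:
        raise ValueError("stream produced no tokens")

    ttft = stamps[0] - start
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return {
        "ttft_s": ttft,
        "mean_itl_s": mean_itl,
        "decode_tok_s": 1.0 / mean_itl if mean_itl > 0 else float("inf"),
        "tokens": len(stamps),
    }
```

The timer starts when the function is called, so the stream should begin generating only when it is first iterated; otherwise TTFT will be understated. Run it on a warm model (after the first compile) and on prompts of realistic length, since TTFT grows with prompt size while ITL mostly does not.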
The TTFT-vs-ITL Distinction
Why does the regime split matter? Because the optimization techniques are different.
For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Lunar Lake NPU reports 1.09 s.
For ITL, every token requires streaming the entire weight tensor through the MAC array once. At roughly 4 GB of INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical upper bound on decode is 136.5 / 4 ≈ 34 tok/s. The observed 6.10 tok/s is about 18% of that ceiling; the rest is eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. Quantization does not lift the bandwidth ceiling itself. It only shrinks the bytes streamed per token: going from INT8 to INT4 halves the weight traffic, which roughly halves decode latency.
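The arithmetic above is simple enough to keep as a two-line helper, which makes it easy to redo for a different model size or memory system. This is a back-of-envelope sketch under the same assumption as the text (every decoded token streams all weights exactly once, nothing else touches DRAM); the function names are illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode rate if each token streams all weights once."""
    return bandwidth_gb_s / weight_gb


def ceiling_utilization(observed_tok_s: float, bandwidth_gb_s: float, weight_gb: float) -> float:
    """Fraction of the bandwidth ceiling that an observed decode rate achieves."""
    return observed_tok_s / decode_ceiling_tok_s(bandwidth_gb_s, weight_gb)


# Figures from the text: 136.5 GB/s LPDDR5X, ~4 GB of INT4 weights, 6.10 tok/s observed.
ceiling = decode_ceiling_tok_s(136.5, 4.0)
used = ceiling_utilization(6.10, 136.5, 4.0)
print(f"ceiling ~ {ceiling:.1f} tok/s, observed uses {used:.0%} of it")
```

Doubling `weight_gb` to model INT8 weights halves the ceiling, which is the sense in which INT4 "roughly halves decode latency relative to INT8".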
The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.
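To see what that budget means for a user, convert it into wall-clock time for a full response. The sketch below assumes a simple model: one TTFT, then the remaining tokens at a steady decode rate. The 300-token output length is a hypothetical example, and reusing the 1.09 s Llama 2 TTFT for both ends of the range is an assumption made only for illustration.

```python
def response_time_s(ttft_s: float, decode_tok_s: float, output_tokens: int) -> float:
    """Estimate wall-clock time: first token, then the rest at the decode rate."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) / decode_tok_s


# Hypothetical 300-token reasoning trace at both ends of the 6-20 tok/s budget.
for rate in (6.0, 20.0):
    print(f"{rate:>4.0f} tok/s -> {response_time_s(1.09, rate, 300):.1f} s")
```

Under these assumptions the same trace takes about 51 s at 6 tok/s and about 16 s at 20 tok/s, which is why output length, not prompt length, tends to dominate perceived latency for reasoning-style workloads.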
Cold Start
Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is:
| Class | Cold compile (no blob) | Warm import (cached) |
|---|---|---|
| Small CV classifier | <1 s | ~100 ms |
| Whisper / MusicGen / Demucs | 10–30 s (Audacity docs) | 1–3 s |
| 3B–8B LLM INT4 | 30 s to several minutes (IPEX-LLM quickstart) | <3 s (Markaicode) |
The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token is generated." That's the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the Level Zero context to eliminate an in-memory copy.
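Enabling the cache is a one-line configuration change. The sketch below wraps the compile call so you can time the cold and warm paths yourself; it assumes an OpenVINO `Core` object is passed in, and the model path and cache directory shown in the usage comment are hypothetical.

```python
import time
from pathlib import Path


def compile_with_cache(core, model_path, cache_dir, device="NPU"):
    """Compile with model caching enabled; return (compiled_model, seconds)."""
    # Make sure the cache directory exists before handing it to the runtime.
    Path(cache_dir).mkdir(parents=True, exist_ok=True)
    start = time.perf_counter()
    compiled = core.compile_model(model_path, device, {"CACHE_DIR": str(cache_dir)})
    return compiled, time.perf_counter() - start


# Usage, assuming OpenVINO is installed and "model.xml" is your exported IR:
#   import openvino as ov
#   core = ov.Core()
#   _, cold_s = compile_with_cache(core, "model.xml", "./ov_cache")  # compiles, writes cache
#   _, warm_s = compile_with_cache(core, "model.xml", "./ov_cache")  # should import from cache
```

Comparing `cold_s` and `warm_s` on your own model is the quickest way to find out which row of the table above you are actually in.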
For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.
The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.
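One way to apply the Audacity pattern is to check for a cached blob before loading and, if none exists, show the user an honest estimate. The sketch below is a minimal version of that check. It assumes cached models appear as `*.blob` files directly inside the cache directory, which you should verify against your OpenVINO version; the function name and message wording are illustrative.

```python
from pathlib import Path
from typing import Optional


def first_run_notice(
    cache_dir: str,
    estimate: str = "30 seconds to several minutes",
) -> Optional[str]:
    """Return a message for the user if no cached blob exists yet, else None."""
    cache = Path(cache_dir)
    # Assumption: a populated cache contains at least one *.blob file.
    has_blob = cache.is_dir() and any(cache.glob("*.blob"))
    if has_blob:
        return None
    return (
        "Preparing the model for first use. "
        f"This can take {estimate}; later runs start much faster."
    )
```

Call it before the first compile and surface the returned string in whatever progress UI the application already has. Pass a tighter `estimate` (for example "10 to 30 seconds") for media-class models.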
The Cascade Pattern
The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel — cascades exist in cloud serving too — but the Intel single-die integration makes the device-routing version of the pattern especially natural.
The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: Qwen3-8B INT4 target on iGPU, Qwen3-0.6B INT8 draft on the same iGPU, 1.3–1.4× speedup via speculative decoding for a smolagents-based reasoning agent. Intel motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes:
- Small-NPU + Big-iGPU: cheap classification or routing on NPU (5–20 ms per call, sustained low power), heavy generation on iGPU when the agent decides it's needed
- Small-NPU draft + Big-NPU target (speculative decoding): the small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet
- Big-NPU prefill + Big-CPU decode: the Phi Silica pattern. NPU eats the compute-bound prompt; CPU streams the decode, reusing the NPU's KV cache
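The first variant in the list reduces to a few lines of control flow. The sketch below shows the shape of a confidence-gated cascade, assuming the small model returns an answer together with a confidence score in [0, 1]. Both callables and the 0.8 threshold are hypothetical placeholders for whatever models and calibration you actually use.

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    small: Callable[[str], Tuple[str, float]],
    large: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Answer with the small model when it is confident; otherwise escalate.

    Returns (answer, tier) so callers can log how often escalation happens.
    """
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer, "small"
    return large(prompt), "large"
```

In the Small-NPU + Big-iGPU arrangement, `small` would wrap a classifier compiled for the NPU and `large` a generator compiled for the iGPU. Logging the returned tier tells you whether the cheap path really is the common case.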
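The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime picks the highest-priority device in the list that is available and able to run the model, and falls back to the next one otherwise. Note that AUTO chooses a device for the model as a whole; splitting a single graph across devices by operator support is what OpenVINO's separate HETERO mode is for.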
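If you would rather not hard-code devices that a given machine lacks, the priority string can be built from what the runtime reports. The helper below is a small sketch of that idea; its name is illustrative, and it assumes device names follow the `NAME` or `NAME.index` form (for example `GPU.0`).

```python
from typing import Iterable, Sequence


def auto_device_string(
    available: Iterable[str],
    priority: Sequence[str] = ("NPU", "GPU", "CPU"),
) -> str:
    """Build an AUTO priority string from the devices that are actually present."""
    # Collapse indexed names such as "GPU.0" and "GPU.1" to their base name.
    present = {name.split(".")[0] for name in available}
    chosen = [dev for dev in priority if dev in present]
    if not chosen:
        raise RuntimeError("none of the preferred devices are available")
    return "AUTO:" + ",".join(chosen)


# Usage, assuming OpenVINO is installed:
#   import openvino as ov
#   core = ov.Core()
#   compiled = core.compile_model("model.xml", auto_device_string(core.available_devices))
```

On a machine without an NPU this yields a string such as `AUTO:GPU,CPU`, so the same code path works across SKUs.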
Phi Silica as the Canonical Reference
The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.
What matters for this book is the architecture, which is exactly what we're recommending for M2M-100:
- Tokenizer, embedding, and LM head on CPU — these are lookup-bound or have shapes the NPU dislikes
- Transformer block on NPU — sustained matmul, the NPU's sweet spot
- KV cache held in CPU memory via a sliding window with N=64, escaping the static-shape constraint
- Long prompts decomposed into 64-token chunks for prefill, an early form of chunked prefill
- Speculative decoding with a smaller draft model amplifying NPU throughput
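The 64-token prefill decomposition in the list above can be sketched in a few lines. This is an illustration of the chunking idea only, not Microsoft's implementation: right-padding the final chunk to keep a static shape is an assumption, and `pad_id=0` is a hypothetical placeholder for your tokenizer's real padding token.

```python
from typing import Iterator, List, Sequence

CHUNK = 64  # chunk size taken from the text above


def prefill_chunks(token_ids: Sequence[int], size: int = CHUNK) -> Iterator[List[int]]:
    """Yield the prompt in fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(token_ids), size):
        yield list(token_ids[start:start + size])


def pad_chunk(chunk: List[int], size: int = CHUNK, pad_id: int = 0) -> List[int]:
    """Right-pad a short chunk so every NPU call sees the same static shape."""
    return chunk + [pad_id] * (size - len(chunk))


# A 150-token prompt becomes three NPU calls of identical shape.
prompt = list(range(1, 151))
batches = [pad_chunk(c) for c in prefill_chunks(prompt)]
print([len(b) for b in batches])
```

A real pipeline also needs an attention mask so the padded positions are ignored, and must carry the KV cache from one chunk to the next; both are omitted here for brevity.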
Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation — Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.
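Template routing is a dictionary lookup, which is the point: no model call is spent deciding what to do. The sketch below shows the shape of it. The action names and template wording are hypothetical examples, not the prompts Click to Do actually uses.

```python
# Hypothetical templates keyed by the UI action the user picked.
TEMPLATES = {
    "summarize": "Summarize the following text in three sentences:\n\n{text}",
    "rewrite": "Rewrite the following text to be clearer:\n\n{text}",
}


def build_prompt(action: str, text: str) -> str:
    """Select a fixed template by UI action and fill in the user's text."""
    try:
        template = TEMPLATES[action]
    except KeyError:
        raise ValueError(f"unsupported action: {action!r}") from None
    return template.format(text=text)
```

Because the set of actions is closed, an unsupported action fails fast in application code rather than being handed to the model to interpret.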
What Hasn't Been Published
Honest gaps that should color how confidently you cite numbers in this book:
- No Intel-published Phi Silica numbers on Intel hardware. All Phi Silica metrics in circulation are from Snapdragon X Elite. Phi Silica reached Intel Copilot+ PCs through Windows Updates (KB5079266, KB5084176, KB5089866) during 2025, but the comparative performance data isn't in the public record.
- No published TTFT for DeepSeek-Distill-Llama-8B on Core Ultra Series 3; the CES 2026 claim is comparative-only against Jetson Orin AGX.
- No published M2M-100-on-NPU performance numbers of any kind — no tok/s, no TTFT, no memory footprint. M2M-100 is not in any OpenVINO Model Hub NPU benchmark.
- No published quantitative Phi-3-mini-on-NPU numbers from Intel/Hugging Face, despite multiple how-to walkthroughs.
- No published agent-loop or ReAct-loop latency benchmarks on Intel NPU. The estimates we'll produce in Chapter 2.3 are extrapolations from the two anchor benchmarks above, presented as such.
If you encounter precise numbers that aren't among the anchor benchmarks at the top of this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.
What This Section Bought You
You should now understand:
- TTFT is compute-bound, ITL is memory-bandwidth-bound — different regimes, different optimizations
- The Lunar Lake decode ceiling is ~34 tok/s theoretical for an 8B INT4 model; observed is ~6 tok/s, eaten by overhead
- iGPU decode is 2.1× faster than NPU decode on the same Core Ultra SoC for 8B models; the NPU's win is performance per watt, not speed
- Cold start is dominated by first compile: 10–30 s for media models, minutes for LLMs without CACHE_DIR
- The cascade pattern is the Intel-native agent architecture: small-on-NPU + big-on-iGPU, or speculative decoding within a single device
- Phi Silica is the reference deployment: CPU tokenizer/embedding/LM-head + NPU transformer + CPU decode with KV reuse, all published in the Windows Experience Blog
- Templated prompts beat learned routers for NPU-bound agents — every avoidable NPU call wastes the budget
Chapter 1 ends here. Chapter 2 turns from the model to the agent: given a system that can run M2M-100 on Intel NPU, how do we manage state, context, and decision-making inside the constraints we've now mapped?