1.3 Latency, Throughput, and Hardware-Aware Patterns
Architecture and constraints set the rules. Performance is what your users actually feel. This section is about the latency profile of Intel NPU specifically: what the published numbers say, what the structure of those numbers implies for agent design, and where the gaps in the public record sit.
TTFT and ITL on Intel Core NPU hardware.
The Two Latencies You Actually Care About
Two numbers matter for an interactive agent. They are distinct latency metrics; confuse them and you'll optimize the wrong thing.
Time To First Token (TTFT): how long the user waits before anything appears. It's dominated by the prefill phase: processing the input prompt and warming the KV cache. TTFT is what shapes the user's perception of responsiveness.
Inter-Token Latency (ITL), also called per-token decode latency: how fast text streams after generation starts. These are not the same regime: TTFT is compute-bound (matmul-heavy prefill on the full prompt); ITL is memory-bandwidth-bound (one matmul per token, but every weight has to be streamed from DRAM).
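The two metrics can be separated with nothing more than timestamps around a token stream. The following is a minimal sketch, not a benchmark harness: `measure_stream` and `fake_stream` are hypothetical names, and `fake_stream` only simulates a prefill delay followed by evenly spaced decode steps.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Time one streamed response.

    Returns (ttft_seconds, inter_token_gaps_seconds). The clock starts when
    this function is called, so call it at the moment the request is submitted.
    """
    start = clock()
    ttft = None
    gaps: List[float] = []
    last = start
    for _ in tokens:
        now = clock()
        if ttft is None:
            ttft = now - start       # prefill plus the first decode step
        else:
            gaps.append(now - last)  # one decode step per gap
        last = now
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, gaps


def fake_stream(n_tokens: int, prefill_s: float, step_s: float):
    """Hypothetical stand-in for a model's streaming generator."""
    time.sleep(prefill_s)
    for i in range(n_tokens):
        if i:
            time.sleep(step_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, gaps = measure_stream(fake_stream(5, prefill_s=0.05, step_s=0.01))
    mean_itl = sum(gaps) / len(gaps)
    print(f"TTFT {ttft * 1000:.0f} ms, mean ITL {mean_itl * 1000:.0f} ms")
```

Passing the clock as a parameter keeps the function testable with a fake clock; in a real run the default `time.perf_counter` is enough.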
Two Intel-published anchor benchmarks, both worth memorizing:
DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU, from the OpenVINO Model Hub (Feb 2025): 6.10 tok/s decode, 163.10 ms per-token latency. The same model on the same SoC's iGPU reaches 12.80 tok/s. The iGPU is 2.1× faster than the NPU for 8B INT4 decode. Intel's CES 2026 marketing claim that Panther Lake NPU beats Jetson Orin AGX on DeepSeek-Llama-8B first-token latency is comparative-only; absolute milliseconds are not published.
Llama 2 7B on Core Ultra Series 2 NPU, from MLPerf Client v0.6: TTFT 1.09 s, throughput 18.55 tok/s. The 3× gap between this number and DeepSeek's 6.10 tok/s on the same hardware class reflects model-specific differences (Llama 2 7B vs 8B, more recent driver, possibly different KV quantization configuration). The conservative 6.10 tok/s figure is the better anchor for reasoning-model workloads.
Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.
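To see what those two anchors mean for a user, it helps to turn them into wall-clock time for a whole answer. This is a back-of-envelope sketch only: the function names are made up for illustration, the 200-token answer length is arbitrary, and reusing the Llama 2 7B TTFT of 1.09 s for both rows is an assumption, since no TTFT is quoted above for the 8B model.

```python
def per_token_ms(tok_per_s: float) -> float:
    """Convert a decode rate into milliseconds per token."""
    return 1000.0 / tok_per_s


def response_seconds(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """Rough total time for a streamed answer: prefill, then steady decode."""
    return ttft_s + n_tokens / tok_per_s


# Assumption: the 1.09 s TTFT is borrowed from the Llama 2 7B figure for both rows.
ASSUMED_TTFT_S = 1.09
ANSWER_TOKENS = 200  # arbitrary illustrative answer length

for label, rate in (("8B INT4 anchor", 6.0), ("Llama 2 7B anchor", 18.0)):
    total = response_seconds(ASSUMED_TTFT_S, ANSWER_TOKENS, rate)
    print(f"{label}: {per_token_ms(rate):.0f} ms/token, about {total:.0f} s for {ANSWER_TOKENS} tokens")
```

The point of the exercise is the spread between the two rows, not either number on its own; measured values on the target machine should replace both.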
The TTFT-vs-ITL Distinction
Why does the regime split matter? Because the optimization techniques are different.
For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Lunar Lake NPU reports 1.09 s.
For ITL, every token requires streaming the entire weight tensor through the MAC array once. At 4 GB INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical upper bound on decode speed is 136.5 / 4 ≈ 34 tok/s. The 6.10 tok/s observed equals about 18% of that ceiling, eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. You cannot quantize your way past this ceiling; you can only halve the weight memory by going INT4, which roughly halves decode latency relative to INT8.
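The bandwidth arithmetic above is short enough to keep as a reusable check. This is a sketch that uses only the figures quoted in this section; the function name is illustrative, and the model ignores KV-cache traffic, so real ceilings are somewhat lower.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode rate if every token streams all weights once."""
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 136.5   # Lunar Lake LPDDR5X figure quoted in the text
WEIGHTS_GB = 4.0         # 8B parameters at INT4, as quoted in the text
OBSERVED_TOK_S = 6.10    # OpenVINO Model Hub figure quoted in the text

ceiling = decode_ceiling_tok_s(BANDWIDTH_GB_S, WEIGHTS_GB)
utilisation = OBSERVED_TOK_S / ceiling

print(f"ceiling: {ceiling:.1f} tok/s, observed share: {utilisation:.0%}")
```

Swapping in a different weight size shows why quantization matters for decode: doubling `WEIGHTS_GB` halves the ceiling.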
The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.
The two have completely different bottlenecks, so a model that benchmarks well on TTFT can still feel terrible to use if its ITL is high, and vice versa. Measure both.
Cold Start
Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is:
The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token".
Why Decode is Memory-Bound
On an NPU running a transformer model, generating each new token requires reading every weight at least once and the entire KV cache for every attention head. The compute itself (multiplying a single-token query against the cached keys and values) finishes long before the data has been read from memory.
This is the regime most NPU agents live in. It has two important implications:
This is why the industry obsesses over 4-bit quantization at the edge. It's not vanity — it's the difference between a fluid response and one that stutters.
Cold Start vs. Steady State
NPU agents pay a one-time cost on cold start that doesn't appear in steady-state benchmarks:
On a flagship mobile NPU, cold start for a 1B-parameter model is often in the 500ms–2s range. For a laptop NPU loading a 7B model, it can be 5–10 seconds.
You handle cold start with one or more of:
Three Hardware-Aware Design Patterns
The constraints above shape a small set of architectural patterns that consistently work for NPU-based agents. You'll see these recur throughout the book.
Pattern 1: The Cascade
Use a small, fast model to decide whether the larger model needs to be invoked at all.
user query → classifier (tiny, ~10ms)
├── trivial / cached → templated response
├── needs reasoning → NPU LLM
└── needs world knowledge → cloud LLM
This pattern works because the routing decision is almost always cheaper than the answer. A 50M-parameter classifier can handle 80–90% of traffic in many agent domains (greetings, simple lookups, repeated queries) without ever waking the larger model.
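The routing diagram above can be sketched as a small function. The keyword rules here are a deliberately crude, hypothetical stand-in for the tiny classifier; `Route`, `GREETINGS`, and `WORLD_KNOWLEDGE_HINTS` are illustrative names, not part of any SDK.

```python
from enum import Enum


class Route(Enum):
    TEMPLATE = "templated response"
    NPU_LLM = "NPU LLM"
    CLOUD_LLM = "cloud LLM"


# Hypothetical rules standing in for a trained ~50M-parameter classifier.
GREETINGS = {"hi", "hello", "hey", "thanks"}
WORLD_KNOWLEDGE_HINTS = ("latest", "news", "today", "price of")


def route(query: str, cache: dict) -> Route:
    """Pick the cheapest tier that can plausibly answer the query."""
    q = query.strip().lower()
    if q in cache or q.rstrip("!.") in GREETINGS:
        return Route.TEMPLATE          # trivial or cached
    if any(hint in q for hint in WORLD_KNOWLEDGE_HINTS):
        return Route.CLOUD_LLM         # needs world knowledge
    return Route.NPU_LLM               # needs local reasoning


if __name__ == "__main__":
    for text in ("Hello!", "Summarize this paragraph", "What is the latest kernel release?"):
        print(f"{text!r} -> {route(text, cache={}).value}")
```

The shape matters more than the rules: the router returns a decision, and only the caller wakes a model, so the expensive tiers stay idle for traffic that never needs them.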
Pattern 2: Tool-First Reasoning
Push computation off the NPU and into tools that run on the CPU or remotely.
The NPU model's job is to decide which tool to call and how to format the result for the user. The actual work (database lookups, calculations, retrieval, API calls) happens elsewhere. This keeps the NPU on what it's good at (language understanding and generation) and avoids stuffing world knowledge into a model that can't hold it.
Chapter 3 covers this in detail, but the principle starts here: the NPU model should be the orchestrator, not the database.
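As a minimal sketch of the orchestrator role, assume the model emits a small JSON object naming a tool and its arguments; that output format, the `TOOLS` registry, and the two example tools are all assumptions made for illustration.

```python
import json

# Hypothetical in-memory data standing in for a real database.
ORDERS = {"A100": "shipped", "A101": "processing"}


def add(a: float, b: float) -> float:
    """Arithmetic runs on the CPU, not in the model."""
    return a + b


def lookup_order(order_id: str) -> str:
    """Facts come from a store, not from model weights."""
    return ORDERS.get(order_id, "unknown")


TOOLS = {"add": add, "lookup_order": lookup_order}


def dispatch(model_output: str):
    """Run the tool the model asked for and return its result."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("args", {}))


if __name__ == "__main__":
    print(dispatch('{"tool": "lookup_order", "args": {"order_id": "A100"}}'))
```

The model only has to produce a short, structured request; everything that needs precision or fresh data happens in ordinary code.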
Pattern 3: Speculative Decoding
Run a small "draft" model on the CPU or NPU that generates several tokens ahead, then verify them in parallel with the larger model. When the draft is right (often 60–80% of the time for natural language), you get multiple tokens per NPU forward pass.
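The draft-then-verify loop can be shown with toy stand-ins for both models. This sketch uses greedy verification and deterministic lookup functions in place of real networks; all names are hypothetical. A real implementation scores all drafted positions in one batched forward pass of the larger model, whereas the loop below calls the target once per position purely for clarity.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_step(prefix: Sequence[str], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[str]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # 1. The small model proposes k tokens autoregressively.
    proposal: List[str] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The large model checks each proposed position.
    accepted: List[str] = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)   # take the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched: the target's next token comes for free.
    accepted.append(target_next(ctx))
    return accepted


# Toy "models": each returns the next word of a fixed sentence.
TARGET = "the cat sat on the mat today".split()
DRAFT = "the cat lay on the mat today".split()


def target_next(ctx: Sequence[str]) -> str:
    return TARGET[len(ctx)]


def draft_next(ctx: Sequence[str]) -> str:
    return DRAFT[len(ctx)]
```

Whatever the draft proposes, the accepted tokens always equal what the target would have produced on its own; the draft only changes how many tokens each verification pass yields.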
Speculative decoding can deliver 2–3x effective speedup on decode, at the cost of additional model load and orchestration complexity. It's increasingly standard in production stacks, and worth knowing about even if you're not implementing it yourself.
A Profiling Discipline
If you take only one habit from this chapter, make it this: never reason about NPU performance from a spec sheet.
The actual workflow looks like:
Define a representative workload: real prompts, real tool calls, real session lengths.
Measure TTFT and ITL separately, p50 and p95, on each target device.
Most NPU SDKs (Core ML Tools, OpenVINO's benchmark_app, QNN profiler, ONNX Runtime's profiler) emit per-operator timings. Use them. The intuition you build from real profiling data is worth more than any rule of thumb in this book, including the ones in this chapter.
Wrapping Up Chapter 1
You now have the foundations. To recap:
NPUs are integer-first, memory-constrained accelerators built for inference, not training
Three constraints govern every deployment: memory, operator coverage, numerical precision
Quantization isn't optional; it's the entry ticket, and INT4 is the practical norm for LLMs at the edge
First-Run Compilation and Caching
The multi-minute first-run delay is the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the NPU Level Zero context to eliminate an in-memory copy.
For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (a more complex graph with more shapes to consider). Pad your first-run latency budget accordingly.
The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.
The Cascade Pattern
The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel; cascades exist in cloud serving too.
The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: a Qwen3-8B INT4 target on iGPU, paired with Qwen3-0.6B. The blog motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes:
Small-NPU + Big-iGPU: cheap classification or routing on NPU (5–20 ms per call, sustained low power), heavy generation on iGPU when the agent decides it's needed
Small-NPU draft + Big-NPU target (speculative decoding): the small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet
Big-NPU prefill + Big-CPU decode: the Phi Silica pattern. The NPU eats the compute-bound prompt; the CPU streams the decode
The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime selects the highest-priority compatible device per subgraph, falling back automatically when a device is unavailable or doesn't support a given op.
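A sketch of how the device-priority string and the compile cache might be wired together follows. It assumes the `openvino` Python package is installed and that a model file exists at whatever path is passed in; the helper names, the `ov_cache` directory, and the first-run notice text are illustrative choices, and whether a given blob is actually reused from the cache depends on the device and driver.

```python
from pathlib import Path
from typing import Sequence


def device_priority(devices: Sequence[str]) -> str:
    """Build an AUTO device string, highest priority first."""
    return "AUTO:" + ",".join(devices)


def is_first_run(cache_dir: str) -> bool:
    """Treat a missing or empty cache directory as a cold start."""
    path = Path(cache_dir)
    return not path.is_dir() or not any(path.iterdir())


def compile_with_cache(model_path: str, cache_dir: str = "ov_cache",
                       devices: Sequence[str] = ("NPU", "GPU", "CPU")):
    """Compile a model with caching enabled and warn the user on cold start."""
    import openvino as ov  # imported lazily so the helpers above work without it

    if is_first_run(cache_dir):
        # Tell the user instead of hiding the delay.
        print("First run: compiling the model for this device, which can take a while.")
    core = ov.Core()
    core.set_property({"CACHE_DIR": cache_dir})  # later runs can reuse compiled blobs
    return core.compile_model(model_path, device_priority(devices))
```

Keeping the string builder and the cold-start check as plain functions means they can be tested on a machine without an NPU or OpenVINO installed.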
Phi Silica as the Canonical Reference
The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.
What matters for this book is the architecture, which is exactly what we're recommending for M2M-100:
Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation; Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.
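Template-based routing needs no model at all. The sketch below is illustrative only: the action names and template wording are invented here and are not the templates Click to Do uses.

```python
# Hypothetical templates keyed by the UI action the user picked.
TEMPLATES = {
    "summarize": "Summarize the following text in three sentences:\n\n{text}",
    "rewrite": "Rewrite the following text to be clearer, keeping its meaning:\n\n{text}",
}


def build_prompt(action: str, text: str) -> str:
    """Turn a UI action into a prompt without spending an NPU call on routing."""
    try:
        template = TEMPLATES[action]
    except KeyError:
        raise ValueError(f"no template for action: {action}") from None
    return template.format(text=text)


if __name__ == "__main__":
    print(build_prompt("summarize", "NPU decode is memory-bandwidth-bound."))
```

The dictionary lookup is the entire router, so the only NPU work per request is the generation the user actually asked for.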
What Hasn't Been Published
Honest gaps that should color how confidently you cite numbers in this book:
If you encounter precise numbers that aren't in the table at the top of this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.
What This Section Bought You
You should now understand:
Cold start is dominated by first-run compilation; subsequent runs hit CACHE_DIR and skip it
The cascade pattern is the Intel-native agent architecture — small-on-NPU + big-on-iGPU, or speculative decoding within a single device
Phi Silica is the reference deployment: CPU tokenizer/embedding/LM-head + NPU transformer + CPU decode with KV reuse, all published in the Windows Experience Blog
Templated prompts beat learned routers for NPU-bound agents — every avoidable NPU call wastes the budget
Chapter 1 ends here. Chapter 2 turns from the model to the agent: given a system that can run M2M-100 on Intel NPU, how do we manage agent state, context, memory, and decision-making inside the constraints we've now mapped? Chapter 3 turns to tool design. Chapter 4 covers deployment, observability, and the operational reality of running agents in production. Chapter 5 closes with case studies from teams who've shipped real NPU agents and what they learned the hard way.
If you're going to do one thing before moving on: pick a target NPU, pick a candidate model, and actually measure TTFT and ITL on it. Everything that follows will land harder if you have those numbers in hand.
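Once those measurements exist, summarizing them as p50 and p95 takes a few lines. This sketch uses the nearest-rank percentile definition, which is one common convention among several; the function names and the sample values in the usage example are made up.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(ttfts_s: Sequence[float], itls_s: Sequence[float]) -> Dict[str, float]:
    """Report TTFT and ITL separately, at the median and the tail."""
    return {
        "ttft_p50": percentile(ttfts_s, 50),
        "ttft_p95": percentile(ttfts_s, 95),
        "itl_p50": percentile(itls_s, 50),
        "itl_p95": percentile(itls_s, 95),
    }


if __name__ == "__main__":
    # Made-up measurements, in seconds, purely to show the call shape.
    print(summarize([1.1, 1.0, 1.3, 2.4], [0.16, 0.17, 0.16, 0.21]))
```

Reporting the tail alongside the median matters because a single slow first token or a stutter mid-stream is what users remember.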