1.3 Latency, Throughput, and Hardware-Aware Patterns
The architecture and constraints from Chapters 1.1 and 1.2 set the ceiling. This section is about measuring it: what does a real model's latency profile look like on Intel hardware, how does that latency break down, and what does that imply for the agent-loop design patterns Chapter 2 will develop?
We use two published benchmarks as anchors: Llama 2 7B on MLPerf Client v0.6, measured by Intel on a Core Ultra Series 1 processor, and DeepSeek-R1-Distill-Llama-8B INT4 on OpenVINO Model Hub. Both are real data points that set the floor and ceiling for what you can expect.
The Two Key Latency Metrics: TTFT and ITL
Model inference latency on accelerators is traditionally quoted as a single number (e.g., "inference takes 50 ms"). That framing does not fit LLMs, because LLM inference has two phases with radically different characteristics.
Time-To-First-Token (TTFT) is the latency of the prefill phase: the time from when you send the prompt to when the model emits the first output token. This is how long the user waits before anything appears. The prompt is static, potentially long (hundreds of tokens), and the entire computation is on the critical path: you can't generate a second token until the first one exists. TTFT is compute-bound.
Inter-Token Latency (ITL), also called per-token decode latency, is the latency of each subsequent token in the decode phase: how fast text streams after generation starts. The decoder sees only the new token slot plus the KV cache, and the computation is roughly constant per new token. ITL is memory-bandwidth-bound on NPU: there is one matmul per token, but every weight has to be streamed from DRAM.
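The two metrics can be pulled out of any token stream by timestamping each token as it arrives. The sketch below is a minimal illustration, not a benchmark harness: `measure_stream` and its `clock` parameter are hypothetical names introduced here, and the token source can be any iterable that yields tokens as the model produces them.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, mean_itl_seconds) for a token stream.

    TTFT is the time from the start of the call to the first token.
    ITL is the mean gap between consecutive tokens after the first.
    """
    start = clock()
    stamps = []
    for _ in tokens:
        # Record the arrival time of every token.
        stamps.append(clock())
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl
```

Passing the clock in as a parameter keeps the function testable with a fake clock. In real use, leave the default and hand it the generator returned by your streaming inference call.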
On Intel Core Ultra, two published anchor benchmarks illustrate this split:
Llama 2 7B on MLPerf Client v0.6 (Intel internal, Core Ultra Series 1 Meteor Lake): TTFT 1.09 s, decode 18.55 tok/s.
DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU (OpenVINO Model Hub public benchmark, Feb 2025, Intel NUC 14 Pro with Lunar Lake): decode 6.10 tok/s.
The roughly 3× gap between Llama 2 (18.55 tok/s) and DeepSeek-Distill-8B (6.10 tok/s) is real. A naive explanation is parameter count: 7B vs 8B is 14% more matmul. But the gap is about 3×, not 14%, which means something structural is different. The honest answer: these are measured on different hardware revisions (Series 1 Meteor Lake vs Series 2 Lunar Lake is a 4× MACs gain), different quantization targets (Llama 2 at FP16? INT8?), and different workload assumptions (batch size, prompt length). The benchmarks are not apples-to-apples; treat them as reference ranges.
For comparison, DeepSeek-R1-Distill-Llama-8B INT4 on the same SoC's iGPU reaches 12.80 tok/s. The iGPU is 2.1× faster than the NPU for 8B INT4 decode. The DeepSeek figure is the better anchor for reasoning-model workloads. Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.
The Roofline: Hardware Limits
The sustainable throughput on Intel NPU is bounded by the LPDDR5X bandwidth ceiling from Chapter 1.1: 136.5 GB/s platform-wide, shared among CPU, iGPU, and NPU. No device gets the full 136.5 GB/s; the actual per-device quota depends on driver scheduling and competing loads.
For an 8B INT4 model, decode streams about 4 GB of weights per token, so the measured 6.10 tok/s corresponds to roughly 6.10 × 4 = 24.4 GB/s of weight traffic.
This is roughly 18% of platform peak bandwidth. The NPU is not starving, but it's not saturating the bus either. The gap between 24.4 GB/s and 136.5 GB/s is scheduling overhead, driver latency, and contention from other agents on the SoC (CPU, iGPU). The roofline model says: if you could eliminate all contention and overhead, you'd hit bandwidth saturation at roughly (136.5 GB/s) / (4 GB model weight) = 34 tok/s, about 5.6× higher than what's measured. That gap is real and structural.
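The roofline arithmetic is small enough to keep as a helper and rerun for other model sizes. This is a minimal sketch using only the figures quoted in this section (136.5 GB/s platform bandwidth, about 4 GB of INT4 weights, 6.10 tok/s measured). The function name `decode_roofline` is a label introduced here, and the model assumes every weight is read exactly once per token, ignoring KV-cache traffic.

```python
def decode_roofline(weight_gb: float, peak_bw_gbs: float, measured_tok_s: float):
    """Bandwidth-bound decode estimate.

    Returns (ceiling_tok_s, effective_bw_gbs, utilization), assuming each
    generated token streams the full weight set from DRAM once.
    """
    ceiling_tok_s = peak_bw_gbs / weight_gb        # best case if bandwidth were saturated
    effective_bw_gbs = measured_tok_s * weight_gb  # bandwidth implied by the measurement
    utilization = effective_bw_gbs / peak_bw_gbs   # fraction of platform peak in use
    return ceiling_tok_s, effective_bw_gbs, utilization


ceiling, bw, util = decode_roofline(weight_gb=4.0, peak_bw_gbs=136.5, measured_tok_s=6.10)
print(f"ceiling {ceiling:.1f} tok/s, effective {bw:.1f} GB/s, utilization {util:.0%}")
```

Swapping in a smaller `weight_gb` shows directly why a smaller model or lower precision is the main lever on decode speed under this model.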
The TTFT-vs-ITL Distinction
Why does the regime split matter? Because the optimization techniques are different.
For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Intel NPU (MLPerf Client v0.6) reports 1.09 s.
For ITL, every token requires streaming the entire weight tensor through the MAC array once. At 4 GB INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical best case is 136.5 / 4 = 34 tok/s. The 6.10 tok/s observed equals about 18% of that ceiling, eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. You cannot quantize your way past this ceiling; you can only halve the weight memory by going INT4, which roughly halves decode latency relative to INT8.
The practical implication: you cannot expect sustained decode speeds above 15–20 tok/s on Lunar Lake NPU for reasonable models. Going faster requires either a smaller model, lower precision (NF4, FP8 on NPU 5), or moving decode to the iGPU.
The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.
Comparing to iGPU
The same Core Ultra platform has an Xe2 iGPU (Lunar Lake) or Xe1 iGPU (Meteor Lake). The iGPU is not held to the same bandwidth constraint as the NPU, and it is substantially faster for decode workloads.
On the same hardware (Core Ultra Series 2), Llama 2 7B typically reaches ~40 tok/s on iGPU (measured by community benchmarks, not an Intel-published figure). That's a 2.1× speedup over NPU for decode. For prefill (TTFT), the gap is wider: iGPU TTFT is typically 300–400 ms for a 128-token prompt, vs 1.09 s on NPU.
The hybrid story emerges: if you can split the workload with prefill on NPU and decode on iGPU, you get 2.1× throughput for the large constant-cost phase (decode) and take the hit on prefill.
Cold Start
Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is: expect a slow first run, then fast cached runs.
The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token is generated." That's the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the Level Zero context to eliminate an in-memory copy.
For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.
The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.
The Cascade Pattern
The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel (cascades exist in cloud serving too), but the Intel single-die integration makes the device-routing version of the pattern especially natural.
The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: Qwen3-8B INT4 target on iGPU, Qwen3-0.6B INT8 draft on the same iGPU, 1.3–1.4× speedup via speculative decoding for a smolagents-based reasoning agent. Intel motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes.
The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime selects the highest-priority available device that can run the model, falling back automatically when a device is unavailable or doesn't support a given op.
Phi Silica as the Canonical Reference
The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.
What matters for this book is the architecture, which is exactly what we're recommending for M2M-100.
What Phi Silica Tells Us
Microsoft's Phi Silica is the closest public reference architecture for an NPU-targeted LLM, deployed on Snapdragon X (Qualcomm NPU, not Intel). The published numbers are TTFT 230 ms, 20 tok/s sustained on a 2K context window. The architecture is: CPU tokenizer + embedding + LM-head, NPU transformer blocks, CPU decode with N=64 KV sliding window.
This is instructive not because Snapdragon X hardware maps cleanly to Intel NPU (it doesn't), but because it shows what real deployed decisions look like: encoder on accelerator, decoder split between accelerator and CPU, because the decode phase's structure (lots of memory, little compute per token) is where the accelerator's architecture breaks down.
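Phi Silica also exposes the sliding-window KV cache technique: instead of keeping the full context KV in memory, keep only the most recent N tokens (here N=64). In Phi Silica the KV cache is held in CPU memory via that sliding window, which also escapes the static-shape constraint. This trades recompute (re-running attention over discarded context) for memory bandwidth. For NPU where bandwidth is the constraint, this trade-off wins. The Llama 2 and DeepSeek-Distill benchmarks above use full KV caches. If they switched to sliding-window N=128, ITL would drop materially, but context awareness would degrade after 128 tokens. This is a tuning knob for the agent design.
Two other details of the shipping design are worth noting. Long prompts are decomposed into 64-token chunks for prefill, an early form of chunked prefill. And Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation; Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.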
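The sliding-window KV cache idea described above can be sketched with a fixed-length buffer. This is an illustration of the eviction policy only, not Phi Silica's implementation: the class name `SlidingKVCache` is hypothetical, and the keys and values are placeholder strings where a real cache would hold per-layer tensors.

```python
from collections import deque


class SlidingKVCache:
    """Keep key/value entries for only the most recent `window` tokens."""

    def __init__(self, window: int = 64):
        self.window = window
        # deque with maxlen drops the oldest entry automatically on overflow.
        self._entries = deque(maxlen=window)

    def append(self, position: int, key, value) -> None:
        self._entries.append((position, key, value))

    def positions(self) -> list:
        """Token positions still visible to attention."""
        return [pos for pos, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


cache = SlidingKVCache(window=64)
for pos in range(100):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")

# After 100 tokens only positions 36..99 remain in the window.
print(len(cache), cache.positions()[0], cache.positions()[-1])
```

The buffer size never grows past `window`, so per-token memory traffic for the cache stays constant regardless of how long the conversation runs. The cost is that tokens older than the window are no longer attended to.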
Architecture-Specific Wisdom
Three things deserve to be nailed down because they're easy to get wrong:
Batching doesn't help on NPU for decode, even though on GPU you can batch.
LPDDR5X speed is shared, not divisible. The 136.5 GB/s includes all traffic: CPU instruction fetches, iGPU reads, NPU memory traffic. If the CPU is running code and the iGPU is running a concurrent task, the NPU's available bandwidth drops. If you want predictable NPU performance, you need to account for potential contention. The Phi Silica sliding-window approach partly exists to reduce bandwidth hunger, reducing contention sensitivity.
Compile-time overhead is real. The first invocation of a compiled model on NPU takes 30–60 seconds (from Chapter 1.2 cold-start benchmarks). Subsequent invocations take <3 seconds (warm start, cached to disk via CACHE_DIR). This cost is amortized over the model's lifetime in production, but for development and short-running agents, it's a gotcha. Always set CACHE_DIR to a persistent location; otherwise you pay the cold-start penalty on every process restart.
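A minimal sketch of the persistent-cache setup follows. It uses OpenVINO's Python API (`Core.set_property` with the `CACHE_DIR` key, then `Core.compile_model`) together with the `AUTO:NPU,GPU,CPU` priority string mentioned in this section. The helper names `cache_config` and `compile_with_cache` and the default cache location are assumptions made for illustration. OpenVINO is imported lazily so the configuration helper can be used and tested on a machine without it installed.

```python
import os
from pathlib import Path


def cache_config(cache_root=None) -> dict:
    """Build an OpenVINO property dict pointing CACHE_DIR at a persistent folder.

    The default location under the user's home directory is an assumption;
    any path that survives process restarts will do.
    """
    root = Path(cache_root) if cache_root else Path.home() / ".cache" / "ov_model_cache"
    os.makedirs(root, exist_ok=True)
    return {"CACHE_DIR": str(root)}


def compile_with_cache(model_path: str, device: str = "AUTO:NPU,GPU,CPU", cache_root=None):
    """Compile a model with on-disk caching enabled (requires OpenVINO)."""
    import openvino as ov  # imported here so cache_config works without OpenVINO

    core = ov.Core()
    core.set_property(cache_config(cache_root))  # enable the compiled-blob cache
    return core.compile_model(model_path, device)
```

The first call to `compile_with_cache` still pays the full compile cost; later calls with the same model and device should load the cached blob instead. Timing both runs on your own hardware is the only way to confirm the warm-start figure.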
What Hasn't Been Published
Honest gaps that should color how confidently you cite numbers in this book:
No Intel-published Phi Silica numbers. All Phi Silica metrics in circulation are from Snapdragon X Elite. Phi Silica reached Intel Copilot+ PCs through Windows Updates (KB5079266, KB5084176, KB5089866) during 2025, but the comparative performance data isn't in the public record.
If you encounter precise numbers that aren't among the benchmarks at the top of this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.
The Agent-Loop Latency Budget
A 5-step agent loop, where the agent reasons, takes an action, observes the result, and repeats, looks like this in latency terms:
Step 1 prefill: 512-token accumulated prompt, ~4 seconds TTFT
On NPU: ~70–75 seconds for the full loop.
On iGPU: ~35 seconds.
This is the roofline for agent patterns on Intel NPU, and it's the reason behind the Chapter 2 reasoning-architecture recommendations: ReAct (which is inherently loopy) doesn't fit the latency budget, but single-shot and cascade patterns do.
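The loop budget can be reproduced with a two-term model: each step pays one prefill (TTFT) plus a decode cost proportional to the tokens it generates. The sketch below is a rough estimate under stated assumptions. The ~4 s TTFT and the 6.10 tok/s decode rate come from this section; the 64 generated tokens per step is a hypothetical value chosen for illustration, and TTFT is held constant even though a real accumulated prompt grows from step to step.

```python
def loop_latency_s(steps: int, ttft_s: float, tokens_per_step: int, decode_tok_s: float) -> float:
    """Estimate total wall-clock time for a sequential agent loop.

    Each step = one prefill (ttft_s) + tokens_per_step / decode_tok_s of decode.
    """
    per_step = ttft_s + tokens_per_step / decode_tok_s
    return steps * per_step


# Assumed: 64 output tokens per step (hypothetical), constant 4 s TTFT.
npu_total = loop_latency_s(steps=5, ttft_s=4.0, tokens_per_step=64, decode_tok_s=6.10)
print(f"5-step loop on NPU: ~{npu_total:.0f} s")
```

Under these assumptions the estimate lands inside the ~70–75 second range quoted for a 5-step ReAct loop on NPU. Because the step count multiplies the whole per-step cost, cutting steps (single-shot or cascade designs) helps more than shaving any one term.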
What This Section Bought You
You should now understand:
- TTFT and ITL are two distinct metrics with different hardware bottlenecks: TTFT is compute-bound, ITL is memory-bandwidth-bound
- Set CACHE_DIR, always
- A 5-step ReAct loop takes ~70–75 seconds on NPU, which is the reason Chapter 2 favors single-shot and cascade patterns
Chapter 1 ends here. Chapter 2 turns from hardware to software: given these latency budgets, how does model state (KV cache, attention memory) factor into design, and what reasoning architectures actually work within that constraint?
Previous: 1.2 Computational Constraints & Model Optimization
Next: Chapter 2: Agent State & Decision-Making on Constrained Hardware