1.3 Latency, Throughput, and Hardware-Aware Patterns

The architecture and constraints from Chapters 1.1 and 1.2 set the ceiling. This section is about measuring it: what does a real model's latency profile look like on Intel hardware, how does that latency break down, and what does that imply for the agent loop design patterns Chapter 2 will develop?

We use two published benchmarks as anchors: Llama 2 7B on MLPerf Client v0.6, measured by Intel on a Core Ultra Series 1 processor, and DeepSeek-R1-Distill-Llama-8B INT4 on OpenVINO Model Hub. Both are real data points that set the floor and ceiling for what you can expect.

The Two Key Latency Metrics: TTFT and ITL

Model inference latency on accelerators is traditionally quoted as a single number (e.g., "inference takes 50 ms"). That framing doesn't fit LLMs, which have two phases with radically different characteristics.

Time-To-First-Token (TTFT) is the latency of the prefill phase: the time from when you send the prompt to when the model emits the first output token. The prompt is static, potentially long (hundreds of tokens), and the entire computation is on the critical path: you can't generate a second token until the first one exists. TTFT is compute-bound.

Inter-Token Latency (ITL) is the latency of each subsequent token in the decode phase. The decoder sees only the new token slot plus the KV cache, and the computation is roughly constant per new token. ITL is memory-bandwidth-bound on NPU.
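The two definitions above translate directly into a measurement harness. The following is a minimal sketch, not a benchmark tool: `measure_stream` and `fake_stream` are hypothetical names introduced here, and the fake stream sleeps instead of running a model. Any iterator that yields tokens as they are generated could be substituted.

```python
import time
from typing import Callable, Iterable, Iterator, List, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Return (ttft_seconds, inter-token gaps in seconds) for a token stream."""
    start = clock()
    ttft = None
    gaps: List[float] = []
    last = start
    for _ in tokens:
        now = clock()
        if ttft is None:
            ttft = now - start       # prefill: request sent -> first token
        else:
            gaps.append(now - last)  # decode: gap between consecutive tokens
        last = now
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, gaps


def fake_stream(prefill_s: float, itl_s: float, n_tokens: int) -> Iterator[str]:
    """Hypothetical stand-in for a model: sleeps instead of computing."""
    time.sleep(prefill_s)
    yield "tok0"
    for i in range(1, n_tokens):
        time.sleep(itl_s)
        yield f"tok{i}"


if __name__ == "__main__":
    ttft, gaps = measure_stream(fake_stream(0.05, 0.01, 5))
    mean_itl = sum(gaps) / len(gaps)
    print(f"TTFT: {ttft * 1000:.0f} ms, mean ITL: {mean_itl * 1000:.0f} ms")
```

Reporting the two numbers separately, rather than a single end-to-end latency, is what lets you tell a compute-bound prefill problem from a bandwidth-bound decode problem.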

On Intel Core Ultra, the published benchmarks nail this split:

Llama 2 7B on MLPerf Client v0.6 (Intel internal, Core Ultra Series 1 Meteor Lake):

TTFT at 128 input tokens: 1.09 seconds

ITL (tokens 2+): ~54 ms/token

Implied throughput: 18.55 tok/s sustained

DeepSeek-R1-Distill-Llama-8B INT4 on OpenVINO Model Hub (public benchmark, Intel NUC 14 Pro with Lunar Lake):

Measured at 6.10 tok/s sustained, which is ~164 ms/token ITL

TTFT is not published; extrapolate from the 8B size and INT4 quantization

The roughly 3× gap between Llama 2 (18.55 tok/s) and DeepSeek-Distill-8B (6.10 tok/s) is real. A naive explanation is parameter count: 7B vs 8B is 14% more matmul. But the gap is about 3×, not 14%, which means something structural is different. The honest answer: these are measured on different hardware revisions (Series 1 Meteor Lake vs Series 2 Lunar Lake is a 4× MACs gain), different quantization targets (Llama 2 at FP16? INT8?), and different workload assumptions (batch size, prompt length). The benchmarks are not apples-to-apples; treat them as reference ranges.
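The per-token latencies and the size of the gap follow from the two published throughput figures alone. A small sketch of that arithmetic, where `itl_ms` is a helper name introduced here for illustration:

```python
def itl_ms(tokens_per_second: float) -> float:
    """Convert sustained decode throughput into per-token latency in ms."""
    return 1000.0 / tokens_per_second


LLAMA2_7B_TOK_S = 18.55    # MLPerf Client v0.6 figure quoted above
DEEPSEEK_8B_TOK_S = 6.10   # OpenVINO Model Hub figure quoted above

llama_itl = itl_ms(LLAMA2_7B_TOK_S)
deepseek_itl = itl_ms(DEEPSEEK_8B_TOK_S)
throughput_gap = LLAMA2_7B_TOK_S / DEEPSEEK_8B_TOK_S
param_ratio = 8e9 / 7e9    # nominal parameter counts, 8B vs 7B

print(f"Llama 2 7B ITL:     {llama_itl:.1f} ms/token")
print(f"DeepSeek 8B ITL:    {deepseek_itl:.1f} ms/token")
print(f"Throughput gap:     {throughput_gap:.2f}x")
print(f"Parameter increase: {(param_ratio - 1) * 100:.0f}%")
```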

The Roofline: Hardware Limits

The sustainable throughput on Intel NPU is bounded by the LPDDR5X bandwidth ceiling from Chapter 1.1: 136.5 GB/s platform-wide, shared among CPU, iGPU, and NPU. No device gets the full 136.5 GB/s; the actual per-device quota depends on driver scheduling and competing loads.

For an 8B INT4 model:

Weight memory: 4 GB (8B params × 4 bits/param / 8)

Sustained throughput: 6.10 tok/s (from the published benchmark)

DRAM read rate: 4 GB × 6.10 tok/s = 24.4 GB/s

This is roughly 18% of platform peak bandwidth. The NPU is not starving, but it's not saturating the bus either. The gap between 24.4 GB/s and 136.5 GB/s is scheduling overhead, driver latency, and contention from other agents on the SoC (CPU, iGPU). The roofline model says: if you could eliminate all contention and overhead, you'd hit bandwidth saturation at roughly (136.5 GB/s) / (4 GB model weight) = 34 tok/s, about 5.6× higher than what's measured. That gap is real and structural.

The practical implication: you cannot expect sustained decode speeds above 15–20 tok/s on Lunar Lake NPU for reasonable models. Going faster requires either a smaller model, lower precision (NF4, FP8 on NPU 5), or moving decode to the iGPU.
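The roofline arithmetic is short enough to check in code. This sketch assumes the simplest possible model, namely that every decoded token streams the full weight set from DRAM once, and it ignores KV cache and activation traffic. The function name `decode_roofline` is introduced here for illustration.

```python
def decode_roofline(params: float, bits_per_param: int,
                    platform_bw_gb_s: float, measured_tok_s: float) -> dict:
    """Bandwidth-only decode roofline: one full weight read per token."""
    weight_gb = params * bits_per_param / 8 / 1e9    # bytes -> GB
    read_rate_gb_s = weight_gb * measured_tok_s      # observed DRAM traffic
    ceiling_tok_s = platform_bw_gb_s / weight_gb     # if the bus were saturated
    return {
        "weight_gb": weight_gb,
        "read_rate_gb_s": read_rate_gb_s,
        "utilization": read_rate_gb_s / platform_bw_gb_s,
        "ceiling_tok_s": ceiling_tok_s,
        "headroom": ceiling_tok_s / measured_tok_s,
    }


r = decode_roofline(params=8e9, bits_per_param=4,
                    platform_bw_gb_s=136.5, measured_tok_s=6.10)
print(f"weights:     {r['weight_gb']:.1f} GB")
print(f"read rate:   {r['read_rate_gb_s']:.1f} GB/s")
print(f"utilization: {r['utilization'] * 100:.0f}% of platform peak")
print(f"ceiling:     {r['ceiling_tok_s']:.1f} tok/s")
print(f"headroom:    {r['headroom']:.1f}x over measured")
```

Swapping in a different parameter count or precision shows how the ceiling moves; the measured throughput has to come from a benchmark on your own hardware.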

Comparing to iGPU

The same Core Ultra platform has an Xe2 iGPU (Lunar Lake) or Xe1 iGPU (Meteor Lake). The iGPU is not held to the same effective bandwidth quota as the NPU (it has its own path to memory), and it's substantially faster for decode workloads.

On the same hardware (Core Ultra Series 2), Llama 2 7B typically reaches ~40 tok/s on iGPU (measured by community benchmarks; Intel does not publish iGPU LLM numbers). That's a 2.1× speedup over NPU for decode. For prefill (TTFT), the gap is wider: iGPU TTFT is typically 300–400 ms for a 128-token prompt, vs 1.09 seconds on NPU, roughly a 3× gap.

The hybrid story emerges: if you can split the workload with prefill on NPU and decode on iGPU, you get 2.1× throughput for the large constant-cost phase (decode) and take the NPU's hit only on the one-time prefill. Chapter 3.1 builds the code for this pattern.

Cold Start

For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.

The Cascade Pattern

In a cascade, a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel (cascades exist in cloud serving too), but the Intel single-die integration makes the device-routing version of the pattern especially natural.

Speculative decoding is a related variant: a small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet.

The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime selects the highest-priority compatible device for the model, falling back automatically when a device is unavailable or doesn't support a given op.
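The cascade idea (a small model answers when it can, a larger one takes over when it can't) reduces to a confidence-gated dispatch. The sketch below is illustrative only: `Cascade`, `toy_small`, `toy_large`, and the 0.8 threshold are hypothetical, and the toy models are plain functions standing in for compiled NPU and iGPU models.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Cascade:
    """Route to the small model first; escalate when its confidence is low."""
    small: Callable[[str], Tuple[str, float]]  # returns (answer, confidence 0..1)
    large: Callable[[str], str]
    threshold: float = 0.8                     # assumed cut-off, tune per task

    def run(self, prompt: str) -> Tuple[str, str]:
        answer, confidence = self.small(prompt)
        if confidence >= self.threshold:
            return answer, "small"
        return self.large(prompt), "large"     # fallback pays the full cost


def toy_small(prompt: str) -> Tuple[str, float]:
    # Hypothetical heuristic: confident only on short prompts.
    confidence = 0.9 if len(prompt.split()) <= 4 else 0.3
    return f"small:{prompt}", confidence


def toy_large(prompt: str) -> str:
    return f"large:{prompt}"


cascade = Cascade(small=toy_small, large=toy_large)
print(cascade.run("turn on lights"))
print(cascade.run("summarize the last three meetings and draft a reply"))
```

The design choice worth noting is that the small model always runs, so its latency is paid on every request; the cascade only wins when escalations are rare enough to offset that fixed cost.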

What Phi Silica Tells Us

Microsoft's Phi Silica is the closest public reference architecture for an NPU-targeted LLM, deployed on Snapdragon X (Qualcomm NPU, not Intel). The published numbers are TTFT 230 ms, 20 tok/s sustained on a 2K context window. The architecture is: CPU tokenizer + embedding + LM-head, NPU transformer blocks, CPU decode with N=64 KV sliding window.

This is instructive not because Snapdragon X hardware maps cleanly to Intel NPU (it doesn't), but because it shows what real deployed decisions look like: encoder on accelerator, decoder split between accelerator and CPU, because the decode phase's structure (lots of memory, little compute per token) is where the accelerator's architecture breaks down.

Phi Silica also exposes the sliding-window KV cache technique: instead of keeping the full context KV in memory, keep only the most recent N tokens (here N=64). This trades recompute (re-running attention over discarded context) for memory bandwidth. For NPU where bandwidth is the constraint, this trade-off wins. The Llama 2 and DeepSeek-Distill benchmarks above use full KV caches. If they switched to sliding-window N=128, ITL would drop materially, but context awareness would degrade after 128 tokens. This is a tuning knob for the agent's working memory size.
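The bookkeeping behind a sliding window is simple: a fixed-capacity buffer that evicts the oldest entry on each append. The sketch below shows only that eviction behavior, with placeholder key/value vectors. `SlidingWindowKV` is a name introduced here and does not reflect how Phi Silica or any NPU runtime stores its cache.

```python
from collections import deque
from typing import Deque, List, Tuple

KVEntry = Tuple[int, List[float], List[float]]  # (token position, key, value)


class SlidingWindowKV:
    """Keep KV entries for only the most recent `window` tokens."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # deque(maxlen=...) drops the oldest entry automatically when full.
        self._entries: Deque[KVEntry] = deque(maxlen=window)

    def append(self, position: int, key: List[float], value: List[float]) -> None:
        self._entries.append((position, key, value))

    def positions(self) -> List[int]:
        return [pos for pos, _, _ in self._entries]

    def __len__(self) -> int:
        return len(self._entries)


cache = SlidingWindowKV(window=64)
for pos in range(100):
    cache.append(pos, [0.0], [0.0])  # placeholder vectors

print(len(cache), cache.positions()[0], cache.positions()[-1])
```

The memory footprint, and therefore the per-token read traffic, stays constant once the window fills, which is the property that matters on a bandwidth-bound device.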

Architecture-Specific Wisdom

Three things deserve to be nailed down because they're easy to get wrong:

Batching doesn't help on NPU for decode. On GPU, you can batch multiple independent decode streams and keep the compute pipeline full: token 1 from user A, token 1 from user B, token 1 from user C, all in parallel. On NPU with a fixed-shape pipeline and 136.5 GB/s bandwidth ceiling, batching adds more weight reads without adding more available bandwidth. Batching increases latency (because you're now serving multiple users sequentially) without increasing throughput (because you hit the bandwidth ceiling with a single-user stream). The practical result: always use batch size 1 for decode on Intel NPU.

LPDDR5X speed is shared, not divisible. The 136.5 GB/s includes all traffic: CPU instruction fetches, iGPU reads, NPU reads, system memory traffic. If the CPU is running code and the iGPU is running a concurrent task, the NPU's available bandwidth drops. If you want predictable NPU performance, you need to account for potential contention. The Phi Silica sliding-window approach partly exists to reduce bandwidth hunger, reducing contention sensitivity.

Compile-time overhead is real. The first invocation of a compiled model on NPU takes 30–60 seconds (from Chapter 1.2 cold-start benchmarks). Subsequent invocations take <3 seconds (warm start, cached to disk via CACHE_DIR). This cost is amortized over the model's lifetime in production, but for development and short-running agents, it's a gotcha. Always set CACHE_DIR to a persistent location; otherwise you pay the cold-start penalty on every process restart.

The Agent-Loop Latency Budget

A 5-step agent loop (where the agent reasons, takes an action, observes the result, and repeats) looks like this in latency terms:

Step 1 prefill: 512-token accumulated prompt, ~4 seconds TTFT

Step 1 decode: 64 output tokens (the agent's "Thought / Action / Observation"), ~10.5 seconds of decode (64 × ~164 ms ITL)

Steps 2–5: same pattern, context grows each iteration

Total: ~70–75 seconds for 5 steps at this prompt size

On iGPU: ~35 seconds.

This is the roofline for agent patterns on Intel NPU, and it's the why behind the Chapter 2 reasoning-architecture recommendations: ReAct (which is inherently loopy) doesn't fit the latency budget, but single-shot and cascade patterns do.
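The loop budget above can be reproduced with a few lines of arithmetic. This is a rough sketch under stated assumptions: prefill cost is held constant at ~4 s per step (so context growth across steps is ignored), decode uses the 6.10 tok/s figure, and the iGPU estimate simply applies the 2.1× decode speedup to the whole loop. `agent_loop_seconds` is a helper name introduced here.

```python
def agent_loop_seconds(steps: int, prefill_s: float,
                       output_tokens: int, itl_s: float) -> float:
    """Total latency for a loop of identical prefill + decode steps."""
    per_step = prefill_s + output_tokens * itl_s
    return steps * per_step


NPU_ITL_S = 1 / 6.10      # from the 6.10 tok/s benchmark (~164 ms/token)
PREFILL_S = 4.0           # assumed TTFT for a 512-token prompt
IGPU_SPEEDUP = 2.1        # decode speedup quoted above, applied to the whole loop

npu_total = agent_loop_seconds(steps=5, prefill_s=PREFILL_S,
                               output_tokens=64, itl_s=NPU_ITL_S)
igpu_total = npu_total / IGPU_SPEEDUP

print(f"NPU, 5 steps:  {npu_total:.1f} s")
print(f"iGPU, 5 steps: {igpu_total:.1f} s")
```

Because decode dominates each step, trimming output tokens per step moves the total far more than shaving prefill does.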

What This Section Bought You

You should now understand:

TTFT and ITL are two distinct metrics with different hardware bottlenecks: compute vs bandwidth

Published benchmarks: Llama 2 7B at 18.55 tok/s (TTFT 1.09 s), DeepSeek-Distill-8B at 6.10 tok/s

The roofline ceiling is 136.5 GB/s LPDDR5X shared across CPU/iGPU/NPU, yielding ~34 tok/s theoretical max for an 8B INT4 model

iGPU is 2.1× faster than NPU for decode, making hybrid prefill-on-NPU / decode-on-iGPU the natural pattern

Phi Silica shows real deployment wisdom: CPU tokenizer, embedding, LM head, and decode; NPU transformer blocks; sliding-window KV for bandwidth

Batch size 1 for decode on NPU; batching doesn't increase throughput, only latency

Compile-time overhead is 30–60 s cold, <3 s warm; set CACHE_DIR always

A 5-step ReAct loop takes ~70–75 seconds on NPU, which is the structural reason Chapter 2 recommends single-shot or cascade patterns

Chapter 2 now turns from hardware to software: given these latency budgets, how does model state (KV cache, attention memory) factor into design, and what reasoning architectures actually work within these constraints?
