1.3 Latency, Throughput, and Hardware-Aware Patterns

Architecture and constraints set the rules. Performance is what your users actually feel. This section is about the latency profile of Intel NPU specifically: what the published numbers say, what the structure of those numbers implies for agent design, and where the gaps in the public record sit.

TTFT and ITL on Intel Core NPU hardware.

The Two Latencies You Actually Care About

The two numbers that matter for an interactive agent are Time To First Token (TTFT), which is how long the user waits before anything appears, and Inter-Token Latency (ITL), also called per-token decode latency, which is how fast text streams after generation starts. These are not the same regime: TTFT is compute-bound (matmul-heavy prefill on the full prompt); ITL is memory-bandwidth-bound (one matmul per token, but every weight has to be streamed from DRAM).
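Because the two metrics live in different regimes, it helps to record them separately from the same streamed response. The following is a minimal sketch, assuming the runtime exposes generated tokens as a Python iterable; `measure_stream` and its injectable `clock` parameter are illustrative names, not part of any Intel or OpenVINO API.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (ttft_seconds, mean_itl_seconds) for one streamed response.

    The clock starts when this function is called, so pass in a lazy
    stream whose generation begins on the first iteration.
    """
    start = clock()
    first = None   # arrival time of the first token
    last = None    # arrival time of the most recent token
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if first is None:
        return None, None  # the stream produced nothing
    ttft = first - start
    # ITL is averaged over the gaps between tokens, so it needs at least two.
    itl = (last - first) / (count - 1) if count > 1 else None
    return ttft, itl
```

Collecting these two values over many requests gives distributions rather than single numbers, which is usually what a latency budget needs.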

Two Intel-published anchor benchmarks, both worth memorizing:

DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU, from the OpenVINO Model Hub (Feb 2025): 6.10 tok/s decode, 163.10 ms per-token latency. The same model on the same SoC's iGPU reaches 12.80 tok/s. The iGPU is 2.1× faster than the NPU for 8B INT4 decode. Intel's CES 2026 marketing claim that Panther Lake NPU beats Jetson Orin AGX on DeepSeek-Llama-8B first-token latency is comparative-only; absolute milliseconds are not published.

Llama 2 7B on Core Ultra Series 2 NPU, from MLPerf Client v0.6: TTFT 1.09 s, throughput 18.55 tok/s. The 3× gap between this number and DeepSeek's 6.10 tok/s on the same hardware class reflects model-specific differences (Llama 2 7B vs 8B, more recent driver, possibly different KV quantization configuration). The conservative 6.10 tok/s figure is the better anchor for reasoning-model workloads.

Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.
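To turn the two anchors into a user-facing wait time, a rough estimate is one TTFT followed by one decode step per remaining token. The sketch below assumes that simple model; the 200-token answer length is a hypothetical value, and because no TTFT is published for the 8B model, the Llama 2 TTFT is reused as a labeled placeholder.

```python
def response_seconds(ttft_s: float, decode_tok_s: float, n_tokens: int) -> float:
    """Wall-clock time to stream n_tokens: one TTFT, then n_tokens - 1 decode steps."""
    if n_tokens <= 0:
        return 0.0
    return ttft_s + (n_tokens - 1) / decode_tok_s


# Published anchors quoted above.
DECODE_8B_INT4_NPU = 6.10      # tok/s, DeepSeek-R1-Distill-Llama-8B, OpenVINO Model Hub
DECODE_LLAMA2_7B_NPU = 18.55   # tok/s, Llama 2 7B, MLPerf Client v0.6
TTFT_LLAMA2_7B_NPU = 1.09      # seconds, same MLPerf run

# ASSUMPTION: no TTFT is published for the 8B model, so the Llama 2 figure
# stands in as a placeholder. Replace it with your own measurement.
N = 200  # hypothetical answer length in tokens
slow = response_seconds(TTFT_LLAMA2_7B_NPU, DECODE_8B_INT4_NPU, N)
fast = response_seconds(TTFT_LLAMA2_7B_NPU, DECODE_LLAMA2_7B_NPU, N)
print(f"{N} tokens: roughly {fast:.0f} s to {slow:.0f} s")
```

The spread between the two results is the practical meaning of "somewhere in between": the same answer length can differ by tens of seconds depending on which anchor your deployment resembles.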

The TTFT-vs-ITL Distinction

Why does the regime split matter? Because the optimization techniques are different.

For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Lunar Lake NPU reports 1.09 s.

For ITL, every token requires streaming the entire weight tensor through the MAC array once. At 4 GB INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical upper bound is 136.5 / 4 ≈ 34 tok/s. The 6.10 tok/s observed equals about 18% of that ceiling, eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. You cannot quantize your way past this ceiling; you can only halve the weight memory by going INT4, which roughly halves decode latency relative to INT8.

The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.
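The bandwidth arithmetic behind that budget is short enough to keep as a script. This sketch assumes the simplest possible model, in which every decoded token reads all weights exactly once and nothing else competes for memory bandwidth; the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weight_gb: float) -> float:
    """Upper bound on decode rate if each token streams all weights once."""
    return bandwidth_gb_s / weight_gb


LPDDR5X_GB_S = 136.5      # Lunar Lake memory bandwidth quoted in the text
WEIGHTS_INT4_GB = 4.0     # 8B parameters at 4 bits
OBSERVED_TOK_S = 6.10     # published NPU decode rate

ceiling_int4 = decode_ceiling_tok_s(LPDDR5X_GB_S, WEIGHTS_INT4_GB)
ceiling_int8 = decode_ceiling_tok_s(LPDDR5X_GB_S, 2 * WEIGHTS_INT4_GB)
utilization = OBSERVED_TOK_S / ceiling_int4

print(f"INT4 ceiling: {ceiling_int4:.1f} tok/s")
print(f"INT8 ceiling: {ceiling_int8:.1f} tok/s")
print(f"observed share of INT4 ceiling: {utilization:.0%}")
```

Swapping in the bandwidth and weight size of a different SoC or model gives a quick sanity check on any decode figure you are quoted.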

Cold Start

Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is:

Class	Cold compile (no blob)	Warm import (cached)
Small CV classifier	<1 s	~100 ms
Whisper / MusicGen / Demucs	10–30 s (Audacity docs)	1–3 s
3B–8B LLM INT4	30 s to several minutes (IPEX-LLM quickstart)	<3 s (Markaicode)
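The cold and warm columns differ because the compiled blob can be stored and re-imported. The sketch below illustrates that idea with a content-addressed file cache. It is an assumption-laden toy, not OpenVINO's implementation: `load_compiled` and `fake_compile` are hypothetical names, and a real cache would also need to key on things like the target device and driver version.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def load_compiled(
    model_bytes: bytes,
    compile_fn: Callable[[bytes], bytes],
    cache_dir: Path,
) -> bytes:
    """Return the compiled blob, compiling only when no cached copy exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(model_bytes).hexdigest()
    blob_path = cache_dir / f"{key}.blob"
    if blob_path.exists():
        return blob_path.read_bytes()   # warm path: import the cached blob
    blob = compile_fn(model_bytes)      # cold path: the expensive compile
    blob_path.write_bytes(blob)
    return blob


# Usage with a stand-in "compiler" that only records how often it runs.
calls = []


def fake_compile(model: bytes) -> bytes:
    calls.append(1)
    return b"compiled:" + model


with tempfile.TemporaryDirectory() as tmp:
    first = load_compiled(b"model-v1", fake_compile, Path(tmp))
    second = load_compiled(b"model-v1", fake_compile, Path(tmp))

print(f"compile ran {len(calls)} time(s) for 2 loads")
```

Keying the cache on the model contents means a changed model file naturally misses the cache and triggers a fresh compile.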

The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token is generated." That's the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the Level Zero context to eliminate an in-memory copy.

For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.

The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.

The Cascade Pattern

The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel (cascades exist in cloud serving too), but the Intel single-die integration makes the device-routing version of the pattern especially natural.

user query → classifier (tiny, ~10ms)
              ├── trivial / cached → templated response
              ├── needs reasoning → NPU LLM
              └── needs world knowledge → cloud LLM

This pattern works because the routing decision is almost always cheaper than the answer. A 50M-parameter classifier can handle 80–90% of traffic in many agent domains (greetings, simple lookups, repeated queries) without ever waking the larger model.

Speculative decoding applies the same small-then-large idea at the token level. Run a small "draft" model on the CPU or NPU that generates several tokens ahead, then verify them in parallel with the larger model. When the draft is right (often 60–80% of the time for natural language), you get multiple tokens per NPU forward pass.
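The draft-then-verify loop is easier to see in code than in prose. What follows is a toy sketch of the greedy variant only, assuming both models are plain functions that map a token context to the next token. The two stand-in models are hypothetical, and a real runtime would verify all proposed positions in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy next-token function


def speculative_step(
    context: List[int], draft: NextToken, target: NextToken, k: int = 4
) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens accepted."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks each proposed position in order.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(context + accepted)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(context + accepted))
    return accepted


# Hypothetical stand-in models: the target counts upward modulo 10,
# and the draft agrees except that it guesses wrong after token 3.
def target_model(ctx: Sequence[int]) -> int:
    return (ctx[-1] + 1) % 10


def draft_model(ctx: Sequence[int]) -> int:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


print(speculative_step([0], draft_model, target_model))
print(speculative_step([5], draft_model, target_model))
```

In this greedy form the accepted tokens are exactly what the target alone would have produced, so the draft affects speed but not output.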

The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: Qwen3-8B INT4 target on iGPU, Qwen3-0.6B INT8 draft on the same iGPU, 1.3–1.4× speedup via speculative decoding for a smolagents-based reasoning agent. Intel motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes:

Small-NPU + Big-iGPU: cheap classification or routing on NPU (5–20 ms per call, sustained low power), heavy generation on iGPU when the agent decides it's needed
Small-NPU draft + Big-NPU target (speculative decoding): the small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet
Big-NPU prefill + Big-CPU decode: the Phi Silica pattern. NPU eats the compute-bound prompt; CPU streams the decode, reusing the NPU's KV cache
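The first variant in the list above, cheap routing in front of expensive generation, can be sketched as a small dispatcher. Everything here is a hypothetical stand-in: the keyword classifier takes the place of a small model on the NPU, and the handlers take the place of real model calls.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def make_cascade(
    classify: Callable[[str], str],
    handlers: Dict[str, Handler],
    fallback: str,
) -> Handler:
    """Build a router that sends each query to the handler its label selects."""

    def route(query: str) -> str:
        label = classify(query)
        # Unknown labels go to the fallback handler instead of failing.
        handler = handlers.get(label, handlers[fallback])
        return handler(query)

    return route


# Hypothetical cheap classifier: a keyword check instead of a real model.
def keyword_classifier(query: str) -> str:
    if query.lower().strip() in {"hi", "hello", "thanks"}:
        return "trivial"
    return "generate"


route = make_cascade(
    keyword_classifier,
    {
        "trivial": lambda q: "Hello! How can I help?",
        "generate": lambda q: f"[large model would answer: {q}]",
    },
    fallback="generate",
)

print(route("Hi"))
print(route("Summarize this report"))
```

Keeping the classifier and the handlers as plain callables means the same router works whether the expensive path is an iGPU model, an NPU model, or a remote call.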

The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime walks the list in priority order and compiles the model on the first device that can run it, falling back automatically when a device is unavailable or cannot compile the model. Splitting a single model across devices operation by operation is the job of the HETERO plugin, not AUTO.
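A priority string is plain text, so it can be assembled from whatever devices are actually present. The helper below is a sketch with a hypothetical name; it deliberately does not import OpenVINO. With OpenVINO installed, the assumption is that `available` would come from `openvino.Core().available_devices` and the result would be passed as the device name to `Core.compile_model`, but that part is not exercised here.

```python
from typing import Iterable, List


def auto_device_string(preferred: Iterable[str], available: Iterable[str]) -> str:
    """Build an AUTO priority string from the preferred devices that are present."""
    # Device names can carry an index suffix such as "GPU.0";
    # compare on the base name only.
    present = {name.split(".")[0] for name in available}
    chosen: List[str] = [device for device in preferred if device in present]
    if not chosen:
        raise ValueError("none of the preferred devices are available")
    return "AUTO:" + ",".join(chosen)


# Example: a machine that reports a CPU and one GPU but no NPU.
print(auto_device_string(["NPU", "GPU", "CPU"], ["CPU", "GPU.0"]))
```

Building the string from detected devices keeps the priority order explicit in code while avoiding a reference to hardware the machine does not have.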

Phi Silica as the Canonical Reference

The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.

What matters for this book is the architecture, which is exactly what we're recommending for M2M-100:

Tokenizer, embedding, and LM head on CPU: these are lookup-bound or have shapes the NPU dislikes
Transformer block on NPU: sustained matmul, the NPU's sweet spot
KV cache held in CPU memory via a sliding window with N=64, escaping the static-shape constraint
Long prompts decomposed into 64-token chunks for prefill, an early form of chunked prefill
Speculative decoding with a smaller draft model amplifying NPU throughput
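The 64-token prefill decomposition in the list above is, at its core, fixed-size slicing of the prompt. A minimal sketch, assuming the prompt is already a sequence of token ids; `prefill_chunks` is an illustrative name, and padding the final short chunk to a static shape is an assumed requirement that this sketch leaves out.

```python
from typing import Iterator, List, Sequence


def prefill_chunks(tokens: Sequence[int], chunk: int = 64) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `chunk` tokens, in prompt order."""
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    for start in range(0, len(tokens), chunk):
        yield list(tokens[start:start + chunk])


# Example: a hypothetical 150-token prompt.
prompt = list(range(150))
sizes = [len(piece) for piece in prefill_chunks(prompt)]
print(sizes)
```

Each chunk can then be fed through a graph compiled for one fixed input length, which is the point of choosing a constant chunk size.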

Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation; Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.
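Fixed templates need very little machinery. The sketch below uses the standard library's `string.Template`; the action names and template wording are hypothetical illustrations, not Microsoft's actual prompts.

```python
from string import Template

# Hypothetical actions and wording, for illustration only.
TEMPLATES = {
    "summarize": Template("Summarize the following text in three sentences:\n$text"),
    "rewrite": Template("Rewrite the following text in a formal tone:\n$text"),
}


def build_prompt(action: str, text: str) -> str:
    """Fill the fixed template for a UI action; no model call is involved."""
    try:
        template = TEMPLATES[action]
    except KeyError:
        raise ValueError(f"unknown action: {action!r}") from None
    return template.substitute(text=text)


print(build_prompt("summarize", "NPU decode is memory-bandwidth-bound."))
```

Because the UI action selects the template directly, the only NPU work is the generation itself.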

What Hasn't Been Published

Honest gaps that should color how confidently you cite numbers in this book:

No Intel-published Phi Silica numbers on Intel hardware. All Phi Silica metrics in circulation are from Snapdragon X Elite. Phi Silica reached Intel Copilot+ PCs through Windows Updates (KB5079266, KB5084176, KB5089866) during 2025, but the comparative performance data isn't in the public record.
No published TTFT for DeepSeek-Distill-Llama-8B on Core Ultra Series 3; the CES 2026 claim is comparative-only against Jetson Orin AGX.
No published M2M-100-on-NPU performance numbers of any kind: no tok/s, no TTFT, no memory footprint. M2M-100 is not in any OpenVINO Model Hub NPU benchmark.
No published quantitative Phi-3-mini-on-NPU numbers from Intel/Hugging Face, despite multiple how-to walkthroughs.
No published agent-loop or ReAct-loop latency benchmarks on Intel NPU. The estimates we'll produce in Chapter 2.3 are extrapolations from the two anchor benchmarks above, presented as such.

If you encounter precise numbers that are not among the published figures cited in this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.

What This Section Bought You

You should now understand:

TTFT is compute-bound, ITL is memory-bandwidth-bound: different regimes, different optimizations
The Lunar Lake decode ceiling is ~34 tok/s theoretical for an 8B INT4 model; observed is ~6 tok/s, eaten by overhead
iGPU decode is 2.1× faster than NPU decode on the same Core Ultra SoC for 8B models; the NPU's win is performance per watt, not speed
Cold start is dominated by first compile: 10–30 s for media models, minutes for LLMs without CACHE_DIR
The cascade pattern is the Intel-native agent architecture: small-on-NPU + big-on-iGPU, or speculative decoding within a single device
Phi Silica is the reference deployment: CPU tokenizer/embedding/LM-head + NPU transformer + CPU decode with KV reuse, all published in the Windows Experience Blog
Templated prompts beat learned routers for NPU-bound agents: every avoidable NPU call wastes the budget

Chapter 1 ends here. Chapter 2 turns from the model to the agent: given a system that can run M2M-100 on Intel NPU, how do we manage state, context, and decision-making inside the constraints we've now mapped?

If you're going to do one thing before moving on: pick a target NPU, pick a candidate model, and actually measure TTFT and ITL on it. Everything that follows will land harder if you have those numbers in hand.
