1.3 Latency, Throughput, and Hardware-Aware Patterns

Architecture and constraints set the rules. Performance is what your users actually feel. Throughput tells you how busy the hardware is; latency tells you whether the user is still paying attention. This section is about the latency profile of Intel NPU specifically: what the published numbers say, what the structure of those numbers implies for agent design, and where the gaps in the public record sit.

TTFT and ITL on Intel Core NPU hardware.

The Two Latencies You Actually Care About

The two numbers that matter for an interactive agent are two distinct latency metrics. Confuse them and you'll optimize the wrong thing.

Time to First Token (TTFT) is how long the user waits before anything appears. It's dominated by the prefill phase: processing the input prompt and warming the KV cache. TTFT is what shapes the user's perception of responsiveness.

Inter-Token Latency (ITL), also called per-token decode latency or time per output token, is how fast text streams after generation starts. These are not the same regime: TTFT is compute-bound (matmul-heavy prefill on the full prompt); ITL is memory-bandwidth-bound (one matmul per token, but every weight has to be streamed from DRAM).
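Both metrics fall out of the same token stream, so they are easy to capture together. The following is a minimal sketch, assuming the runtime exposes generated tokens as a Python iterable; `measure_stream` and its `clock` parameter are illustrative names, not part of any SDK.

```python
import time
from statistics import mean
from typing import Callable, Iterable, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttft_seconds, mean_itl_seconds) for one streamed response."""
    start = clock()
    # Record one timestamp per token, taken right after the token arrives.
    stamps = [clock() for _ in tokens]
    if not stamps:
        raise ValueError("stream produced no tokens")
    ttft = stamps[0] - start
    # ITL is the gap between consecutive tokens once generation has started.
    gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
    itl = mean(gaps) if gaps else 0.0
    return ttft, itl
```

Passing the clock in as a parameter keeps the function testable with a fake timer; in real use the default `time.perf_counter` is enough.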

Two Intel-published anchor benchmarks, both worth memorizing:

DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU, from the OpenVINO Model Hub (Feb 2025): 6.10 tok/s decode, 163.10 ms per-token latency. The same model on the same SoC's iGPU reaches 12.80 tok/s. The iGPU is 2.1× faster than the NPU for 8B INT4 decode. Intel's CES 2026 marketing claim that Panther Lake NPU beats Jetson Orin AGX on DeepSeek-Llama-8B first-token latency is comparative-only; absolute milliseconds are not published.

Llama 2 7B on Core Ultra Series 2 NPU, from MLPerf Client v0.6: TTFT 1.09 s, throughput 18.55 tok/s. The 3× gap between this number and DeepSeek's 6.10 tok/s on the same hardware class reflects model-specific differences (Llama 2 7B vs 8B, more recent driver, possibly different KV quantization configuration). The conservative 6.10 tok/s figure is the better anchor for reasoning-model workloads.

Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.
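To turn these anchors into a user-facing wait time, add TTFT to the time needed to stream the remaining tokens. This is a rough sketch only: `response_seconds` is a hypothetical helper, the 200-token reply length is an arbitrary assumption, and no TTFT is published for the 6.10 tok/s anchor, so that case is shown as a decode-only lower bound.

```python
def response_seconds(ttft_s: float, decode_tok_s: float, n_tokens: int) -> float:
    """Estimate wall-clock time for a reply: first token, then steady decode."""
    if decode_tok_s <= 0 or n_tokens < 1:
        raise ValueError("need a positive decode rate and at least one token")
    # The first token arrives at TTFT; the other n-1 arrive at the decode rate.
    return ttft_s + (n_tokens - 1) / decode_tok_s


# Llama 2 7B anchor (MLPerf Client v0.6 figures quoted above), 200-token reply.
llama = response_seconds(ttft_s=1.09, decode_tok_s=18.55, n_tokens=200)

# DeepSeek 8B INT4 anchor: TTFT unpublished, so 0.0 gives a lower bound only.
deepseek_floor = response_seconds(ttft_s=0.0, decode_tok_s=6.10, n_tokens=200)

print(f"Llama 2 7B: {llama:.1f} s, DeepSeek 8B (decode only): {deepseek_floor:.1f} s")
```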

The TTFT-vs-ITL Distinction

Why does the regime split matter? Because the optimization techniques are different.

For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Lunar Lake NPU reports 1.09 s.

For ITL, every token requires streaming the entire weight tensor through the MAC array once. At 4 GB of INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical decode ceiling is 136.5 / 4 ≈ 34 tok/s. The 6.10 tok/s observed equals about 18% of that ceiling, eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. You cannot quantize your way past this ceiling; you can only halve the weight memory by going INT4, which roughly halves decode latency relative to INT8.
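The bandwidth arithmetic above is simple enough to keep as a helper. This sketch assumes weights are read exactly once per token and ignores KV cache traffic and quantization metadata (scales, zero points); the function names are illustrative.

```python
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate weight footprint in GB, ignoring scales and zero points."""
    return params_billion * bits / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode rate if every weight is read once per token."""
    return bandwidth_gb_s / weights_gb


# 8B model against the 136.5 GB/s LPDDR5X figure quoted in the text.
int4_ceiling = decode_ceiling_tok_s(136.5, weight_gb(8, 4))
int8_ceiling = decode_ceiling_tok_s(136.5, weight_gb(8, 8))

# Fraction of the INT4 ceiling reached by the published 6.10 tok/s figure.
observed_fraction = 6.10 / int4_ceiling

print(f"INT4 ceiling {int4_ceiling:.1f} tok/s, INT8 ceiling {int8_ceiling:.1f} tok/s")
print(f"observed / ceiling = {observed_fraction:.0%}")
```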

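ITL determines whether the streamed response feels fluid or stutters. TTFT and ITL have completely different bottlenecks:

Phase | Bottleneck | What helps
Prefill (TTFT) | Compute-bound: large matrix multiplies over the whole prompt | Higher TOPS, better parallelism, shorter prompts
Decode (ITL) | Memory-bound: KV cache and weight reads dominate | Faster memory, smaller models, KV cache optimization

A model that benchmarks well on TTFT can still feel terrible to use if its ITL is high, and vice versa. Measure both.

The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.

Cold Start

Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is:

Class | Cold compile (no blob) | Warm import (cached)
Small CV classifier | <1 s | ~100 ms
Whisper / MusicGen / Demucs | 10–30 s (Audacity docs) | 1–3 s
3B–8B LLM INT4 | 30 s to several minutes (IPEX-LLM quickstart) | <3 s (Markaicode)

The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token is generated." That's the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the NPU Level Zero context to eliminate an in-memory copy.

For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.

The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.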
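The compile-once, import-afterwards behaviour tied to CACHE_DIR can be sketched with the OpenVINO Python API. The helper names, the `ov_cache` directory, and the model path are assumptions for illustration; timing the call twice (cold, then warm) on your own hardware is the only way to get real numbers.

```python
import time
from pathlib import Path


def cache_config(cache_dir: str) -> dict:
    """Build the compile config and make sure the cache directory exists."""
    path = Path(cache_dir)
    path.mkdir(parents=True, exist_ok=True)
    return {"CACHE_DIR": str(path)}


def compile_timed(model_xml: str, device: str = "NPU", cache_dir: str = "ov_cache"):
    """Compile a model and return (compiled_model, seconds_taken)."""
    # Imported lazily so cache_config() works without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    start = time.perf_counter()
    compiled = core.compile_model(model_xml, device, cache_config(cache_dir))
    return compiled, time.perf_counter() - start
```

Calling `compile_timed("model.xml")` twice in a row, with a hypothetical model file, should show the gap between the cold-compile and warm-import columns for that model class.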

Why Decode is Memory-Bound

On an NPU running a transformer model, generating each new token requires reading every weight at least once and the entire KV cache for every attention head. The compute itself (multiplying a single-token query against the cached keys and values) finishes long before the data has been read from memory.

This is the regime most NPU agents live in. It has two important implications:

• Peak TOPS numbers are misleading. A 40-TOPS NPU might deliver 5–10% of that on decode for a small LLM, because it spends most of its time waiting on memory.
• Smaller weights help more than you'd expect. Going from INT8 to INT4 isn't just 2x less storage; it's roughly 2x faster decode, because you're moving half the bytes per token.

This is why the industry obsesses over 4-bit quantization at the edge. It's not vanity; it's the difference between a fluid response and one that stutters.

Cold Start vs. Steady State

NPU agents pay a one-time cost on cold start that doesn't appear in steady-state benchmarks:

• Model load from disk to memory (can be hundreds of MB)
• Compiler graph optimization (often cached, but invalidated by SDK or model changes)
• Weight unpacking and layout conversion for the NPU's preferred format
• First-run kernel compilation for some platforms

On a flagship mobile NPU, cold start for a 1B-parameter model is often in the 500 ms–2 s range. For a laptop NPU loading a 7B model, it can be 5–10 seconds.

You handle cold start with one or more of:

• Preloading the model when the app launches, not when the user asks a question
• Persistent runtime processes that keep the model resident across user sessions
• Streaming UI that surfaces "thinking…" feedback while the model loads
• Smaller fast-path models that respond immediately while a larger one warms in the background
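The first mitigation, preloading at app launch, can be sketched with a background thread. `PreloadedModel` and the `loader` callable are hypothetical; any function that returns a ready-to-use model object would fit.

```python
import threading
from typing import Any, Callable, Optional


class PreloadedModel:
    """Start loading at construction time; block only if asked too early."""

    def __init__(self, loader: Callable[[], Any]):
        self._ready = threading.Event()
        self._model: Any = None
        self._error: Optional[BaseException] = None
        # Daemon thread: loading never keeps the app alive on shutdown.
        threading.Thread(target=self._load, args=(loader,), daemon=True).start()

    def _load(self, loader: Callable[[], Any]) -> None:
        try:
            self._model = loader()
        except Exception as exc:  # surfaced later, on get()
            self._error = exc
        finally:
            self._ready.set()

    def is_ready(self) -> bool:
        """True once loading has finished, so the UI can drop its spinner."""
        return self._ready.is_set()

    def get(self, timeout: Optional[float] = None) -> Any:
        """Return the model, waiting up to `timeout` seconds if still loading."""
        if not self._ready.wait(timeout):
            raise TimeoutError("model still loading")
        if self._error is not None:
            raise self._error
        return self._model
```

Constructing the holder at launch and calling `get()` only when the first query arrives overlaps cold start with the time the user spends typing; `is_ready()` lets the UI decide whether to show "thinking…" feedback.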

        Three Hardware-Aware Design Patterns

        The constraints above shape a small set of architectural patterns that consistently work for NPU-based agents. You'll see these recur throughout the book.

        Pattern 1: The Cascade

        Use a small, fast model to decide whether the larger model needs to be invoked at all.

        user query → classifier (tiny, ~10ms)
                      ├── trivial / cached → templated response
                      ├── needs reasoning → NPU LLM
                      └── needs world knowledge → cloud LLM
        

        This pattern works because the routing decision is almost always cheaper than the answer. A 50M-parameter classifier can handle 80–90% of traffic in many agent domains (greetings, simple lookups, repeated queries) without ever waking the larger model.
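A minimal sketch of the routing step follows, assuming a classifier that returns one of three labels. The keyword-based `toy_classifier` stands in for the small model and the handlers are placeholders; all names here are illustrative.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def toy_classifier(query: str) -> str:
    """Stand-in for the tiny routing model; returns a route label."""
    q = query.lower().strip()
    if q in {"hi", "hello", "thanks"}:
        return "template"
    if any(word in q for word in ("latest", "today", "news")):
        return "cloud_llm"
    return "npu_llm"


def make_router(classify: Callable[[str], str], handlers: Dict[str, Handler]) -> Handler:
    """Return a function that sends each query to the handler for its label."""

    def route(query: str) -> str:
        label = classify(query)
        # Unknown labels fall back to the on-device model.
        handler = handlers.get(label, handlers["npu_llm"])
        return handler(query)

    return route
```

In a real deployment the three handlers would wrap a response template, the NPU model, and a cloud client; the router itself stays the same.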

        Pattern 2: Tool-First Reasoning

        Push computation off the NPU and into tools that run on the CPU or remotely.

The NPU model's job is to decide which tool to call and how to format the result for the user. The actual work (database lookups, calculations, retrieval, API calls) happens elsewhere. This keeps the NPU on what it's good at (language understanding and generation) and avoids stuffing world knowledge into a model that can't hold it.

        Chapter 3 covers this in detail, but the principle starts here: the NPU model should be the orchestrator, not the database.

        Pattern 3: Speculative Decoding

        Run a small "draft" model on the CPU or NPU that generates several tokens ahead, then verify them in parallel with the larger model. When the draft is right (often 60–80% of the time for natural language), you get multiple tokens per NPU forward pass.
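The draft-then-verify loop can be shown with a toy greedy version. This is a simplified sketch, not a production algorithm: models are plain functions from a token prefix to the next token, the per-position `target` calls stand in for what a real system does in one batched forward pass, and the extra "bonus" token a real verifier yields on full acceptance is omitted.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def speculative_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    n_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (tokens, simulated target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # Draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # One simulated target pass verifies the whole proposal.
        passes += 1
        for tok in proposal:
            expected = target(out)  # target's choice given accepted tokens
            out.append(expected)    # equals tok when the draft was right
            if tok != expected:
                break               # discard the rest of the proposal
    return out, passes
```

Because every emitted token is the target's own greedy choice, the output matches target-only decoding; the draft only changes how many target passes are needed.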

Speculative decoding can deliver 2–3x effective speedup on decode, at the cost of additional model load and orchestration complexity. It's increasingly standard in production NPU stacks, and worth knowing about even if you're not implementing it yourself.

A Profiling Discipline

If you take only one habit from this chapter, make it this: never reason about NPU performance from a spec sheet.

The actual workflow looks like:

• Define a representative workload: real prompts, real tool calls, real session lengths.
• Measure TTFT and ITL separately, p50 and p95, on each target device.
• Profile where time is spent: model compile, NPU forward pass, CPU pre/post-processing, tool execution.
• Identify the actual bottleneck before optimizing. Optimizing the NPU forward pass when 80% of latency is in your tokenizer is a waste of weeks.

Most NPU SDKs (Core ML Tools, OpenVINO's benchmark_app, QNN profiler, ONNX Runtime's profiler) emit per-operator timings. Use them. The intuition you build from real profiling data is worth more than any rule of thumb in this book, including the ones in this chapter.
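Reporting p50 and p95 for TTFT and ITL needs nothing beyond the standard library. A small sketch, assuming latency samples in milliseconds collected from repeated runs of a representative workload; `p50_p95` is an illustrative name.

```python
from statistics import quantiles
from typing import Sequence, Tuple


def p50_p95(samples_ms: Sequence[float]) -> Tuple[float, float]:
    """Return the median and 95th-percentile latency of the samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]
```

Keeping TTFT samples and ITL samples in separate lists, per device, matches the measure-both-separately discipline: a healthy p50 with a poor p95 usually points at cold start or scheduling contention rather than steady-state speed.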

The Cascade Pattern

The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel (cascades exist in cloud serving too), but the Intel single-die integration makes the device-routing version of the pattern especially natural.

The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: Qwen3-8B INT4 target on iGPU, Qwen3-0.6B INT8 draft on the same iGPU, 1.3–1.4× speedup via speculative decoding for a smolagents-based reasoning agent. Intel motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes:

• Small-NPU + Big-iGPU: cheap classification or routing on NPU (5–20 ms per call, sustained low power), heavy generation on iGPU when the agent decides it's needed
• Small-NPU draft + Big-NPU target (speculative decoding): the small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet
• Big-NPU prefill + Big-CPU decode: the Phi Silica pattern. NPU eats the compute-bound prompt; CPU streams the decode, reusing the NPU's KV cache

The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime picks the highest-priority device in the list that can run the model, falling back to the next one when a device is unavailable or can't compile it. Splitting a single model across devices by op support is a separate mechanism (the HETERO plugin).
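In code, the priority string is simply the device name passed to `compile_model`. A minimal sketch, with `auto_device` and `compile_auto` as illustrative helper names and the model path supplied by the caller:

```python
from typing import Iterable


def auto_device(priorities: Iterable[str]) -> str:
    """Build an AUTO device string such as 'AUTO:NPU,GPU,CPU'."""
    names = [p.strip().upper() for p in priorities if p.strip()]
    if not names:
        return "AUTO"  # let the runtime choose among all available devices
    return "AUTO:" + ",".join(names)


def compile_auto(model_xml: str, priorities: Iterable[str] = ("NPU", "GPU", "CPU")):
    """Compile a model with an explicit device priority list."""
    # Imported lazily so auto_device() works without OpenVINO installed.
    import openvino as ov

    return ov.Core().compile_model(model_xml, auto_device(priorities))
```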

          Phi Silica as the Canonical Reference

          The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.

          What matters for this book is the architecture, which is exactly what we're recommending for M2M-100:

• Tokenizer, embedding, and LM head on CPU: these are lookup-bound or have shapes the NPU dislikes
• Transformer block on NPU: sustained matmul, the NPU's sweet spot
• KV cache held in CPU memory via a sliding window with N=64, escaping the static-shape constraint
• Long prompts decomposed into 64-token chunks for prefill, an early form of chunked prefill
• Speculative decoding with a smaller draft model amplifying NPU throughput
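The 64-token prefill decomposition can be illustrated with a short helper. This is a sketch of the general idea only, not Microsoft's implementation: padding the final chunk to a fixed length and the `pad_id` value are assumptions made so that every chunk has the same static shape.

```python
from typing import List, Sequence, Tuple


def prefill_chunks(
    token_ids: Sequence[int], chunk: int = 64, pad_id: int = 0
) -> List[Tuple[List[int], int]]:
    """Split a prompt into fixed-length chunks.

    Returns (padded_chunk, n_real_tokens) pairs so the caller can mask padding.
    """
    chunks = []
    for start in range(0, len(token_ids), chunk):
        piece = list(token_ids[start:start + chunk])
        n_real = len(piece)
        # Pad the last chunk so every NPU call sees the same input shape.
        piece += [pad_id] * (chunk - n_real)
        chunks.append((piece, n_real))
    return chunks
```

A 150-token prompt becomes three NPU calls, the last one carrying 22 real tokens and 42 padding positions.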

Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation; Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.

            What Hasn't Been Published

            Honest gaps that should color how confidently you cite numbers in this book:

• No Intel-published Phi Silica numbers on Intel hardware. All Phi Silica metrics in circulation are from Snapdragon X Elite. Phi Silica reached Intel Copilot+ PCs through Windows Updates (KB5079266, KB5084176, KB5089866) during 2025, but the comparative performance data isn't in the public record.
• No published TTFT for DeepSeek-Distill-Llama-8B on Core Ultra Series 3; the CES 2026 claim is comparative-only against Jetson Orin AGX.
• No published M2M-100-on-NPU performance numbers of any kind: no tok/s, no TTFT, no memory footprint. M2M-100 is not in any OpenVINO Model Hub NPU benchmark.
• No published quantitative Phi-3-mini-on-NPU numbers from Intel/Hugging Face, despite multiple how-to walkthroughs.
• No published agent-loop or ReAct-loop latency benchmarks on Intel NPU. The estimates we'll produce in Chapter 2.3 are extrapolations from the two anchor benchmarks above, presented as such.

              If you encounter precise numbers that aren't in the table at the top of this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.

              What This Section Bought You

              You should now understand:

• TTFT is compute-bound, ITL is memory-bandwidth-bound: different regimes, different optimizations
• The Lunar Lake decode ceiling is ~34 tok/s theoretical for an 8B INT4 model; observed is ~6 tok/s, eaten by overhead
• iGPU decode is 2.1× faster than NPU decode on the same Core Ultra SoC for 8B models; the NPU's win is performance per watt, not speed
• Cold start is dominated by first compile: 10–30 s for media models, minutes for LLMs without CACHE_DIR
• The cascade pattern is the Intel-native agent architecture: small-on-NPU + big-on-iGPU, or speculative decoding within a single device
• Phi Silica is the reference deployment: CPU tokenizer/embedding/LM-head + NPU transformer + CPU decode with KV reuse, all published in the Windows Experience Blog
• Templated prompts beat learned routers for NPU-bound agents; every avoidable NPU call wastes the budget

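Wrapping Up Chapter 1

Chapter 1 ends here. You now have the foundations. To recap:

• NPUs are integer-first, memory-constrained accelerators built for inference, not training
• Three constraints govern every deployment: memory, operator coverage, numerical precision
• Quantization isn't optional: it's the entry ticket, and INT4 is the practical norm for LLMs at the edge
• Decode is memory-bound on NPUs, which makes weight size more important than peak TOPS
• TTFT and ITL are different problems; measure and optimize both separately
• Cascading, tool-first reasoning, and speculative decoding are the patterns that recur

The rest of the book builds on this. Chapter 2 turns from the model to the agent: given a system that can run M2M-100 on Intel NPU, how do we manage state, context, memory, and decision-making inside the constraints we've now mapped? Chapter 3 turns to tool design. Chapter 4 covers deployment, observability, and the operational reality of running agents in production. Chapter 5 closes with case studies from teams who've shipped real NPU agents and what they learned the hard way.

If you're going to do one thing before moving on: pick a target NPU, pick a candidate model, and actually measure TTFT and ITL on it. Everything that follows will land harder if you have those numbers in hand.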

