
1.3 Latency, Throughput, and Hardware-Aware Patterns


Architecture and constraints from Chapters 1.1 and 1.2 set the ceiling. This section is about measuring it: what does a real model's latency profile look like on Intel hardware, how does that latency break down, and what does that imply for the agent loop design patterns Chapter 2 will develop?

We use two published benchmarks as anchors: Llama 2 7B on MLPerf Client v0.6, measured by Intel on a Core Ultra Series 1 processor, and DeepSeek-R1-Distill-Llama-8B INT4 on OpenVINO Model Hub. Both are real data points that set the floor and ceiling for what you can expect. We also look at what the structure of those numbers implies for agent design, and where the gaps in the public record sit.

The Two Key Latency Metrics: TTFT and ITL

Model inference latency on accelerators is traditionally quoted as a single number (e.g., "inference takes 50 ms"). That convention breaks down for LLMs, because they have two phases with radically different characteristics. The two numbers that matter for an interactive agent are TTFT and ITL.

Time-To-First-Token (TTFT) is the latency of the prefill phase: the time from when you send the prompt to when the model emits the first output token, which is how long the user waits before anything appears. The prompt is static, potentially long (hundreds of tokens), and the entire computation is on the critical path: you can't generate a second token until the first one exists. TTFT is compute-bound (matmul-heavy prefill on the full prompt).

Inter-Token Latency (ITL), also called per-token decode latency, is the latency of each subsequent token in the decode phase, which is how fast text streams after generation starts. The decoder sees only the new token slot plus the KV cache, and the computation is roughly constant per new token. ITL is memory-bandwidth-bound on NPU: one matmul per token, but every weight has to be streamed from DRAM.

On Intel Core Ultra, two published anchor benchmarks illustrate this split, both worth memorizing:

Llama 2 7B on MLPerf Client v0.6 (Intel internal, Core Ultra Series 1 Meteor Lake):

• TTFT at 128 input tokens: 1.09 seconds
• ITL (tokens 2+): ~54 ms/token
• Implied throughput: 18.55 tok/s sustained

DeepSeek-R1-Distill-Llama-8B INT4 on Core Ultra 7 NPU, from the OpenVINO Model Hub (Feb 2025; public benchmark, Intel NUC 14 Pro with Lunar Lake):

• Measured at 6.10 tok/s sustained decode, which is ~164 ms/token ITL
• TTFT is not published; extrapolate from the 8B size and INT4 quantization

The gap between Llama 2 (18.55 tok/s) and DeepSeek-Distill-8B (6.10 tok/s) is about 3×, and it is real. A naive explanation is parameter count: 8B vs 7B is about 14% more matmul. But 3× is not 14%, which means something structural is different. The honest answer: these are measured on different hardware revisions (Series 1 Meteor Lake vs Series 2 Lunar Lake is a 4× MACs gain), different quantization targets (Llama 2 at FP16? INT8?), and different workload assumptions (batch size, prompt length). The benchmarks are not apples-to-apples; treat them as reference ranges.
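The relationship between sustained throughput and per-token latency is simple arithmetic, and it is worth checking the quoted figures against each other. The sketch below is illustrative only: the helper name `itl_ms` is hypothetical, and the two constants are the benchmark figures quoted in this section.

```python
def itl_ms(tokens_per_second: float) -> float:
    """Per-token decode latency in milliseconds for a sustained decode rate."""
    return 1000.0 / tokens_per_second


# Benchmark figures quoted in this section.
LLAMA2_7B_TOK_S = 18.55    # MLPerf Client v0.6
DEEPSEEK_8B_TOK_S = 6.10   # OpenVINO Model Hub

llama_itl = itl_ms(LLAMA2_7B_TOK_S)
deepseek_itl = itl_ms(DEEPSEEK_8B_TOK_S)
throughput_gap = LLAMA2_7B_TOK_S / DEEPSEEK_8B_TOK_S

print(f"Llama 2 7B ITL:      {llama_itl:.1f} ms/token")
print(f"DeepSeek-Distill ITL: {deepseek_itl:.1f} ms/token")
print(f"Throughput gap:       {throughput_gap:.2f}x")
```

The computed gap is close to 3×, far larger than the roughly 14% difference in parameter count would explain on its own.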

On the DeepSeek data point, the same model on the same SoC's iGPU reaches 12.80 tok/s, so the iGPU is 2.1× faster than the NPU for 8B INT4 decode. Intel's CES 2026 marketing claim that Panther Lake NPU beats Jetson Orin AGX on DeepSeek-Llama-8B first-token latency is comparative-only; absolute milliseconds are not published.

Use 6 tok/s as your back-of-envelope number for an 8B INT4 model decoding on Intel NPU; the conservative 6.10 tok/s figure is the better anchor for reasoning-model workloads. Use 18 tok/s for a well-validated, smaller model like Llama 2 7B. The truth for any specific deployment is somewhere in between, and the only way to know is to measure on your hardware.

The Roofline: Hardware Limits

The sustainable throughput on Intel NPU is bounded by the LPDDR5X bandwidth ceiling from Chapter 1.1: 136.5 GB/s platform-wide, shared among CPU, iGPU, and NPU. No device gets the full 136.5 GB/s; the actual per-device quota depends on driver scheduling and competing loads.

For an 8B INT4 model:

• Weight memory: 4 GB (8B params × 4 bits/param / 8)
• Sustained throughput: 6.10 tok/s (from the published benchmark)
• DRAM read rate: 4 GB × 6.10 tok/s = 24.4 GB/s

This is roughly 18% of platform peak bandwidth. The NPU is not starving, but it's not saturating the bus either. The gap between 24.4 GB/s and 136.5 GB/s is scheduling overhead, driver latency, and contention from other agents on the SoC (CPU, iGPU). The roofline model says: if you could eliminate all contention and overhead, you'd hit bandwidth saturation at roughly (136.5 GB/s) / (4 GB model weight) = 34 tok/s, about 5.6× higher than what's measured. That gap is real and structural.
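The roofline arithmetic for an 8B INT4 model can be reproduced in a few lines. This is a back-of-envelope sketch, not a profiler: it assumes every decoded token streams the full weight set from DRAM once and ignores KV-cache traffic. The function names are hypothetical; the constants are the figures quoted in this section.

```python
def weight_gb(params: float, bits_per_param: int) -> float:
    """Model weight footprint in GB (decimal) for a given precision."""
    return params * bits_per_param / 8 / 1e9


def bandwidth_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Decode ceiling if each token reads all weights once at full bandwidth."""
    return bandwidth_gb_s / weights_gb


PLATFORM_BW_GB_S = 136.5   # LPDDR5X platform-wide figure from Chapter 1.1
MEASURED_TOK_S = 6.10      # DeepSeek-R1-Distill-Llama-8B INT4 on NPU

weights = weight_gb(8e9, 4)                       # 4.0 GB
dram_read_gb_s = weights * MEASURED_TOK_S         # implied DRAM read rate
utilization = dram_read_gb_s / PLATFORM_BW_GB_S   # fraction of platform peak
ceiling = bandwidth_ceiling_tok_s(PLATFORM_BW_GB_S, weights)
headroom = ceiling / MEASURED_TOK_S               # ceiling vs. measured

print(f"weights:     {weights:.1f} GB")
print(f"DRAM read:   {dram_read_gb_s:.1f} GB/s ({utilization:.0%} of peak)")
print(f"ceiling:     {ceiling:.1f} tok/s ({headroom:.1f}x measured)")
```

Swapping in a different parameter count or bit width shows how quickly the ceiling moves with model size and precision.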

The practical implication: you cannot expect sustained decode speeds above 15–20 tok/s on Lunar Lake NPU for reasonable 8B models. Going faster requires either a smaller model, lower precision (NF4, FP8 on NPU 5), or moving decode to the iGPU.

The TTFT-vs-ITL Distinction

Why does the regime split matter? Because the optimization techniques are different.

For TTFT, the matmul has the full prompt to chew on, so it's compute-bound. The NPU's MAC array shines here. Lunar Lake's 48 TOPS works in your favor; quantization to INT4 helps mostly by shrinking the weight memory traffic, not by speeding compute. Phi Silica reports TTFT 230 ms for short prompts (Snapdragon X Elite, but the architectural lesson generalizes) and Llama 2 7B on Intel Core Ultra NPU reports 1.09 s.

For ITL, every token requires streaming the entire weight tensor through the MAC array once. At 4 GB INT4 weights and Lunar Lake's 136.5 GB/s LPDDR5X ceiling, the theoretical maximum is 136.5 / 4 = 34 tok/s. The 6.10 tok/s observed equals about 18% of that ceiling, eaten by NPU scheduling quota, driver overhead, and the small constants in real workloads. You cannot quantize your way past this ceiling; you can only halve the weight memory by going from INT8 to INT4, which roughly halves decode latency.

The architectural lesson is direct: don't expect NPU decode to ever feel like a fast cloud LLM. Treat 6–20 tok/s as the design budget for any reasoning-style workload.

Comparing to iGPU

The same Core Ultra platform has an Xe2 iGPU (Lunar Lake) or Xe1 iGPU (Meteor Lake). The iGPU has its own path to memory and is substantially faster for decode workloads.

On the same hardware (Core Ultra Series 2), Llama 2 7B typically reaches ~40 tok/s on iGPU (measured by community benchmarks; Intel does not publish iGPU LLM numbers). That's roughly a 2.1× speedup over NPU for decode. For prefill (TTFT), the gap is wider: iGPU TTFT is typically 300–400 ms for a 128-token prompt, vs 1.09 seconds on NPU, a 3–4× gap.

The hybrid story emerges: if you can split the workload with prefill on NPU and decode on iGPU, you get 2.1× throughput for the large constant-cost phase (decode) and take the hit only on the one-time prefill. Chapter 3.1 builds the code for this pattern.
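To see how the NPU and iGPU figures quoted in this section combine, here is a rough end-to-end estimate for one response under three placements. It is a sketch under stated assumptions: the iGPU TTFT uses 0.35 s as the midpoint of the 300–400 ms range, the hybrid case ignores any cost of handing the KV cache from NPU to iGPU, and `response_time_s` is a hypothetical helper.

```python
def response_time_s(ttft_s: float, decode_tok_s: float, output_tokens: int) -> float:
    """Approximate wall-clock time for one response: prefill plus decode."""
    return ttft_s + output_tokens / decode_tok_s


# Llama 2 7B figures quoted in this section.
NPU_TTFT_S, NPU_TOK_S = 1.09, 18.55
IGPU_TTFT_S, IGPU_TOK_S = 0.35, 40.0   # assumed midpoint of 300-400 ms; ~40 tok/s

OUTPUT_TOKENS = 256  # illustrative response length

npu_only = response_time_s(NPU_TTFT_S, NPU_TOK_S, OUTPUT_TOKENS)
igpu_only = response_time_s(IGPU_TTFT_S, IGPU_TOK_S, OUTPUT_TOKENS)
# Hybrid: prefill on NPU, decode on iGPU (KV handoff cost assumed to be zero).
hybrid = response_time_s(NPU_TTFT_S, IGPU_TOK_S, OUTPUT_TOKENS)

for name, seconds in [("NPU only", npu_only), ("iGPU only", igpu_only), ("hybrid", hybrid)]:
    print(f"{name:10s} {seconds:6.2f} s")
```

Under these assumptions the hybrid split roughly halves the NPU-only time, while iGPU-only is still slightly faster on latency alone, so any case for keeping prefill on the NPU has to rest on something other than raw speed, such as power.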

      Cold Start

Cold start is dominated by the first compile, where the NPU plugin tiles the graph, decides SRAM allocation, and emits a binary blob. On Intel hardware the rule of thumb is:

Class                         Cold compile (no blob)                           Warm import (cached)
Small CV classifier           <1 s                                             ~100 ms
Whisper / MusicGen / Demucs   10–30 s (Audacity docs)                          1–3 s
3B–8B LLM INT4                30 s to several minutes (IPEX-LLM quickstart)    <3 s (Markaicode)

      The IPEX-LLM NPU quickstart documents the multi-minute first-run delay verbatim: "When running specific GGUF models on NPU for the first time, you might notice delays up to several minutes before the first token is generated." That's the cost of compiling the entire model graph into NPU-tiled blobs. Subsequent runs hit CACHE_DIR and skip compilation. OpenVINO 2025.4 specifically improved this by memory-mapping cached models in the Level Zero context to eliminate an in-memory copy.

      For M2M-100 specifically, the encoder compile is fast (a single static-shape encoder is a small graph) and the decoder with-past compile takes longer (more complex graph, more shapes to consider). Pad your first-run latency budget accordingly.

The user-facing lesson is the one Audacity gets right: tell the user. The plugin documentation says explicitly "10 to 30 seconds the first time you run this effect." That's the right pattern. Hiding cold-start by pretending it's instant produces an experience that feels broken on first use.

      The Cascade Pattern

The dominant agent-architecture pattern on Intel SoCs is the cascade: a small, cheap model handles the common case; a larger, expensive model handles only what the small one couldn't. This is not novel (cascades exist in cloud serving too), but the Intel single-die integration makes the device-routing version of the pattern especially natural.

The cleanest published Intel example is the Hugging Face × Intel "Qwen3-8B Agent" blog: Qwen3-8B INT4 target on iGPU, Qwen3-0.6B INT8 draft on the same iGPU, 1.3–1.4× speedup via speculative decoding for a smolagents-based reasoning agent. Intel motivates it as: "agentic applications rely on reasoning models that produce 'thinking aloud' traces… making inference speed critical to responsiveness." The pattern generalizes:

• Small-NPU + Big-iGPU: cheap classification or routing on NPU (5–20 ms per call, sustained low power), heavy generation on iGPU when the agent decides it's needed
• Small-NPU draft + Big-NPU target (speculative decoding): the small draft model proposes tokens that the larger target model verifies in parallel. OpenVINO 2025.4 sanctioned this with Phi-3-mini FastDraft on Hugging Face, though no Intel benchmark has been published for it yet
• Big-NPU prefill + Big-CPU decode: the Phi Silica pattern. NPU eats the compute-bound prompt; CPU streams the decode, reusing the NPU's KV cache

        The device-priority string AUTO:NPU,GPU,CPU is the most common cascade entry point in OpenVINO. The runtime selects the highest-priority compatible device per subgraph, falling back automatically when a device is unavailable or doesn't support a given op.
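The Small-NPU + Big-iGPU variant of the cascade comes down to a confidence-gated fallback. The sketch below shows only that control flow: the two model functions are toy stand-ins rather than real inference calls, and the `run_cascade` name, the 0.8 threshold, and the device labels are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeResult:
    text: str
    device: str  # which tier produced the answer


def run_cascade(
    prompt: str,
    small_model: Callable[[str], Tuple[str, float]],
    big_model: Callable[[str], str],
    threshold: float = 0.8,
) -> CascadeResult:
    """Try the cheap model first; escalate only when its confidence is low."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return CascadeResult(answer, "NPU")
    return CascadeResult(big_model(prompt), "GPU")


# Toy stand-ins for real models (hypothetical behavior, for illustration only).
def toy_small_model(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are easy intents the small model handles confidently.
    if len(prompt.split()) <= 4:
        return ("short-intent", 0.95)
    return ("unsure", 0.30)


def toy_big_model(prompt: str) -> str:
    return f"long-form answer to: {prompt}"


easy = run_cascade("turn on the lights", toy_small_model, toy_big_model)
hard = run_cascade(
    "summarize the last three meetings and draft a reply",
    toy_small_model,
    toy_big_model,
)
print(easy.device, hard.device)
```

In a real deployment the two callables would wrap models compiled for their respective devices; the point here is that the routing decision is ordinary application code and costs almost nothing compared with an inference call.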

Phi Silica as the Canonical Reference

The single best-documented production NPU agent is Microsoft's Phi Silica, a 3.3B-parameter Phi-3.5-mini derivative shipping in Copilot+ Windows. The published numbers (Windows Experience Blog, December 2024): TTFT 230 ms for short prompts, 20 tok/s throughput, 2K context (4K coming), 4.8 mWh per context-processing operation on Snapdragon X Elite.

What matters for this book is the architecture, which is exactly what we're recommending for M2M-100:

• Tokenizer, embedding, and LM head on CPU: these are lookup-bound or have shapes the NPU dislikes
• Transformer block on NPU: sustained matmul, the NPU's sweet spot
• KV cache held in CPU memory via a sliding window with N=64, escaping the static-shape constraint
• Long prompts decomposed into 64-token chunks for prefill, an early form of chunked prefill
• Speculative decoding with a smaller draft model, amplifying throughput

Click to Do, the Copilot+ UI affordance that uses Phi Silica, routes through fixed prompt templates. There is no learned router despite community speculation; Microsoft has been explicit about this. The lesson generalizes: for NPU agents, template the prompt, don't ask the model to also do prompt routing. Routing is cheap, NPU calls are not.

          What Phi Silica Tells Us

          Microsoft's Phi Silica is the closest public reference architecture for an NPU-targeted LLM, deployed on Snapdragon X (Qualcomm NPU, not Intel). The published numbers are TTFT 230 ms, 20 tok/s sustained on a 2K context window. The architecture is: CPU tokenizer + embedding + LM-head, NPU transformer blocks, CPU decode with N=64 KV sliding window.

          This is instructive not because Snapdragon X hardware maps cleanly to Intel NPU (it doesn't), but because it shows what real deployed decisions look like: encoder on accelerator, decoder split between accelerator and CPU, because the decode phase's structure (lots of memory, little compute per token) is where the accelerator's architecture breaks down.

Phi Silica also exposes the sliding-window KV cache technique: instead of keeping the full context KV in memory, keep only the most recent N tokens (here N=64). This trades recompute (re-running attention over discarded context) for memory bandwidth. For NPU, where bandwidth is the constraint, this trade-off wins. The Llama 2 and DeepSeek-Distill benchmarks above use full KV caches. If they switched to sliding-window N=128, ITL would likely drop materially, but context awareness would degrade after 128 tokens. This is a tuning knob for the agent's working memory size.
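The sliding-window idea can be illustrated with a bounded buffer that evicts the oldest entry once N tokens are held. This is a conceptual sketch only: the `SlidingWindowKV` class is hypothetical, the key/value entries are placeholder strings rather than tensors, and it does not reflect how Phi Silica or OpenVINO implement their caches.

```python
from collections import deque


class SlidingWindowKV:
    """Keeps key/value entries for only the most recent `window` tokens."""

    def __init__(self, window: int = 64):
        self.window = window
        # deque(maxlen=...) drops the oldest item automatically when full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = SlidingWindowKV(window=64)
for token_index in range(100):
    # Placeholders stand in for the per-token key and value tensors.
    cache.append(f"k{token_index}", f"v{token_index}")

oldest_key = cache.keys[0]
newest_key = cache.keys[-1]
print(len(cache), oldest_key, newest_key)
```

After 100 tokens the cache holds only the last 64, which is the source of both the bandwidth saving and the loss of older context.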

          Architecture-Specific Wisdom

          Three things deserve to be nailed down because they're easy to get wrong:

Batching doesn't help on NPU for decode. On GPU, you can batch multiple independent decode streams and keep the compute pipeline full: token 1 from user A, token 1 from user B, token 1 from user C, all in parallel. On NPU with a fixed-shape pipeline and 136.5 GB/s bandwidth ceiling, batching adds more weight reads without adding more available bandwidth. Batching increases latency (because you're now serving multiple users sequentially) without increasing throughput (because you hit the bandwidth ceiling with a single-user stream). The practical result: always use batch size 1 for decode on Intel NPU.

LPDDR5X speed is shared, not divisible. The 136.5 GB/s includes all traffic: CPU instruction fetches, iGPU reads, NPU reads, system memory traffic. If the CPU is running code and the iGPU is running a concurrent task, the NPU's available bandwidth drops. If you want predictable NPU performance, you need to account for potential contention. The Phi Silica sliding-window approach partly exists to reduce bandwidth hunger, reducing contention sensitivity.

Compile-time overhead is real. The first invocation of a compiled model on NPU takes 30–60 seconds (from Chapter 1.2 cold-start benchmarks). Subsequent invocations take <3 seconds (warm start, cached to disk via CACHE_DIR). This cost is amortized over the model's lifetime in production, but for development and short-running agents, it's a gotcha. Always set CACHE_DIR to a persistent location; otherwise you pay the cold-start penalty on every process restart.
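A minimal sketch of setting a persistent CACHE_DIR with the OpenVINO Python API follows. The helper names and the cache subdirectory name are hypothetical, and the model path is whatever OpenVINO IR file you already have. `compile_for_npu` imports OpenVINO lazily, so the configuration helper can be exercised on a machine without OpenVINO or an NPU.

```python
from pathlib import Path


def npu_compile_config(cache_root: str) -> dict:
    """Build a compile config that points CACHE_DIR at a persistent folder."""
    cache_dir = Path(cache_root).expanduser() / "openvino_npu_cache"  # hypothetical name
    cache_dir.mkdir(parents=True, exist_ok=True)
    return {"CACHE_DIR": str(cache_dir)}


def compile_for_npu(model_path: str, cache_root: str):
    """Compile a model for the NPU, reusing cached blobs when they exist."""
    import openvino as ov  # lazy import: only needed when actually compiling

    core = ov.Core()
    return core.compile_model(model_path, "NPU", npu_compile_config(cache_root))
```

Usage would look like `compiled = compile_for_npu("model.xml", "~/.cache/my_agent")`, where both paths are placeholders. The first call pays the full compile; later process starts that point at the same folder should take the warm path.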

What Hasn't Been Published

Honest gaps that should color how confidently you cite numbers in this book:

• No Intel-published Phi Silica numbers on Intel hardware. All Phi Silica metrics in circulation are from Snapdragon X Elite. Phi Silica reached Intel Copilot+ PCs through Windows Updates (KB5079266, KB5084176, KB5089866) during 2025, but the comparative performance data isn't in the public record.
• No published TTFT for DeepSeek-Distill-Llama-8B on Core Ultra Series 3; the CES 2026 claim is comparative-only against Jetson Orin AGX.
• No published M2M-100-on-NPU performance numbers of any kind: no tok/s, no TTFT, no memory footprint. M2M-100 is not in any OpenVINO Model Hub NPU benchmark.
• No published quantitative Phi-3-mini-on-NPU numbers from Intel/Hugging Face, despite multiple how-to walkthroughs.
• No published agent-loop or ReAct-loop latency benchmarks on Intel NPU. The estimates we'll produce in Chapter 2.3 are extrapolations from the two anchor benchmarks above, presented as such.

If you encounter precise numbers that aren't among the anchor benchmarks at the top of this section, they're almost certainly extrapolation, not measurement. Treat them accordingly.

The Agent-Loop Latency Budget

A 5-step agent loop where the agent reasons, takes an action, observes the result, and repeats looks like this in latency terms:

• Step 1 prefill: 512-token accumulated prompt, ~4 seconds TTFT
• Step 1 decode: 64 output tokens (the agent's "Thought / Action / Observation"), ~10.5 seconds of decode at ~164 ms ITL
• Steps 2–5: same pattern, context grows each iteration
• Total: ~70–75 seconds for 5 steps at this prompt size

On iGPU: ~35 seconds.

This is the roofline for agent patterns on Intel NPU, and it's the why behind the Chapter 2 reasoning-architecture recommendations: ReAct (which is inherently loopy) doesn't fit the latency budget, but single-shot and cascade patterns do.
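The 5-step agent-loop estimate can be reproduced with a few lines of arithmetic. The sketch makes a simplifying assumption: every step pays the same ~4 second prefill and decodes 64 tokens at 6.10 tok/s, so growth of the context across steps is ignored. The function name is hypothetical.

```python
def agent_loop_seconds(
    steps: int,
    ttft_s: float,
    output_tokens: int,
    decode_tok_s: float,
) -> float:
    """Total latency for a loop where each step is one prefill plus one decode."""
    per_step = ttft_s + output_tokens / decode_tok_s
    return steps * per_step


# Figures from this section: ~4 s TTFT at 512 tokens, 64 output tokens per step,
# 6.10 tok/s decode on NPU for an 8B INT4 model.
decode_per_step = 64 / 6.10
npu_total = agent_loop_seconds(steps=5, ttft_s=4.0, output_tokens=64, decode_tok_s=6.10)

print(f"decode per step: {decode_per_step:.1f} s")
print(f"5-step total:    {npu_total:.1f} s")
```

The result lands inside the 70–75 second range quoted for the NPU, and a real loop with a growing context would sit at or above it.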

          What This Section Bought You

          You should now understand:

• TTFT and ITL are two distinct metrics with different hardware bottlenecks: TTFT is compute-bound, ITL is memory-bandwidth-bound. Different regimes, different optimizations
• Published benchmarks: Llama 2 7B at 18.55 tok/s (TTFT 1.09 s), DeepSeek-Distill-8B at 6.10 tok/s
• The Lunar Lake roofline is 136.5 GB/s LPDDR5X shared across CPU/iGPU/NPU, yielding ~34 tok/s theoretical max for an 8B INT4 model; observed is ~6 tok/s, eaten by overhead
• iGPU decode is 2.1× faster than NPU decode on the same Core Ultra SoC for 8B models, making hybrid prefill-on-NPU / decode-on-iGPU the natural pattern; the NPU's win is performance per watt, not speed
• Cold start is dominated by first compile: 10–30 s for media models, minutes for LLMs without CACHE_DIR
• The cascade pattern is the Intel-native agent architecture: small-on-NPU + big-on-iGPU, or speculative decoding within a single device
• Phi Silica is the reference deployment: CPU tokenizer/embedding/LM-head + NPU transformer + CPU decode with sliding-window KV reuse for bandwidth, all published in the Windows Experience Blog
• Templated prompts beat learned routers for NPU-bound agents: every avoidable NPU call wastes the budget
• Batch size 1 for decode on NPU; batching doesn't increase throughput, only latency
• Compile-time overhead is 30–60 s cold, <3 s warm; set CACHE_DIR always
• A 5-step ReAct loop takes ~70–75 seconds on NPU, which is the structural reason Chapter 2 recommends single-shot or cascade patterns

Chapter 1 ends here. Chapter 2 now turns from hardware to software: given these latency budgets, how does model state (KV cache, attention memory) factor into the design, and which reasoning architectures actually work within these constraints?

