1.3 Latency, Throughput, and Hardware-Aware Patterns

For an agent, time is the user-facing currency. Throughput tells you how much work the hardware completes per second; latency tells you whether the user is still paying attention. This section gives you the vocabulary and the patterns to reason about both on NPU hardware.

The Two Latencies You Actually Care About

For generative agents, there are two distinct latency metrics — confuse them and you'll optimize the wrong thing.

Time to First Token (TTFT) is how long the user waits before anything appears. It's dominated by the prefill phase: processing the input prompt and populating the KV cache. TTFT is what shapes the user's perception of responsiveness.

Inter-Token Latency (ITL), sometimes called time per output token, is how long each subsequent token takes to generate. ITL determines whether the streamed response feels fluid or stutters.

These two have completely different bottlenecks:

Phase	Bottleneck	What helps
Prefill (TTFT)	Compute-bound — large matrix multiplies over the whole prompt	Higher TOPS, better parallelism, shorter prompts
Decode (ITL)	Memory-bound — KV cache and weight reads dominate	Faster memory, smaller models, KV cache optimization

A model that benchmarks well on TTFT can still feel terrible to use if its ITL is high, and vice versa. Measure both.
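As a minimal sketch of measuring both numbers, the helper below times a single streamed response. It assumes the runtime exposes generation as a lazy Python iterator that yields one token at a time, so that work only starts when the first token is requested. The names `measure_stream` and `clock` are illustrative and not part of any SDK.

```python
import time
from typing import Callable, Iterable, List, Tuple


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, List[float]]:
    """Return (ttft_seconds, inter_token_gaps_seconds) for one streamed response."""
    start = clock()
    ttft = None
    gaps: List[float] = []
    last = start
    for _ in tokens:
        now = clock()
        if ttft is None:
            ttft = now - start        # wait until the first token arrived
        else:
            gaps.append(now - last)   # gap between consecutive tokens
        last = now
    if ttft is None:
        raise ValueError("stream produced no tokens")
    return ttft, gaps
```

Wrapping the runtime's streaming generator in this helper and collecting results over many prompts gives separate TTFT and ITL distributions, which is what the two rows of the table call for.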

Why Decode is Memory-Bound

On an NPU running a transformer model, generating each new token requires reading every weight at least once and the entire KV cache for every attention head. The compute itself — multiplying a single-token query against the cached keys and values — finishes long before the data has been read from memory.

This is the regime most NPU agents live in. It has two important implications:

Peak TOPS numbers are misleading. A 40-TOPS NPU might deliver 5–10% of that on decode for a small LLM, because it spends most of its time waiting on memory.
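Smaller weights help more than you'd expect. Going from INT8 to INT4 isn't just 2x less storage: it can approach 2x faster decode when weight reads dominate, because you're moving half the weight bytes per token. The gain is smaller when KV cache traffic or dequantization overhead is significant.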

This is why the industry obsesses over 4-bit quantization at the edge. It's not vanity — it's the difference between a fluid response and one that stutters.
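The reasoning above can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes roughly two operations (a multiply and an add) per weight per token, and that every weight is read once per forward pass; it ignores activations and treats KV cache reads as an optional extra term. The 40-TOPS figure comes from the text, while the 50 GB/s bandwidth and the 1B-parameter model are made-up illustrative values, not measurements of any device.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_weight: float) -> float:
    """Operations performed per byte of weights read in one forward pass."""
    return 2 * tokens_per_pass / bytes_per_weight


def limiting_resource(tokens_per_pass: int, bytes_per_weight: float,
                      peak_ops_per_s: float, mem_bytes_per_s: float) -> str:
    """Return 'compute' or 'memory' using a simple roofline comparison."""
    # Operations the chip can execute per byte it can fetch from memory.
    machine_balance = peak_ops_per_s / mem_bytes_per_s
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_weight)
    return "compute" if intensity >= machine_balance else "memory"


def decode_tokens_per_s_bound(n_params: float, bits_per_weight: int,
                              mem_bytes_per_s: float,
                              kv_bytes_per_token: float = 0.0) -> float:
    """Upper bound on decode speed if memory traffic were the only limit."""
    bytes_per_token = n_params * bits_per_weight / 8 + kv_bytes_per_token
    return mem_bytes_per_s / bytes_per_token


PEAK_OPS = 40e12   # 40 TOPS, as in the text
BANDWIDTH = 50e9   # hypothetical 50 GB/s memory bandwidth

decode_regime = limiting_resource(1, 1.0, PEAK_OPS, BANDWIDTH)     # one token per pass
prefill_regime = limiting_resource(512, 1.0, PEAK_OPS, BANDWIDTH)  # 512-token prompt
int8_bound = decode_tokens_per_s_bound(1e9, 8, BANDWIDTH)          # 1B params, INT8
int4_bound = decode_tokens_per_s_bound(1e9, 4, BANDWIDTH)          # 1B params, INT4
```

Under these assumed numbers, single-token decode lands in the memory-bound regime and a 512-token prefill in the compute-bound one, and the decode ceiling doubles from 50 to 100 tokens per second when the weights shrink from INT8 to INT4. Real devices will fall short of these ceilings, so treat the functions as a way to reason about direction, not as a prediction.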

Cold Start vs. Steady State

NPU agents pay a one-time cost on cold start that doesn't appear in steady-state benchmarks:

Model load from disk to memory (hundreds of MB for small models, several GB for a 7B model)
Compiler graph optimization (often cached, but invalidated by SDK or model changes)
Weight unpacking and layout conversion for the NPU's preferred format
First-run kernel compilation for some platforms

On a flagship mobile NPU, cold start for a 1B-parameter model is often in the 500ms–2s range. For a laptop NPU loading a 7B model, it can be 5–10 seconds.

You handle cold start with one or more of:

Preloading the model when the app launches, not when the user asks a question
Persistent runtime processes that keep the model resident across user sessions
Streaming UI that surfaces "thinking…" feedback while the model loads
Smaller fast-path models that respond immediately while a larger one warms in the background
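The first strategy in the list, preloading at launch, might look like the sketch below. `PreloadedModel` and `loader` are hypothetical names: the loader stands in for whatever SDK call loads and compiles the model on your platform, and the class only handles running it in the background.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Optional


class PreloadedModel:
    """Starts loading a model in a background thread as soon as it is constructed."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="model-preload")
        self._future = pool.submit(loader)
        # No further tasks will be submitted; the worker exits once loading finishes.
        pool.shutdown(wait=False)

    def ready(self) -> bool:
        """True once loading has finished, whether it succeeded or failed."""
        return self._future.done()

    def get(self, timeout: Optional[float] = None) -> Any:
        """Block until the model is loaded, re-raising any loader error."""
        return self._future.result(timeout=timeout)
```

Constructing the object during app startup moves the cold-start cost ahead of the first question. The UI can poll `ready()` to decide whether to show "thinking…" feedback, and the request path calls `get()` only when it actually needs the model.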

Three Hardware-Aware Design Patterns

The constraints above shape a small set of architectural patterns that consistently work for NPU-based agents. You'll see these recur throughout the book.

Pattern 1: The Cascade

Use a small, fast model to decide whether the larger model needs to be invoked at all.

user query → classifier (tiny, ~10ms)
              ├── trivial / cached → templated response
              ├── needs reasoning → NPU LLM
              └── needs world knowledge → cloud LLM

This pattern works because the routing decision is almost always cheaper than the answer. A 50M-parameter classifier can handle 80–90% of traffic in many agent domains (greetings, simple lookups, repeated queries) without ever waking the larger model.
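A rough sketch of the routing step follows. The keyword-based `toy_classifier` is only a stand-in for the small trained classifier described above, and the three handlers are placeholder lambdas rather than real model calls; all names here are hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def toy_classifier(query: str) -> str:
    """Stand-in for the tiny classifier; a real one would be a trained model."""
    text = query.lower().strip()
    if text in {"hi", "hello", "thanks"}:
        return "trivial"
    if "latest" in text or "today" in text:
        return "world_knowledge"
    return "reasoning"


def make_router(classify: Callable[[str], str],
                handlers: Dict[str, Handler],
                fallback: str) -> Handler:
    """Build a function that sends each query to the handler for its label."""
    def route(query: str) -> str:
        label = classify(query)
        # Unknown labels go to the fallback handler instead of failing.
        handler = handlers.get(label, handlers[fallback])
        return handler(query)
    return route


route = make_router(
    toy_classifier,
    {
        "trivial": lambda q: "template: Hello! How can I help?",
        "reasoning": lambda q: f"npu_llm({q!r})",          # placeholder for the on-device LLM
        "world_knowledge": lambda q: f"cloud_llm({q!r})",  # placeholder for a cloud call
    },
    fallback="reasoning",
)
```

Sending unrecognized labels to the on-device model is one reasonable default, since it keeps the query local; a deployment that prioritizes answer quality over privacy or cost might fall back to the cloud path instead.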

Pattern 2: Tool-First Reasoning

Push computation off the NPU and into tools that run on the CPU or remotely.

The NPU model's job is to decide which tool to call and how to format the result for the user. The actual work — database lookups, calculations, retrieval, API calls — happens elsewhere. This keeps the NPU on what it's good at (language understanding and generation) and avoids stuffing world knowledge into a model that can't hold it.

Chapter 3 covers this in detail, but the principle starts here: the NPU model should be the orchestrator, not the database.
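To illustrate the orchestrator role, the sketch below assumes the model emits a small JSON object naming a tool and its arguments. That output format, the `TOOLS` registry, and both example tools are hypothetical; they only show where the work happens once the model has made its decision.

```python
import json
from typing import Any, Callable, Dict


def add(a: float, b: float) -> float:
    """Exact arithmetic done on the CPU instead of by the language model."""
    return a + b


def lookup_order(order_id: str) -> Dict[str, str]:
    """Hypothetical stand-in for a database or API call."""
    return {"order_id": order_id, "status": "shipped"}


TOOLS: Dict[str, Callable[..., Any]] = {"add": add, "lookup_order": lookup_order}


def run_tool_call(model_output: str) -> Any:
    """Parse the model's tool request and execute it outside the NPU."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("args", {}))
```

The model only has to produce a short string such as `{"tool": "add", "args": {"a": 2, "b": 3}}`; the result comes back from ordinary code, and the model's second job is to phrase it for the user.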

Pattern 3: Speculative Decoding

Run a small "draft" model on the CPU or NPU that generates several tokens ahead, then have the larger model verify all of them in a single forward pass. When the draft is right (often 60–80% of the time for natural language), you get multiple tokens per forward pass of the larger model.

Speculative decoding can deliver 2–3x effective speedup on decode, at the cost of additional model load and orchestration complexity. It's increasingly standard in production NPU stacks, and worth knowing about even if you're not implementing it yourself.
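The control flow is easier to see in code. The sketch below shows the greedy variant only, with models represented as plain functions from a token list to the next token; production stacks typically use a probabilistic acceptance rule for sampled outputs and score all drafted positions in one batched pass of the larger model, whereas here the target is called once per position for readability. All names are illustrative.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]


def speculative_step(context: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """One round of greedy speculative decoding; returns the tokens to append."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position in order.
    accepted: List[int] = []
    ctx = list(context)
    for tok in proposed:
        expected = target(ctx)
        if tok != expected:
            accepted.append(expected)  # keep the target's token, drop the rest
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft token matched, so the target's next token comes for free.
    accepted.append(target(ctx))
    return accepted


def generate(context: List[int], draft: NextToken, target: NextToken,
             max_new: int, k: int = 4) -> List[int]:
    """Repeat speculative rounds until max_new tokens have been produced."""
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(context) + max_new]
```

In this greedy form the output is identical to what the target model would have produced on its own, because every kept token is either confirmed or supplied by the target. The draft only changes how many tokens each round yields: between one and k + 1.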

A Profiling Discipline

If you take only one habit from this chapter, make it this: never reason about NPU performance from a spec sheet.

The actual workflow looks like:

Define a representative workload — real prompts, real tool calls, real session lengths.
Measure TTFT and ITL separately, p50 and p95, on each target device.
Profile where time is spent — model compile, NPU forward pass, CPU pre/post-processing, tool execution.
Identify the actual bottleneck before optimizing. Optimizing the NPU forward pass when 80% of latency is in your tokenizer is a waste of weeks.

Most NPU SDKs (Core ML Tools, OpenVINO's benchmark_app, QNN profiler, ONNX Runtime profiler) emit per-operator timings. Use them. The intuition you build from real profiling data is worth more than any rule of thumb in this book — including the ones in this chapter.
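For the parts of the pipeline that SDK profilers do not see, such as tokenization or tool execution, a few lines of instrumentation go a long way. The sketch below is one possible shape for it: `StageTimer` records wall-clock time per named stage, and `percentile` uses the nearest-rank definition to report p50 and p95. Both names, and the stage labels in the usage example, are illustrative.

```python
import math
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List, Sequence


class StageTimer:
    """Accumulates wall-clock durations for named pipeline stages."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append(self._clock() - start)

    def share(self) -> Dict[str, float]:
        """Fraction of total recorded time spent in each stage."""
        totals = {name: sum(values) for name, values in self.samples.items()}
        grand_total = sum(totals.values())
        return {name: total / grand_total for name, total in totals.items()}


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile, e.g. pct=50 for p50 or pct=95 for p95."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

A typical use wraps each step of a request, for example `with timer.stage("tokenizer"): ...` followed by `with timer.stage("npu_forward"): ...`, and then inspects `timer.share()` across a representative workload. If the tokenizer's share turns out to dominate, that is the bottleneck to fix first.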

Wrapping Up Chapter 1

You now have the foundations. To recap:

NPUs are integer-first, memory-constrained accelerators built for inference, not training
Three constraints govern every deployment: memory, operator coverage, numerical precision
Quantization isn't optional — it's the entry ticket, and INT4 is the practical norm for LLMs at the edge
Decode is memory-bound on NPUs, which makes weight size more important than peak TOPS
TTFT and ITL are different problems — measure and optimize both separately
Cascading, tool-first reasoning, and speculative decoding are the patterns that recur

The rest of the book builds on this. Chapter 2 dives into how to manage agent state — context, memory, and reasoning loops — within these constraints. Chapter 3 turns to tool design. Chapter 4 covers deployment, observability, and the operational reality of running agents in production. Chapter 5 closes with case studies from teams who've shipped real NPU agents and what they learned the hard way.

If you're going to do one thing before moving on: pick a target NPU, pick a candidate model, and actually measure TTFT and ITL on it. Everything that follows will land harder if you have those numbers in hand.
