Preface
This book is about a narrow, awkward, increasingly important corner of applied AI: building agents that run on the Neural Processing Unit of a consumer-grade laptop. Specifically, on Intel Core Ultra hardware, using OpenVINO, with one eye on the production deployments shipping today and one eye on where the silicon is going next.
It exists because most of the agent-building literature lives in a world that doesn't match the constraints. Cloud-LLM agents enjoy GPU bandwidth measured in terabytes per second, context windows in the hundreds of thousands of tokens, and the freedom to bolt twenty steps of ReAct on top of any problem. Edge-LLM agents on NPU live in a different physics: bandwidth measured in hundreds of gigabytes per second shared across the SoC, context windows where 8K is still a validated-preview feature, and a decode budget that turns a five-step reasoning loop into a coffee break. The patterns that work in one regime fail quietly in the other, and the failures aren't always obvious until you've already chosen the wrong architecture.
The book's central argument, stated bluntly: NPU constraints reshape agent design more than developers expect, and most of the cloud-era playbook ports badly to on-device deployment. Single-shot beats ReAct. Encoder-decoder partitioning beats monolithic decoding. Local memory beats long context. The reasoning chain that fits the hardware tends to be flatter, shallower, and more declarative than what an unconstrained agent would do — and once you internalize the constraints, the design space simplifies rather than collapses.
Who This Book Is For
The reader I had in mind is a software engineer who has shipped at least one LLM-backed application — probably against a cloud API — and now needs to do the same thing without the cloud. The reasons vary: privacy (the data can't leave the device), latency (network round-trip is too slow), cost (cloud token bills are intolerable at scale), or just the fact that the product team committed to "AI on Copilot+ PCs" before anyone checked what that meant.
You should be comfortable with transformer architectures at a conceptual level: attention heads, KV cache, prefill vs decode, what tokenization is doing. You should be comfortable in Python and able to read PyTorch code, even if you don't write it from scratch. You don't need to be a Microsoft Visual Studio or .NET developer, though if you are, the deployment-side material in Chapter 4 will feel familiar.
You do not need to be an OpenVINO expert. The book assumes you can install it, run a hello-world inference, and look up the docs when something doesn't compile, but it doesn't assume you've spent six months tuning NNCF quantization recipes. Where OpenVINO-specific knowledge is load-bearing, the book teaches it.
You do not need to be a hardware person. Chapter 1.1 covers what an NPU is at the level you need to design agents around it. Deeper hardware questions — exactly how the SHAVE DSP's VLIW scheduling works, the precise MAC array tile geometry — are out of scope. They're fascinating; they're also not what you need to know to ship a working agent.
The Argument in One Paragraph
Intel Core Ultra NPUs are bandwidth-bound, statically-shaped, INT4-friendly accelerators with strong matmul throughput and weak operator coverage. Building an agent for them means accepting the bandwidth ceiling as a hard constraint on decode speed (10–20 tok/s is realistic, 30 tok/s is theoretical), the static-shape requirement as a hard constraint on dynamism (chunked prefill is your friend), and the quantization recipe as a hard constraint on accuracy (INT4-sym group-128 is the canonical setting). Within those constraints, single-shot reasoning tasks fly, encoder-decoder partitions across NPU and iGPU work well, and small models punch above their weight if you let them. Outside those constraints — long ReAct loops, dynamic batching, FP16 weights at scale — you'll fight the hardware all the way to production.
The book builds up to that one-paragraph claim, then shows you how to act on it.
How the Book Is Organized
Chapter 1: Foundations establishes the hardware. NPU architecture, the three Intel generations, computational constraints, model optimization, quantization quality, the latency profile, and speculative decoding as the main mitigation for the bandwidth ceiling. By the end of Chapter 1 you should have a working mental model of what the NPU is good at, what it isn't, and where the numbers come from.
Chapter 2: Agent State & Decision-Making turns to the software. Context windows and the memory wall, KV cache engineering, and the reasoning architectures that fit the constraints. Single-shot, plan-then-execute, and cascade-triage patterns dominate; ReAct is the cautionary tale.
Chapter 3: Tool Use & Integration Patterns is about the connections — how an NPU-bound agent reaches the outside world. Tool design, local-vs-cloud tool trade-offs, multi-device orchestration on the SoC, and structured outputs / constrained decoding for the agent-tool contract.
Chapter 4: Production Deployment & Observability handles the deployment surface. Serving with OVMS, telemetry, A/B testing and canaries, and the security/privacy model that on-device deployment actually buys you (which is less than the marketing suggests).
Chapter 5: Real-World Case Studies grounds the abstractions. What's actually shipping on Intel NPUs today, a worked agentic translation assistant from end to end, and the anti-patterns that recur across teams.
Chapter 6: Beyond Text — Audio and Vision Agents extends the thesis to multimodality. Whisper on NPU for speech, VLMs for vision, and the orchestration patterns that combine streams. The book's title says "agentic AI"; this chapter makes sure that promise covers more than text.
Appendices include a glossary of terms used throughout the book and a consolidated reference list keyed back to the chapters that depend on each source.
How to Read It
Linearly is fine. The chapters build on each other, and skipping forward risks losing the chain of constraints — Chapter 2's reasoning-architecture choices won't make sense without Chapter 1's bandwidth analysis, and Chapter 3's tool-design discussion assumes you've internalized Chapter 2's loop-budget math.
That said:
- If you're skeptical that the constraints are as binding as the book claims, read Chapter 1.3 first. The latency numbers either persuade you or they don't.
- If you already know OpenVINO well, you can skim Chapter 1.1 and 1.2 for vocabulary alignment and dive into Chapter 1.3 and beyond.
- If you're shipping a product on a deadline, Chapter 5 has the highest density of "what actually works" content.
- If you're at the "is this even feasible?" stage, the worked translation assistant in Chapter 5.2 is the fastest way to see a real end-to-end NPU agent.
The "What This Section Bought You" closer at the end of each section is a deliberate concession to non-linear reading: you can scan those bullets to decide which sections you need to read in detail.
On Honesty and Recency
The book was finalized in May 2026, against OpenVINO 2026.1, the Lunar Lake / NPU 4 generation as the realistic deployment target, and the Panther Lake / NPU 5 generation as the near-future direction. Specific version numbers will be stale within months. Specific benchmark figures will be displaced by newer measurements. The reasoning is meant to outlast the numbers.
Where the public record is thin — and on NPU agent design it is genuinely thin, with Intel and Microsoft having published almost nothing on multi-step agents directly — the book flags the gap rather than papering over it. You will see phrases like "no Intel-published benchmark exists for this" and "extrapolated, not measured" throughout. They mean what they say. If you build something whose numbers contradict the book's extrapolations, your numbers win.
The references appendix marks unverified-at-write-time sources with †. These are claims I believe but couldn't validate with a direct read of the canonical source at the time of writing. Don't quote them in production decisions without re-checking.
A Note on the Worked Example
The book uses M2M-100 — Facebook AI's 100-language translation model — as its primary worked-example model. This is a deliberate choice over Llama, Phi, or Qwen, all of which would have been easier. M2M-100 is encoder-decoder (forcing you to think about partitioning), uses full multi-head attention with no GQA (making its KV bandwidth costs visible), and is not on Intel's validated NPU model list (so you have to engineer the deployment rather than follow a tutorial). It is also MIT-licensed and therefore commercially usable, unlike its CC-BY-NC successor NLLB-200.
If your real production target is a decoder-only Llama derivative, the M2M-100 examples still apply — the constraints are the same, just easier on a model with GQA and unified architecture. If you're building a translation product specifically, M2M-100 is a viable starting point in itself.
Acknowledgments
The technical content owes a heavy debt to the OpenVINO documentation team at Intel, whose per-release notes and per-device plugin pages remain the only reliable primary source for NPU behavior. Microsoft's Phi Silica engineering blogs supplied the closest available analog to a Microsoft architectural reference for on-device LLM deployment. The independent benchmarks from Chips and Cheese gave the book honest counter-readings against Intel's marketing TOPS numbers. MLPerf Client v0.6 supplied the only externally-validated TTFT and ITL anchors used throughout.
Errors are mine.
Next: Chapter 1: Foundations — 1.1 Understanding NPU Architecture
No comments to display
No comments to display