Recently Updated Pages
4.4 Security and Privacy on the Edge
"It runs on the device, so it's private" is the marketing line. It's also a half-truth that has c...
3.4 Structured Outputs and Constrained Decoding
An agent is only as reliable as the parser that reads its output. Chapter 3.1 covered designing t...
1.5 Speculative Decoding
Chapter 1.3 established the bandwidth ceiling as the binding constraint on LLM decode: 136.5 GB/s...
1.4 The Accuracy Cost of Quantization
Chapter 1.2 laid out the quantization recipes Intel NPU supports: INT8-sym, INT4-sym group-128 or...
Preface
This book is about a narrow, awkward, increasingly important corner of applied AI: building agent...
References
These are the primary sources for the technical claims in the book. Where multiple sources existe...
Glossary
The book uses vocabulary from three communities that don't always agree on terms: Intel NPU hardw...
2.3 Reasoning Loops Under Constraint
Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, ...
2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing
The distinction between KV cache (what you keep in memory) and KV cache bandwidth (what you strea...
2.1 Context Windows and the Memory Wall
The agent's state — what it remembers from past steps and what it uses to make the next decision ...
1.3 Latency, Throughput, and Hardware-Aware Patterns
The architecture and constraints from Chapters 1.1 and 1.2 set the ceiling. This section is about...
1.2 Computational Constraints & Model Optimization
The architecture from Chapter 1.1 sets the rules. This section is about playing inside them: what...
1.1 Understanding NPU Architecture
Before talking about agents on NPUs, we need to talk about the NPU itself — what makes it a disti...
5.3 Anti-Patterns and Lessons
We've covered foundations, state, tools, deployment, and case studies. This final section pulls t...
5.2 A Worked Agentic Translation Assistant
This section ties the book together by walking through an end-to-end agentic translation assistan...
5.1 What's Actually Shipping on Intel NPUs
The most useful thing a book like this can do, in its closing chapter, is be honest about what is...
4.3 A/B Testing, Canaries, and Hotswaps
Models drift. Drivers update. Quantization schemes change. The NPU you tested against in February...
4.2 Telemetry: What Works, What Doesn't, and What's Missing
You can't operate what you can't observe. NPU agents have a harder observability story than CPU- ...
4.1 Serving NPU Models with OVMS
A development-time compile_model(...) call is not a production deployment. Once your agent is rea...
3.3 Multi-Device Orchestration on a Single SoC
A Core Ultra SoC isn't one engine — it's three. CPU cores for general-purpose work, an integrated...