On the Edge: Agentic AI for Neural Processors

A practical guide to building intelligent agents optimized for NPU hardware. Learn how to design, implement, and deploy agentic systems that leverage neural processors for edge computing, with real-world patterns, performance optimization techniques, and production-ready strategies.

Preface

This book is about a narrow, awkward, increasingly important corner of applied AI: building agent...

Foundations of NPU-Optimized Agents

NPU architecture and computational constraints. Model quantization and optimization for NPU deplo...

1.1 Understanding NPU Architecture

Before talking about agents on NPUs, we need to talk about the NPU itself — what makes it a disti...

1.2 Computational Constraints & Model Optimization

The architecture from Section 1.1 sets the rules. This section is about playing inside them: what...

1.3 Latency, Throughput, and Hardware-Aware Patterns

The architecture and constraints from Sections 1.1 and 1.2 set the ceiling. This section is about...

1.4 The Accuracy Cost of Quantization

Section 1.2 laid out the quantization recipes the Intel NPU supports: INT8-sym, INT4-sym group-128 or...

1.5 Speculative Decoding

Section 1.3 established the bandwidth ceiling as the binding constraint on LLM decode: 136.5 GB/s...

Agent State & Decision-Making on Constrained Hardware

Managing agent context and memory within NPU limits. Efficient reasoning loops for low-latency in...

2.1 Context Windows and the Memory Wall

The agent's state — what it remembers from past steps and what it uses to make the next decision ...

2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

The distinction between KV cache (what you keep in memory) and KV cache bandwidth (what you strea...

2.3 Reasoning Loops Under Constraint

Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, ...

Tool Use & Integration Patterns

Designing lightweight tools for NPU-based agents. Async I/O and non-blocking integrations. Local ...

3.1 Designing Tools for NPU-Bound Agents

Chapter 2 ended with a claim: tool selection is a decision problem, not a search. This chapter go...

3.2 Local-NPU vs Cloud Tools: A Real Trade-Off Table

If the tool runs locally on the NPU, the orchestrator pays a one-time compile cost and then has p...

3.3 Multi-Device Orchestration on a Single SoC

A Core Ultra SoC isn't one engine — it's three. CPU cores for general-purpose work, an integrated...

3.4 Structured Outputs and Constrained Decoding

An agent is only as reliable as the parser that reads its output. Section 3.1 covered designing t...

Production Deployment & Observability

Model serving architectures (ONNX, TensorRT, TVM). Monitoring latency, throughput, and reliabilit...

4.1 Serving NPU Models with OVMS

A development-time compile_model(...) call is not a production deployment. Once your agent is rea...

4.2 Telemetry: What Works, What Doesn't, and What's Missing

You can't operate what you can't observe. NPU agents have a harder observability story than CPU- ...

4.3 A/B Testing, Canaries, and Hotswaps

Models drift. Drivers update. Quantization schemes change. The NPU you tested against in February...

4.4 Security and Privacy on the Edge

"It runs on the device, so it's private" is the marketing line. It's also a half-truth that has c...

Real-World Case Studies & Best Practices

Building customer-facing NPU agents (chatbots, assistants). Batch vs. streaming inference strateg...

5.1 What's Actually Shipping on Intel NPUs

The most useful thing a book like this can do, in its closing chapter, is be honest about what is...

5.2 A Worked Agentic Translation Assistant

This section ties the book together by walking through an end-to-end agentic translation assistan...

5.3 Anti-Patterns and Lessons

We've covered foundations, state, tools, deployment, and case studies. This final section pulls t...

Appendices

Glossary of terms and consolidated source references for the book.

Glossary

The book uses vocabulary from three communities that don't always agree on terms: Intel NPU hardw...

References

These are the primary sources for the technical claims in the book. Where multiple sources existe...