1.1 Understanding NPU Architecture

Neural Processing Units (NPUs) are purpose-built silicon for one job: running neural network inference efficiently. Unlike CPUs (general-purpose) or GPUs (massively parallel floating-point), NPUs are optimized for the specific math that dominates modern ML workloads — matrix multiplication, convolution, and tensor reduction — at low power.

If you're going to build agents that run on NPUs, you need a working model of what the hardware actually does well, and where it falls down. This chapter gives you that model.

Why NPUs Exist

Modern transformer-based models spend the overwhelming majority of their cycles in just a few operations: GEMMs (general matrix multiplies), attention computations, and elementwise activations. A general-purpose CPU wastes silicon on branch prediction, out-of-order execution, and cache hierarchies that don't help these workloads. A GPU is better, but pays a power cost for flexibility.

NPUs strip the design down to what matters:

Systolic arrays or dataflow engines for high-throughput matrix math
Integer-first compute (INT8, INT4) — sometimes with FP16 fallbacks
On-chip SRAM sized to hold model weights or activations close to the compute
Tight SoC integration — they share memory with the CPU rather than living behind a PCIe bus
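
To make "integer-first compute" concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization followed by an integer dot product. It is plain Python for illustration only; the function names are made up for this example, and real toolchains use more elaborate schemes (per-channel scales, zero points, calibration data).

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (per-tensor, symmetric)."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    max_abs = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = max_abs / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int8_dot(weights, activations):
    """Dot product accumulated on integers, rescaled to float once at the end."""
    q_w, scale_w = quantize_symmetric(weights)
    q_a, scale_a = quantize_symmetric(activations)
    # Integer multiply-accumulate: the kind of work an NPU's MAC array is built for.
    accumulator = sum(w * a for w, a in zip(q_w, q_a))
    return accumulator * scale_w * scale_a


weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 2.0, -0.5, 0.1]
exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(weights, activations)
print(f"float result: {exact:.4f}  int8 result: {approx:.4f}")
```

The two results differ slightly because of rounding; that small error is the price paid for doing the bulk of the arithmetic on narrow integers.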

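The result is dramatically better performance-per-watt for inference. A modern mobile NPU can be rated at 40+ TOPS while drawing under 5 watts. Treat that as a peak figure, usually quoted for INT8; sustained throughput on real models is typically lower.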
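
The arithmetic behind a performance-per-watt figure is simple enough to sketch. The 40 TOPS and 5 W values below come from the text; the operation count and the utilization factor are assumptions chosen only to show the calculation, not measurements of any device.

```python
def tops_per_watt(peak_tops, power_watts):
    """Peak efficiency: trillions of operations per second per watt."""
    return peak_tops / power_watts


def energy_per_inference_joules(ops_per_inference, peak_tops, power_watts, utilization):
    """Rough energy estimate, assuming constant power and a fixed fraction of peak."""
    sustained_ops_per_second = peak_tops * 1e12 * utilization
    latency_seconds = ops_per_inference / sustained_ops_per_second
    return power_watts * latency_seconds


efficiency = tops_per_watt(peak_tops=40, power_watts=5)

# Assumed workload: 2 billion operations per inference at 25% of peak throughput.
energy = energy_per_inference_joules(
    ops_per_inference=2e9, peak_tops=40, power_watts=5, utilization=0.25
)
print(f"{efficiency:.1f} TOPS/W, about {energy * 1000:.2f} mJ per inference")
```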

The Major NPU Families

You'll encounter several NPU architectures in practice, each with its own toolchain and quirks:

Vendor	NPU	Typical Platform	Toolchain
Apple	Neural Engine (ANE)	iPhone, Mac (M-series)	Core ML
Qualcomm	Hexagon NPU	Snapdragon SoCs	QNN, SNPE
Intel	NPU (Movidius/VPU lineage)	Meteor Lake, Lunar Lake	OpenVINO
AMD	XDNA / Ryzen AI	Ryzen AI laptops	ONNX Runtime + Vitis AI
Google	Edge TPU	Coral devices, Pixel	TFLite, Edge TPU Compiler
MediaTek	APU	Dimensity SoCs	NeuroPilot

These differ in supported operators, quantization formats, memory layouts, and how they handle unsupported ops (some fall back to CPU silently, others fail compilation). A model that runs beautifully on one NPU may run poorly — or not at all — on another. This portability problem is one of the biggest practical challenges in NPU-based agent deployment.
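
The two fallback behaviours can be sketched as a simple placement pass. Everything here is hypothetical: the target names, the operator sets, and the `partition_ops` helper are invented for illustration and do not correspond to any vendor's real API or support matrix.

```python
# Hypothetical operator support sets. Real lists come from each vendor's
# toolchain documentation and change between releases.
SUPPORTED_OPS = {
    "npu_a": {"Conv2D", "MatMul", "Relu", "Add", "Softmax"},
    "npu_b": {"Conv2D", "MatMul", "Relu", "Add"},
}


class UnsupportedOpError(Exception):
    """Raised when strict compilation meets an operator the target lacks."""


def partition_ops(model_ops, target, allow_cpu_fallback=True):
    """Assign each operator to 'npu' or 'cpu', or fail if fallback is disabled."""
    supported = SUPPORTED_OPS[target]
    placement = []
    for op in model_ops:
        if op in supported:
            placement.append((op, "npu"))
        elif allow_cpu_fallback:
            placement.append((op, "cpu"))  # silent fallback: it runs, but slower
        else:
            raise UnsupportedOpError(f"{op} is not supported on {target}")
    return placement


def count_device_switches(placement):
    """Each NPU<->CPU transition implies extra data movement between devices."""
    return sum(1 for prev, cur in zip(placement, placement[1:]) if prev[1] != cur[1])


MODEL_OPS = ["MatMul", "Add", "Softmax", "MatMul", "Relu"]
for target in SUPPORTED_OPS:
    placement = partition_ops(MODEL_OPS, target)
    print(target, placement, "switches:", count_device_switches(placement))
```

Counting device switches is a useful habit: a single unsupported operator in the middle of a graph can cost two transitions, which is often where the "runs poorly" case comes from.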

The Mental Model: Compute is Cheap, Memory is Expensive

The single most important thing to internalize about NPU programming: moving data costs more than computing on it.

NPUs have very little fast local memory compared with the dedicated high-bandwidth memory of a discrete GPU. When activations or weights spill to DRAM, you pay a latency penalty on the order of 10–100x and burn significant power. This shapes every design decision:

Smaller models that fit in on-chip SRAM often outperform larger models that don't
Quantization isn't just about model size — it's about memory bandwidth
Operator fusion (combining adjacent ops to avoid intermediate writes) is critical
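Batch size 1 is the norm at the edge, so each weight fetched from memory is used only once per inference step, which makes workloads memory-bound rather than compute-bound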
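
A roofline-style estimate shows why batch size and quantization matter so much. The sketch below counts only weight traffic and ignores activations and caching; the 60 GB/s bandwidth figure is an assumption for illustration, and the 40 TOPS peak is the figure quoted earlier in this section.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights at a given precision."""
    return num_params * bits_per_weight / 8


def arithmetic_intensity(rows, cols, batch, bits_per_weight):
    """Operations per byte of weight traffic for a (batch x rows) @ (rows x cols) matmul."""
    ops = 2 * batch * rows * cols              # one multiply and one add per weight per batch item
    bytes_moved = weight_bytes(rows * cols, bits_per_weight)
    return ops / bytes_moved


def attainable_tops(intensity, peak_tops, bandwidth_gb_per_s):
    """Roofline: throughput is capped by compute or by memory, whichever is lower."""
    memory_bound_tops = intensity * bandwidth_gb_per_s * 1e9 / 1e12
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 40           # peak figure quoted in the text
BANDWIDTH_GB_PER_S = 60  # assumed DRAM bandwidth, for illustration only

for bits in (16, 8, 4):
    intensity = arithmetic_intensity(rows=4096, cols=4096, batch=1, bits_per_weight=bits)
    tops = attainable_tops(intensity, PEAK_TOPS, BANDWIDTH_GB_PER_S)
    print(f"{bits}-bit weights: {intensity:.1f} ops/byte, about {tops:.2f} TOPS attainable")
```

Under these assumptions the batch-1 matmul reaches only a small fraction of peak at any precision, and halving the bits per weight doubles the attainable throughput. That is the sense in which quantization is about bandwidth, not just model size.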

When you're designing an agent, this means you can't just port a cloud-side architecture and expect it to work. The trade-offs are different, and the bottlenecks live in different places.
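
Operator fusion, mentioned in the list above, is one place where the memory bottleneck is easy to see. The sketch below compares a bias-add followed by ReLU done in two passes against a fused single pass, and counts element reads and writes for a chain of elementwise ops. The helper names are made up for this example, and the traffic count ignores small operands such as the bias vector.

```python
def unfused_bias_relu(x, bias):
    """Two passes: the intermediate list stands in for a tensor written to memory."""
    intermediate = [v + b for v, b in zip(x, bias)]  # pass 1 writes n values
    return [max(0.0, v) for v in intermediate]        # pass 2 reads n, writes n


def fused_bias_relu(x, bias):
    """One pass: add and ReLU happen together, with no intermediate tensor."""
    return [max(0.0, v + b) for v, b in zip(x, bias)]


def elementwise_traffic(num_elements, num_ops, fused):
    """Element reads plus writes for a chain of elementwise ops over one tensor."""
    if fused:
        return 2 * num_elements            # read the input once, write the output once
    return 2 * num_elements * num_ops      # every op reads its input and writes its output


x = [-1.0, 0.5, 2.0, -0.25]
bias = [0.5, 0.5, -3.0, 0.5]
print(unfused_bias_relu(x, bias) == fused_bias_relu(x, bias))
print(elementwise_traffic(1_000_000, 3, fused=False),
      elementwise_traffic(1_000_000, 3, fused=True))
```

The results are identical; only the amount of data moved changes, and on memory-bound hardware that is what determines the runtime.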

What This Means for Agents

Traditional agent architectures assume abundant compute and memory: long context windows, multiple model calls per turn, tool selection by a large generalist model. NPUs invert these assumptions. To build agentic systems that actually perform on NPU hardware, you need to:

Right-size the model to the hardware envelope (more on this in 1.2)
Design the reasoning loop to minimize round-trips to memory (Chapter 2)
Push complexity into tools that run on the CPU or remotely (Chapter 3)
Profile early and often — guessing about NPU performance is a fast way to ship something slow (Chapter 4)
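
As a starting point for the profiling habit in the last item, here is a minimal latency harness using only the standard library. `run_inference` is a placeholder for whatever call invokes your model on the NPU; `fake_inference` is a stand-in workload so the sketch runs anywhere. Vendor profilers give far more detail (per-op timing, device placement), so treat this as a first check rather than a replacement for them.

```python
import statistics
import time


def profile_latency(run_inference, warmup_runs=5, timed_runs=50):
    """Time repeated calls and report median and tail latency in milliseconds."""
    # Warm-up runs absorb one-time costs such as compilation and cache fills.
    for _ in range(warmup_runs):
        run_inference()

    samples_ms = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference():
    # Stand-in workload; replace with a call into your actual runtime session.
    sum(i * i for i in range(10_000))


stats = profile_latency(fake_inference)
print({name: round(value, 3) for name, value in stats.items()})
```

Reporting the tail (p95, max) alongside the median matters for agents, because a reasoning loop makes many calls per turn and the slow ones add up.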

The rest of this chapter builds out the foundations: how NPU constraints translate into model optimization choices, what latency and throughput actually mean on this class of hardware, and the design patterns that consistently work in production.
