1.1 Understanding NPU Architecture
Neural Processing Units (NPUs) are purpose-built silicon for one job: running neural network inference efficiently. Unlike CPUs (general-purpose) or GPUs (massively parallel floating-point), NPUs are optimized for the specific math that dominates modern ML workloads — matrix multiplication, convolution, and tensor reduction — at low power.
If you're going to build agents that run on NPUs, you need a working model of what the hardware actually does well, and where it falls down. This chapter gives you that model.
Why NPUs Exist
Modern transformer-based models spend the overwhelming majority of their cycles in just a few operations: GEMMs (general matrix multiplies), attention computations, and elementwise activations. A general-purpose CPU wastes silicon on branch prediction, out-of-order execution, and cache hierarchies that don't help these workloads. A GPU is better, but pays a power cost for flexibility.
NPUs strip the design down to what matters:
- Systolic arrays or dataflow engines for high-throughput matrix math
- Integer-first compute (INT8, INT4) — sometimes with FP16 fallbacks
- On-chip SRAM sized to hold model weights or activations close to the compute
- Tight SoC integration — they share memory with the CPU rather than living behind a PCIe bus
The result is dramatically better performance-per-watt for inference. A modern mobile NPU can deliver 40+ TOPS (a peak figure, usually quoted for INT8 operations) while drawing under 5 watts.
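To make the integer-first point concrete, here is a minimal sketch in plain Python of what an INT8 multiply-accumulate path does: quantize floats to 8-bit integers with a single scale, accumulate the products as integers, then rescale once at the end. The function names and the symmetric per-tensor scheme are illustrative assumptions, not any vendor's API; real toolchains pick their own quantization schemes and run the integer loop in hardware.

```python
from typing import List, Sequence, Tuple


def quantize_symmetric(values: Sequence[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integers using one per-tensor scale (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values) or 1.0    # avoid a zero scale
    scale = max_abs / qmax
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def int_dot(qa: Sequence[int], qb: Sequence[int]) -> int:
    """Integer multiply-accumulate; hardware keeps this sum in a wide accumulator."""
    return sum(a * b for a, b in zip(qa, qb))


def quantized_dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Approximate the float dot product using INT8 operands."""
    qa, scale_a = quantize_symmetric(a)
    qb, scale_b = quantize_symmetric(b)
    # One floating-point rescale at the end recovers the real-valued result.
    return int_dot(qa, qb) * scale_a * scale_b


if __name__ == "__main__":
    a = [0.5, -1.0, 0.25]
    b = [1.0, 2.0, -0.5]
    exact = sum(x * y for x, y in zip(a, b))
    print("exact:", exact, "quantized:", quantized_dot(a, b))
```

The inner loop touches only small integers, which is why the multiply-accumulate units can be made small and power-efficient. The price is a small approximation error that depends on how well the scale fits the data.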
The Major NPU Families
You'll encounter several NPU architectures in practice, each with its own toolchain and quirks:
| Vendor | NPU | Typical Platform | Toolchain |
|---|---|---|---|
| Apple | Neural Engine (ANE) | iPhone, Mac (M-series) | Core ML |
| Qualcomm | Hexagon NPU | Snapdragon SoCs | QNN, SNPE |
| Intel | NPU (Movidius/VPU lineage) | Meteor Lake, Lunar Lake | OpenVINO |
| AMD | XDNA / Ryzen AI | Ryzen AI laptops | ONNX Runtime + Vitis AI |
| Google | Edge TPU | Coral devices, Pixel | TFLite, Edge TPU Compiler |
| MediaTek | APU | Dimensity SoCs | NeuroPilot |
These differ in supported operators, quantization formats, memory layouts, and how they handle unsupported ops (some fall back to CPU silently, others fail compilation). A model that runs beautifully on one NPU may run poorly — or not at all — on another. This portability problem is one of the biggest practical challenges in NPU-based agent deployment.
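The two failure modes described above, silent CPU fallback versus a hard compile error, can be sketched with a toy partitioner. Everything here is hypothetical: the operator set `NPU_SUPPORTED`, the model `MODEL_OPS`, and the functions `partition` and `check_strict` are stand-ins for what a real vendor compiler does internally, and real supported-operator lists come from each toolchain's documentation.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical operator set for an imaginary NPU (assumption for illustration).
NPU_SUPPORTED: Set[str] = {"Conv2D", "MatMul", "Add", "Relu", "Softmax"}

# Hypothetical model, flattened to a linear sequence of operator names.
MODEL_OPS: List[str] = ["Conv2D", "Relu", "TopK", "MatMul", "Softmax"]


def partition(ops: Sequence[str], supported: Set[str]) -> List[Tuple[str, List[str]]]:
    """Fallback-style behavior: group consecutive ops into NPU and CPU segments."""
    segments: List[Tuple[str, List[str]]] = []
    for op in ops:
        device = "npu" if op in supported else "cpu"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(op)      # extend the current segment
        else:
            segments.append((device, [op]))  # start a new segment
    return segments


def check_strict(ops: Sequence[str], supported: Set[str]) -> None:
    """Strict-style behavior: refuse to compile if any op is unsupported."""
    unsupported = sorted({op for op in ops if op not in supported})
    if unsupported:
        raise ValueError(f"unsupported operators: {', '.join(unsupported)}")


if __name__ == "__main__":
    segments = partition(MODEL_OPS, NPU_SUPPORTED)
    for device, ops in segments:
        print(device, ops)
    # Each boundary between segments implies a device hand-off.
    print("device transitions:", len(segments) - 1)
```

A single unsupported operator in the middle of the graph splits it into three segments and adds two device hand-offs, so counting segments is a cheap first check on how well a model maps to a given target.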
The Mental Model: Compute is Cheap, Memory is Expensive
The single most important thing to internalize about NPU programming: moving data costs more than computing on it.
NPUs have tiny on-chip memory compared to GPUs. When activations or weights spill to DRAM, you pay a 10–100x latency penalty and burn significant power. This shapes every design decision:
- Smaller models that fit in on-chip SRAM often beat larger models that don't, on both latency and power
- Quantization isn't just about model size — it's about memory bandwidth
- Operator fusion (combining adjacent ops to avoid intermediate writes) is critical
- Batch size 1 is normal at the edge, so each weight is fetched for a single use and workloads tend to be memory-bound
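A back-of-the-envelope roofline estimate shows why these points follow from the memory model. The sketch below compares the time to compute a matrix multiply with the time to stream its operands from DRAM. The device figures are assumptions: 40 TOPS echoes the peak number quoted earlier, and the 60 GB/s DRAM bandwidth is a hypothetical value chosen for illustration. It also assumes every weight and activation crosses the DRAM interface exactly once.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Device:
    peak_ops_per_s: float     # peak arithmetic throughput, operations per second
    dram_bytes_per_s: float   # sustained DRAM bandwidth, bytes per second


# Hypothetical edge NPU: 40 TOPS peak, 60 GB/s DRAM bandwidth (assumed figures).
EDGE_NPU = Device(peak_ops_per_s=40e12, dram_bytes_per_s=60e9)


def estimate_matmul(device: Device, batch: int, k: int, n: int,
                    weight_bytes: float, act_bytes: float = 1.0
                    ) -> Dict[str, Union[float, str]]:
    """Estimate a (batch x k) @ (k x n) matmul under a simple roofline model."""
    ops = 2 * batch * k * n                          # one multiply + one add per MAC
    bytes_moved = k * n * weight_bytes + batch * (k + n) * act_bytes
    compute_s = ops / device.peak_ops_per_s
    memory_s = bytes_moved / device.dram_bytes_per_s
    return {
        "ops_per_byte": ops / bytes_moved,           # arithmetic intensity
        "compute_s": compute_s,
        "memory_s": memory_s,
        "bound": "memory" if memory_s > compute_s else "compute",
    }


if __name__ == "__main__":
    cases = [
        ("batch 1, FP16 weights", 1, 2.0),
        ("batch 1, INT8 weights", 1, 1.0),
        ("batch 1024, INT8 weights", 1024, 1.0),
    ]
    for label, batch, weight_bytes in cases:
        r = estimate_matmul(EDGE_NPU, batch, 4096, 4096, weight_bytes)
        print(f"{label}: {r['bound']}-bound, "
              f"compute {r['compute_s'] * 1e6:.1f} us, memory {r['memory_s'] * 1e6:.1f} us")
```

Under these assumptions the batch-1 cases are limited by memory traffic rather than arithmetic, and halving the bytes per weight roughly halves the estimated time. Only the large-batch case does enough work per byte to become compute-bound.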
When you're designing an agent, this means you can't just port a cloud-side architecture and expect it to work. The trade-offs are different, and the bottlenecks live in different places.
What This Means for Agents
Traditional agent architectures assume abundant compute and memory: long context windows, multiple model calls per turn, tool selection by a large generalist model. NPUs invert these assumptions. To build agentic systems that actually perform on NPU hardware, you need to:
- Right-size the model to the hardware envelope (more on this in 1.2)
- Design the reasoning loop to minimize round-trips to memory (Chapter 2)
- Push complexity into tools that run on the CPU or remotely (Chapter 3)
- Profile early and often — guessing about NPU performance is a fast way to ship something slow (Chapter 4)
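As a starting point for the last item, here is a minimal latency harness using only the Python standard library. It is a sketch, not a replacement for vendor profilers: `profile_latency` and `fake_inference` are hypothetical names, and `fake_inference` is a CPU stand-in for whatever call actually invokes your model on the NPU.

```python
import statistics
import time
from typing import Callable, Dict, List


def profile_latency(run_once: Callable[[], object],
                    warmup: int = 5, iters: int = 50) -> Dict[str, float]:
    """Time repeated calls to run_once and report latency in milliseconds."""
    # Warm-up runs absorb one-time costs such as lazy initialization and cache fills.
    for _ in range(warmup):
        run_once()

    samples_ms: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference() -> int:
    """Stand-in workload; replace with the real inference call."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    stats = profile_latency(fake_inference)
    print({name: round(value, 3) for name, value in stats.items()})
```

Reporting the median alongside a tail percentile matters for agents, because a reasoning loop that makes several model calls per turn tends to feel as slow as its worst call.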
The rest of this chapter builds out the foundations: how NPU constraints translate into model optimization choices, what latency and throughput actually mean on this class of hardware, and the design patterns that consistently work in production.
Next: 1.2 Computational Constraints & Model Optimization