1.1 Understanding NPU Architecture

Before talking about agents on NPUs, we need to talk about the NPU itself — what makes it a distinct class of accelerator, and why the architectural choices ripple all the way up to how you design an agent loop. This book uses Intel Core NPU as its primary anchor and Facebook AI's M2M-100 as its primary worked-example model. Every concept in this chapter ladders back to those two.

What an NPU Actually Is

Neural Processing Units ~~(NPUs)~~ are ~~purpose-~~domain-specific accelerators built ~~silicon~~ for ~~one~~the ~~job:~~matrix-multiplication ~~running~~and activation workloads that dominate neural network ~~inference~~inference. ~~efficiently.~~They ~~Unlike~~sit between CPUs (~~general-purpose)~~which orare flexible but inefficient at dense matmul) and GPUs (~~massively parallel floating-point), NPUs~~which are ~~optimized~~powerful but power-hungry and latency-spiky). The NPU's pitch is sustained matrix throughput at a fraction of the GPU's power budget — useful for ~~the~~always-on, ~~specific~~on-device ~~math~~workloads ~~that~~where ~~dominates~~battery and thermal headroom matter more than peak FLOPS.

The internal recipe varies by vendor, but every modern MLNPU ~~workloads~~combines —three ~~matrix multiplication, convolution, and tensor reduction — at low power.~~

~~If you're going to build agents that run on NPUs, you need~~things: a ~~working~~MAC ~~model of what the hardware actually does well, and where it falls down. This chapter gives you that model.~~

Why NPUs Exist

Modern transformer-based models spend the overwhelming majority of their cycles in just a few operations: GEMMs (general matrix multiplies), attention computations, and elementwise activations. A general-purpose CPU wastes silicon on branch prediction, out-of-order execution, and cache hierarchies that don't help these workloads. A GPU is better, but pays a power cost for flexibility.

~~NPUs strip the design down to what matters:~~

~~Systolic arrays or dataflow engines~~ ~~for high-throughput matrix math~~ ~~Integer-first compute~~array (~~INT8,~~the ~~INT4)~~dense-matmul —workhorse, ~~sometimes~~typically ~~with~~1K–4K ~~FP16~~multiply-accumulate ~~fallbacks~~units per cycle), ~~On-~~on-chip SRAM sized to hold ~~model~~a tile of activations and weights orwithout ~~activations close~~round-tripping to DRAM, and a fixed-function or near-fixed-function activation pipeline (GELU, Sigmoid, Tanh, sometimes Softmax in hardware). What an NPU does not have, by design, is a general-purpose programmable shader array. Generality is the ~~compute~~GPU's job.

How the Major Families Compare

Five NPU families currently matter in commercial deployments, and they descend from recognizably different lineages:

~~Tight~~Intel ~~SoC~~Core ~~integration~~NPU is the only x86-native NPU and the focus of this book. It inherits its architecture from Movidius — ~~they~~Intel ~~share~~acquired ~~memory~~the company in 2016 — and pairs a MAC array with programmable SHAVE VLIW DSPs in the same compute engine. The SHAVEs handle transcendentals, type conversion, and FP32 fallback. Three generations exist: NPU 3720 (Meteor Lake, December 2023, ~11.5 TOPS INT8 claimed but measured at 9.5 TOPS at 1.16 GHz by Chips and Cheese), NPU 4 (Lunar Lake, September 2024, 48 TOPS INT8, on the same compute tile as the CPU and Xe2 iGPU), and NPU 5 (Panther Lake, CES 2026, 50 TOPS INT8, Intel 18A process, with native FP8 support).

Apple Neural Engine is a fixed-function tensor accelerator tightly bound to Core ML on macOS and iOS. The M4 family ships a 16-core Neural Engine at 38 TOPS. Developer access is gated through Core ML — there's no equivalent of OpenVINO that lets you reach the silicon directly.

Qualcomm Hexagon NPU descends from a phone DSP (Hexagon QDSP6) with a bolted-on Tensor Accelerator and Vector eXtensions. Snapdragon X Elite reaches 45 TOPS. The architecture is fundamentally optimized for power efficiency at phone scale; bringing it to laptops is a relatively recent push.

AMD XDNA descends from Xilinx Versal AI Engine tiles arranged in a 2D spatial array. XDNA 2 in Ryzen AI 300 (Strix Point) hits 50 TOPS INT8 plus 50 TOPS Block FP16. Unlike Intel and Apple, XDNA sits as a separate IP block rather than ~~living~~on ~~behind~~the main compute die — a ~~PCIe~~different ~~bus~~

integration model with implications for memory contention.

~~The~~Google ~~result~~Edge TPU is ~~dramatically~~a ~~better~~fixed-function ~~performance-per-watt~~systolic-array ASIC, primarily for ~~inference.~~Coral Adevices ~~modern~~and ~~mobile~~on-device TensorFlow Lite. It's a different deployment story (small embedded modules) and outside the scope of consumer-PC agents.

What's Distinctive About Intel

Four things set Intel apart, and each has practical consequences for agent design:

OS support spans Windows and Linux. The in-tree intel/linux-npu-driver makes Intel NPU ~~can~~usable ~~deliver~~on ~~40+~~Ubuntu ~~TOPS~~and ~~while~~other ~~drawing~~Linux ~~under~~distributions 5without ~~watts.~~proprietary blobs in user space. Apple's ANE is macOS-only; Qualcomm's NPU is largely Windows-on-Arm. This matters when your agent's deployment target isn't a consumer laptop — embedded kiosks, industrial edge boxes, server racks running Linux all become viable on Intel NPU.

Developer access is ungated. Every Core Ultra Series 2 or Series 3 SKU exposes the NPU. OpenVINO is Apache-2.0 open source. There's no equivalent of needing a Mac to develop for ANE or a specific Snapdragon SKU to access Hexagon at full capability.

Single-die integration on Lunar Lake and Panther Lake. CPU, Xe2 (or Xe3 on Panther Lake) iGPU, and NPU all sit on the same compute die, sharing an 8 MB memory-side L4 cache on Lunar Lake. AMD's XDNA, by contrast, is a separate block. The integration matters because agents that hop between devices — say, NPU for prefill and iGPU for decode — pay less for the hop on a single-die SoC.

OpenVINO ecosystem coverage. OpenVINO is the only unified toolkit that targets CPU, iGPU, NPU, dGPU (Arc), and Gaudi from the same source intermediate representation, with native Hugging Face Optimum-Intel integration. No competing vendor offers this breadth.

The MajorIntel NPU FamiliesGeneration Table

~~You'll~~The ~~encounter~~differences ~~several~~between NPU ~~architectures~~3, inNPU ~~practice, each with its own toolchain~~4, and ~~quirks:~~NPU 5 are large enough that "the Intel NPU" is not one target — it's three. Code that runs well on NPU 4 may fail to compile on NPU 3, and FP8 paths that work on NPU 5 won't exist on either predecessor.

~~Vendor~~Generation	~~NPU~~SoC / launch	~~Typical Platform~~NCEs	~~Toolchain~~SHAVE DSPs

INT8 TOPS (claimed) INT8 TOPS (measured) Distinctive feature ~~Apple~~NPU 3720 ~~Neural~~Meteor ~~Engine~~Lake, ~~(ANE)~~Dec 2023 ~~iPhone, Mac (M-series)~~2 ~~Core~~4 ML(Movidius SHAVE) ~11.5 9.5 @ 1.16 GHz First Intel NPU; INT8/FP16; 4 MB total scratchpad ~~Qualcomm~~NPU 4 ~~Hexagon~~Lunar ~~NPU~~Lake, Sep 2024 ~~Snapdragon SoCs~~6 ~~QNN,~~12 ~~SNPE~~(SHAVE-V, 4× wider) 48 back-calc ~1.95 GHz 4× MACs; NF4 weight compression; single-tile integration ~~Intel~~NPU 5 ~~NPU~~Panther ~~(Movidius/VPU~~Lake, ~~lineage)~~CES 2026 ~~Meteor~~3 ~~Lake,~~(each ~~Lunar~~2× ~~Lake~~wider) ~~OpenVINO~~not ~~AMD~~disclosed ~~XDNA / Ryzen AI~~50 ~~Ryzen AI laptops~~— ~~ONNX~~Native ~~Runtime~~FP8 +(E4M3/E5M2); ~~Vitis~~programmable LUT for ~~Google~~activations; ~~Edge~~Intel ~~TPU~~ ~~Coral devices, Pixel~~ ~~TFLite, Edge TPU Compiler~~ ~~MediaTek~~ ~~APU~~ ~~Dimensity SoCs~~ ~~NeuroPilot~~18A

~~These~~Per-engine, ~~differ~~the inMAC array is 2048 INT8 MAC/cycle on every generation. What changes is the count of engines, the SRAM, and the supported ~~operators,~~data ~~quantization~~types. ~~formats,~~NPU ~~memory~~3 ~~layouts,~~totals ~~and~~4,096 ~~how~~INT8 ~~they~~MAC/cycle; ~~handle~~NPU ~~unsupported~~4 ~~ops~~totals ~~(some~~12,288; ~~fall~~NPU 5 consolidates back to ~~CPU~~roughly ~~silently,~~12,288 ~~others~~with ~~fail~~wider ~~compilation).~~per-engine Aunits ~~model~~and ~~that~~the ~~runs~~same ~~beautifully~~Intel-18A area efficiency win.

The Copilot+ certification line (≥40 TOPS) draws cleanly across the table: NPU 4 and NPU 5 qualify; NPU 3 doesn't. If your agent depends on ~~one~~Phi ~~NPU may run poorly —~~Silica or ~~not~~other atCopilot+ ~~all~~OS —features, onyour ~~another.~~ ~~This portability problem~~floor is ~~one~~Lunar ofLake ~~the~~or ~~biggest practical challenges in NPU-based agent deployment.~~later.

The MentalHidden Model:Constraint: ComputeMemory Bandwidth

TOPS is ~~Cheap,~~the ~~Memory~~marketing number. Bandwidth is ~~Expensive~~the

~~The~~engineering number. Lunar Lake ships LPDDR5X-8533 on a 128-bit on-package bus, yielding 8,533 MT/s × 128 bits / 8 = 136.5 GB/s of total platform bandwidth shared among CPU, iGPU, and NPU. There is no private DRAM for the NPU and no published per-device bandwidth quota. This is the single most important ~~thing~~number for understanding why LLM decode tops out where it does on Intel hardware.

Intel does not say "decode is DRAM-bandwidth-bound on NPU" in marketing copy — that specific phrasing is a gap in vendor literature. The closest official analog is Microsoft's Phi Silica blog (Windows Experience Blog, December 2024): "Context processing involves intense parallel computation, mainly matrix multiplications, requiring high computational power. In contrast, the token iteration stage demands substantial memory for storing and accessing the KV cache for each token generation step. While it needs less computation, efficient memory access is crucial." That's the canonical quotable framing. The roofline becomes tangible on DeepSeek-R1-Distill-Llama-8B INT4: 4 GB of weights streamed at 6.10 tok/s equals about 24.4 GB/s of sustained DRAM read, roughly 18% of platform peak. The NPU does not saturate LPDDR5X; it saturates its scheduling-quota share plus driver overhead.

We'll return to ~~internalize~~this ceiling in Chapter 1.3 (where it sets the ITL floor) and Chapter 2.1 (where it sets the KV cache wall).

Why M2M-100 as the Worked Model

A book about agentic AI on NPUs needs a concrete model to keep referencing, and we'll use Facebook AI's M2M-100 — specifically the 418M and 1.2B variants. M2M-100 is a 100-language many-to-many translation model released by Meta in 2020 with three properties that make it a useful teaching example:

It is encoder-decoder seq2seq, which forces us to confront the asymmetric NPU/CPU partition that the rest of the field is converging on — encoders fit NPU ~~programming:~~constraints well (static shape, single forward pass), decoders do not (dynamic shape, autoregressive). M2M-100 makes the partition visible in code, not just in theory.

It uses ~~moving~~full ~~data~~multi-head ~~costs~~attention with no GQA or MQA — the architectural choice that defines its KV cache footprint. In Chapter 2.1 we'll show that M2M-100 1.2B has the same per-token decoder KV bandwidth as Phi-3-mini-3.8B, because Phi-3 uses GQA with one-quarter the KV heads. The KV cache wall is set by attention design, not parameter count, and M2M-100 makes this visceral.

It is MIT-licensed (unlike its successor NLLB-200, which is CC-BY-NC 4.0 and unusable in commercial products). The licensing distinction matters more than ~~computing~~the technical successor relationship.

It is not on ~~it.~~Intel's validated NPU model list.

~~NPUs~~This ~~have tiny on-chip memory compared to GPUs. When activations or weights spill to DRAM, you pay~~is a ~~10–100x~~feature ~~latency~~for ~~penalty~~our ~~and~~purposes. ~~burn~~Production ~~significant~~NPU ~~power.~~deployment ~~This~~guides ~~shapes~~usually ~~every~~anchor ~~design~~on ~~decision:~~

~~Smaller~~validated models ~~that~~(Llama, ~~fit~~Phi, inQwen) ~~on-chip~~where ~~SRAM~~everything ~~dominate~~works. ~~larger~~Real ~~models~~engineering ~~that don't~~ ~~Quantization isn't just about model size — it's about memory bandwidth~~ ~~Operator fusion (combining adjacent ops to avoid intermediate writes) is critical~~ ~~Batch size 1 is normal~~happens at the ~~edge,~~edge of the validated set, and M2M-100 is squarely there.

The honest deployment recommendation, which ~~means~~we'll ~~memory-bound~~build ~~workloads~~

up to,

~~When~~is ~~you're~~encoder ~~designing~~on ~~an agent, this means you can't just port a cloud-side architecture~~NPU and ~~expect~~decoder iton toCPU ~~work.~~or iGPU. The ~~trade-offs~~mechanics of that split are ~~different,~~the ~~and~~through-line of the ~~bottlenecks live in different places.~~book.

What This MeansSection forBought AgentsYou

~~Traditional~~You ~~agent~~should ~~architectures~~now assume abundant compute and memory: long context windows, multiple model calls per turn, tool selection by a large generalist model. NPUs invert these assumptions. To build agentic systems that actually perform on NPU hardware, you need to:understand:

- ~~Right-size the model~~ ~~to the hardware envelope (more on this in 1.2)~~

~~Design the reasoning loop~~ ~~to minimize round-trips to memory (Chapter 2)~~ ~~Push complexity into tools~~ ~~that run on the CPU or remotely (Chapter 3)~~ ~~Profile early and often~~ ~~— guessing about~~An NPU ~~performance~~ is a ~~fast~~domain-specific ~~way~~matmul toaccelerator ~~ship~~with ~~something~~on-chip ~~slow~~SRAM and fixed-function activations — not a general-purpose shader array Intel Core NPU is distinctive in OS coverage, ungated developer access, single-die integration, and ecosystem breadth There are three Intel NPU generations (~~Chapter~~3720, 4)4, 5) with materially different capabilities — "the Intel NPU" is not one target Memory bandwidth, not TOPS, is the binding constraint on Lunar Lake's 136.5 GB/s LPDDR5X-8533 M2M-100 is the worked model for the book — encoder-decoder with full MHA, MIT-licensed, deliberately at the edge of NPU validation

The ~~rest~~next ofsection ~~this~~moves ~~chapter~~from ~~builds~~architecture ~~out~~to ~~the~~consequence: ~~foundations:~~given ~~how~~these properties, what computational constraints fall out, and what does optimization look like for an encoder-decoder model on Intel NPU ~~constraints translate into model optimization choices, what latency and throughput actually mean on this class of hardware, and the design patterns that consistently work in production.~~specifically?

Next: 1.2 Computational Constraints & Model Optimization