1.1 Understanding NPU Architecture
Before talking about agents on NPUs, we need to talk about the NPU itself — what makes it a distinct class of accelerator, and why the architectural choices ripple all the way up to how you design an agent loop. This book uses Intel Core NPU as its primary anchor and Facebook AI's M2M-100 as its primary worked-example model. Every concept in this chapter ladders back to those two.
What an NPU Actually Is
Neural Processing Units (NPUs) are domain-specific accelerators built for one job: the matrix-multiplication and activation workloads that dominate neural network inference. They sit between CPUs (which are flexible but inefficient at dense matmul) and GPUs (which are powerful but power-hungry and latency-spiky). The NPU's pitch is sustained matrix throughput at a fraction of the GPU's power budget, which is useful for always-on, on-device workloads where battery and thermal headroom matter more than peak FLOPS.
The internal recipe varies by vendor, but every modern NPU is built around a multiply-accumulate (MAC) array.
If you're going to build agents that run on NPUs, you need a working model of what the hardware actually does well, and where it falls down. This chapter gives you that model.
Why NPUs Exist
Modern transformer-based models spend the overwhelming majority of their cycles in just a few operations: GEMMs (general matrix multiplies), attention computations, and elementwise activations. A general-purpose CPU wastes silicon on branch prediction, out-of-order execution, and cache hierarchies that don't help these workloads. A GPU is better, but pays a power cost for flexibility.
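NPUs strip the design down to what matters, including tight SoC integration: they share memory rather than living behind a PCIe bus. The result is dramatically better performance-per-watt for inference. A modern mobile NPU can deliver 40+ TOPS while drawing under 5 watts.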
How the Major Families Compare
Five NPU families currently matter in commercial deployments, and they descend from recognizably different lineages:
Intel Core NPU is the only x86-native NPU and the focus of this book. It inherits its architecture from Movidius (Intel acquired the company in 2016) and pairs a MAC array with programmable SHAVE VLIW DSPs in the same compute engine. The SHAVEs handle transcendentals, type conversion, and FP32 fallback. Three generations exist: NPU 3720 (Meteor Lake, December 2023, ~11.5 TOPS INT8 claimed but measured at 9.5 TOPS at 1.16 GHz by Chips and Cheese), NPU 4 (Lunar Lake, September 2024, 48 TOPS INT8, on the same compute tile as the CPU and Xe2 iGPU), and NPU 5 (Panther Lake, CES 2026, 50 TOPS INT8, Intel 18A process, with native FP8 support).
Apple Neural Engine is a fixed-function tensor accelerator tightly bound to Core ML on macOS and iOS. The M4 family ships a 16-core Neural Engine at 38 TOPS. Developer access is gated through Core ML — there's no equivalent of OpenVINO that lets you reach the silicon directly.
Qualcomm Hexagon NPU descends from a phone DSP (Hexagon QDSP6) with a bolted-on Tensor Accelerator and Vector eXtensions. Snapdragon X Elite reaches 45 TOPS. The architecture is fundamentally optimized for power efficiency at phone scale; bringing it to laptops is a relatively recent push.
AMD XDNA descends from Xilinx Versal AI Engine tiles arranged in a 2D spatial array. XDNA 2 in Ryzen AI 300 (Strix Point) hits 50 TOPS INT8 plus 50 TOPS Block FP16. Unlike Intel and Apple, XDNA sits as a separate IP block on the main compute die.
Google Edge TPU is a fixed-function systolic-array ASIC, primarily for Coral devices and on-device TensorFlow Lite. It's a different deployment story (small embedded modules) and outside the scope of consumer-PC agents.
What's Distinctive About Intel
Four things set Intel apart, and each has practical consequences for agent design:
OS support spans Windows and Linux. The in-tree intel/linux-npu-driver makes Intel NPU usable on Ubuntu and other Linux distributions without proprietary blobs in user space. Apple's ANE is macOS-only; Qualcomm's NPU is largely Windows-on-Arm. This matters when your agent's deployment target isn't a consumer laptop: embedded kiosks, industrial edge boxes, and server racks running Linux all become viable on Intel NPU.
Developer access is ungated. Every Core Ultra Series 2 or Series 3 SKU exposes the NPU. OpenVINO is Apache-2.0 open source. There's no equivalent of needing a Mac to develop for ANE or a specific Snapdragon SKU to access Hexagon at full capability.
Single-die integration on Lunar Lake and Panther Lake. CPU, Xe2 (or Xe3 on Panther Lake) iGPU, and NPU all sit on the same compute die, sharing an 8 MB memory-side L4 cache on Lunar Lake. AMD's XDNA, by contrast, is a separate block. The integration matters because agents that hop between devices — say, NPU for prefill and iGPU for decode — pay less for the hop on a single-die SoC.
OpenVINO ecosystem coverage. OpenVINO is the only unified toolkit that targets CPU, iGPU, NPU, dGPU (Arc), and Gaudi from the same source intermediate representation, with native Hugging Face Optimum-Intel integration. No competing vendor offers this breadth.
The Intel NPU Generation Table
The differences between NPU 3, NPU 4, and NPU 5 are large enough that "the Intel NPU" is not one target; it's three. Code that runs well on NPU 4 may fail to compile on NPU 3, and FP8 paths that work on NPU 5 won't exist on either predecessor.
Per engine, the MAC array is 2,048 INT8 MAC/cycle on NPU 3 and NPU 4. What changes across generations is the count of engines, the SRAM, and the supported data types. NPU 3 totals 4,096 INT8 MAC/cycle; NPU 4 totals 12,288; NPU 5 consolidates back to roughly 12,288 with wider per-engine units and the Intel 18A area-efficiency win.
More generally, NPU targets differ in supported operators, quantization formats, memory layouts, and how they handle unsupported ops (some fall back to CPU silently, others fail compilation). A model that runs beautifully on one NPU may run poorly, or not at all, on another. This portability problem is one of the biggest practical challenges in NPU-based agent deployment.
The Copilot+ certification line (≥40 TOPS) draws cleanly across the table: NPU 4 and NPU 5 qualify; NPU 3 doesn't. If your agent depends on Phi Silica or other Copilot+ OS features, your floor is Lunar Lake or later.
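The relationship between MAC counts, clock speed, and headline TOPS can be checked with a few lines of arithmetic. The sketch below assumes the common convention that one MAC counts as two operations (a multiply and an add); the text does not state that convention, but it reproduces the measured 9.5 TOPS figure for NPU 3 at 1.16 GHz. The function names are illustrative, and the NPU 4 clock is derived from the quoted TOPS rather than taken from a spec sheet.

```python
COPILOT_PLUS_MIN_TOPS = 40.0  # certification threshold quoted in the text


def peak_int8_tops(macs_per_cycle: int, clock_ghz: float) -> float:
    """Peak INT8 TOPS, assuming one MAC counts as two operations."""
    ops_per_second = 2 * macs_per_cycle * clock_ghz * 1e9
    return ops_per_second / 1e12


def implied_clock_ghz(tops: float, macs_per_cycle: int) -> float:
    """Clock (GHz) that a quoted TOPS figure implies for a given MAC count."""
    return tops * 1e12 / (2 * macs_per_cycle) / 1e9


def meets_copilot_plus(tops: float) -> bool:
    """True if a TOPS figure clears the Copilot+ line."""
    return tops >= COPILOT_PLUS_MIN_TOPS


# NPU 3: 4,096 INT8 MAC/cycle at the measured 1.16 GHz clock.
npu3_measured_tops = peak_int8_tops(4096, 1.16)

# NPU 4: 12,288 INT8 MAC/cycle and a quoted 48 TOPS imply this clock.
npu4_implied_clock = implied_clock_ghz(48.0, 12288)

print(f"NPU 3 at 1.16 GHz: {npu3_measured_tops:.2f} TOPS")
print(f"NPU 4 implied clock: {npu4_implied_clock:.2f} GHz")
print(f"NPU 3 qualifies for Copilot+: {meets_copilot_plus(11.5)}")
print(f"NPU 4 qualifies for Copilot+: {meets_copilot_plus(48.0)}")
```

Working backward from a quoted TOPS figure to an implied clock is a quick sanity check on vendor numbers: if the implied clock looks implausible for a laptop part, the headline figure probably assumes sparsity or a different data type.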
The Hidden Constraint: Memory Bandwidth
TOPS is the marketing number. Bandwidth is the engineering number. Lunar Lake ships LPDDR5X-8533 on a 128-bit on-package bus, yielding 8,533 MT/s × 128 bits / 8 = 136.5 GB/s of total platform bandwidth shared among CPU, iGPU, and NPU. There is no private DRAM for the NPU and no published per-device bandwidth quota. This is the single most important number for understanding why LLM decode tops out where it does on Intel hardware.
The mental model to internalize for NPU programming is that compute is cheap and memory is expensive: moving data costs more than computing on it. NPUs have tiny on-chip memory compared to GPUs. When activations or weights spill to DRAM, you pay a 10–100x latency penalty and burn significant power.
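The platform bandwidth figure is simple enough to recompute for any memory configuration. This is a minimal sketch of the formula used above (transfer rate times bus width in bytes), with decimal gigabytes; the function name is illustrative.

```python
def peak_bandwidth_gbs(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


# Lunar Lake: LPDDR5X-8533 on a 128-bit bus.
lunar_lake_gbs = peak_bandwidth_gbs(8533, 128)
print(f"Lunar Lake platform peak: {lunar_lake_gbs:.1f} GB/s")
```

Keep in mind that this is a theoretical ceiling for the whole platform. Because CPU, iGPU, and NPU share it, the share any one device sees in practice will be lower.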
Intel does not say "decode is DRAM-bandwidth-bound on NPU" in marketing copy — that specific phrasing is a gap in vendor literature. The closest official analog is Microsoft's Phi Silica blog (Windows Experience Blog, December 2024): "Context processing involves intense parallel computation, mainly matrix multiplications, requiring high computational power. In contrast, the token iteration stage demands substantial memory for storing and accessing the KV cache for each token generation step. While it needs less computation, efficient memory access is crucial." That's the canonical quotable framing. The roofline becomes tangible on DeepSeek-R1-Distill-Llama-8B INT4: 4 GB of weights streamed at 6.10 tok/s equals about 24.4 GB/s of sustained DRAM read, roughly 18% of platform peak. The NPU does not saturate LPDDR5X; it saturates its scheduling-quota share plus driver overhead.
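The roofline arithmetic in the DeepSeek example can be written out as a back-of-envelope calculation. The sketch below assumes the simplest decode model, in which every generated token streams the full weight set from DRAM once and KV cache and activation traffic are ignored; that simplification is an assumption of this example, and the function names are illustrative.

```python
def sustained_read_gbs(weights_gb: float, tokens_per_s: float) -> float:
    """DRAM read rate if each token streams all weights exactly once."""
    return weights_gb * tokens_per_s


def bandwidth_bound_tokens_per_s(weights_gb: float, bandwidth_gbs: float) -> float:
    """Decode ceiling if the full bandwidth were available for weights."""
    return bandwidth_gbs / weights_gb


PLATFORM_PEAK_GBS = 136.5  # Lunar Lake figure from the text

# 4 GB of INT4 weights at the observed 6.10 tok/s.
read_gbs = sustained_read_gbs(4.0, 6.10)
utilization = read_gbs / PLATFORM_PEAK_GBS
ceiling_tok_s = bandwidth_bound_tokens_per_s(4.0, PLATFORM_PEAK_GBS)

print(f"Sustained read: {read_gbs:.1f} GB/s")
print(f"Share of platform peak: {utilization:.0%}")
print(f"Theoretical decode ceiling: {ceiling_tok_s:.1f} tok/s")
```

The gap between the observed rate and the theoretical ceiling is the point of the paragraph above: the NPU is limited well before the DRAM interface itself is saturated.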
We'll return to this ceiling in Chapter 1.3 (where it sets the ITL floor) and Chapter 2.1 (where it sets the KV cache wall).
Why M2M-100 as the Worked Model
A book about agentic AI on NPUs needs a concrete model to keep referencing, and we'll use Facebook AI's M2M-100, specifically the 418M and 1.2B variants. M2M-100 is a 100-language many-to-many translation model released by Meta in 2020 with four properties that make it a useful teaching example:
It is encoder-decoder seq2seq, which forces us to confront the asymmetric NPU/CPU partition that the rest of the field is converging on: encoders fit NPU constraints well (static shape, single forward pass), decoders do not (dynamic shape, autoregressive). M2M-100 makes the partition visible in code, not just in theory.
It uses full multi-head attention with no GQA or MQA, the architectural choice that defines its KV cache footprint. In Chapter 2.1 we'll show that M2M-100 1.2B has the same per-token decoder KV bandwidth as Phi-3-mini-3.8B, because Phi-3 uses GQA with one-quarter the KV heads. The KV cache wall is set by attention design, not parameter count, and M2M-100 makes this visceral.
It is MIT-licensed (unlike its successor NLLB-200, which is CC-BY-NC 4.0 and unusable in commercial products). The licensing distinction matters more than the technical successor relationship.
It is not on Intel's validated NPU model list.
This is a feature for our purposes. Production NPU deployment guides usually anchor on [...]
The honest deployment recommendation, which we'll build toward, is encoder on NPU and decoder on CPU or iGPU. The mechanics of that split are the through-line of the book.
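The claim that attention design, not parameter count, sets the KV cache footprint is easy to see in numbers. The sketch below computes self-attention KV bytes per generated token as 2 (keys and values) × layers × KV heads × head dimension × bytes per element. The two configurations are illustrative shapes chosen to mirror the comparison described above (a smaller model with full multi-head attention against a larger one with one-quarter the KV heads); they are assumptions for this example and have not been checked against the published model configs. Cross-attention KV in an encoder-decoder model is computed once per input sequence and is left out here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical decoder attention shape used for illustration."""
    layers: int
    kv_heads: int
    head_dim: int


def kv_bytes_per_token(cfg: AttentionConfig, bytes_per_element: int = 2) -> int:
    """Self-attention KV cache bytes added per generated token (FP16 default)."""
    keys_and_values = 2
    return keys_and_values * cfg.layers * cfg.kv_heads * cfg.head_dim * bytes_per_element


# Smaller model, full multi-head attention: every query head has its own KV head.
mha_small = AttentionConfig(layers=24, kv_heads=16, head_dim=64)

# Larger model, grouped-query attention: 32 query heads share 8 KV heads.
gqa_large = AttentionConfig(layers=32, kv_heads=8, head_dim=96)

for name, cfg in (("MHA, smaller model", mha_small), ("GQA, larger model", gqa_large)):
    print(f"{name}: {kv_bytes_per_token(cfg) / 1024:.0f} KiB per token")
```

Under these assumed shapes both models add the same number of KV bytes per token, even though one has several times the parameters of the other.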
What This Means for Agents
You can't just port a cloud-side architecture to an NPU and expect it to work. The trade-offs are different, and the bottlenecks live in different places. Traditional agent architectures assume abundant compute and memory: long context windows, multiple model calls per turn, tool selection by a large generalist model. NPUs invert these assumptions. To build agentic systems that actually perform on NPU hardware, you need to right-size the model to the hardware envelope (more on this in 1.2).
The rest of this chapter builds out the foundations: how NPU constraints translate into model optimization choices, what latency and throughput actually mean on this class of hardware, and the design patterns that consistently work in production. The next section moves from architecture to consequence: given these properties, what computational constraints fall out, and what does optimization look like for an encoder-decoder model on Intel NPU specifically?
Next: 1.2 Computational Constraints & Model Optimization