Foundations of NPU-Optimized Agents
NPU architecture and computational constraints. Model quantization and optimization for NPU deployment. Latency profiles and throughput optimization. Hardware-aware agent design patterns.
1.1 Understanding NPU Architecture
Before talking about agents on NPUs, we need to talk about the NPU itself — what makes it a disti...
1.2 Computational Constraints & Model Optimization
The architecture from Chapter 1.1 sets the rules. This section is about playing inside them: what...
1.3 Latency, Throughput, and Hardware-Aware Patterns
The architecture and constraints from Chapters 1.1 and 1.2 set the ceiling. This section is about...
1.4 The Accuracy Cost of Quantization
Chapter 1.2 laid out the quantization recipes Intel NPU supports: INT8-sym, INT4-sym group-128 or...
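To make the idea of an accuracy cost concrete, here is a minimal pure-Python sketch of symmetric group-wise quantization that measures round-trip error on synthetic weights. The function names, the Gaussian weight distribution, and the use of group size 128 for both bit widths are assumptions made for illustration; they are not the toolchain's actual implementation, and the INT8-sym recipe may group its scales differently.

```python
import random

def quantize_sym(weights, bits, group_size):
    """Symmetric per-group quantize-then-dequantize (illustrative helper).

    Each group of `group_size` weights shares one scale. Values are rounded
    to integers in [-qmax, qmax] and mapped back to floats.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        for w in group:
            q = max(-qmax, min(qmax, round(w / scale)))
            out.append(q * scale)
    return out

def rms_error(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

# Synthetic weights: an assumption, real layers are not exactly Gaussian.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]

# Same group size for both, so only the bit width differs.
err_int8 = rms_error(weights, quantize_sym(weights, bits=8, group_size=128))
err_int4 = rms_error(weights, quantize_sym(weights, bits=4, group_size=128))
print(f"8-bit symmetric, group 128, RMS error: {err_int8:.6f}")
print(f"4-bit symmetric, group 128, RMS error: {err_int4:.6f}")
```

Weight-level error of this kind is only a proxy. The cost that matters for an agent is measured on task-level outputs, which this sketch does not attempt.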
1.5 Speculative Decoding
Chapter 1.3 established the bandwidth ceiling as the binding constraint on LLM decode: 136.5 GB/s...
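As a rough illustration of why bandwidth binds decode, the sketch below computes an upper bound on tokens per second under the assumption that each generated token streams every weight from memory exactly once. The 136.5 GB/s figure is the one quoted above. The 8-billion-parameter model and the bytes-per-parameter values are hypothetical, and the estimate ignores quantization scales, KV-cache traffic, and activations.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode rate when each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

BANDWIDTH_GB_S = 136.5   # bandwidth figure quoted in the text
PARAMS = 8e9             # hypothetical model size, for illustration only

for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = PARAMS * bytes_per_param / 1e9
    ceiling = decode_ceiling_tokens_per_s(BANDWIDTH_GB_S, weights_gb)
    print(f"{label}: {weights_gb:.1f} GB of weights, ceiling {ceiling:.1f} tokens/s")
```

Under these assumptions the ceilings come out to roughly 8.5, 17.1, and 34.1 tokens/s. Speculative decoding attacks the same bound from a different direction: rather than shrinking the bytes read per pass, it tries to accept more than one token per pass over the weights.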