1.2 Computational Constraints & Model Optimization
If 1.1 was about how NPUs are built, this section is about what that buys you and what it costs. Every NPU deployment is a negotiation between three constraints: memory, operator coverage, and numerical precision. Get any one of them wrong and your agent either won't load, won't run on the NPU, or won't produce useful output.
The Three Hard Limits
1. Memory Budget
The headline number — "8 GB of unified memory" or "16 GB on the SoC" — is not what you have. You're sharing with the OS, the foreground app, other ML workloads, and the GPU. Practical model budgets at the edge:
- Phones: 1–3 GB usable for model weights without aggressive eviction
- Laptops with NPU: 2–6 GB, depending on RAM tier
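- Dedicated edge accelerators: varies widely by class. Small parts such as the Coral Edge TPU (roughly 8 MB of on-chip SRAM for parameters) leave far less than 1 GB for the model, while Jetson modules share several GB of system memory between CPU and GPU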
A 7B-parameter LLM at FP16 needs ~14 GB just for weights. The same model at INT4 needs ~3.5 GB. This is why quantization isn't optional at the edge — it's the entry ticket.
2. Operator Coverage
Every NPU has a list of operators it can execute natively. Anything outside that list either falls back to CPU (slow) or fails compilation (catastrophic).
Operators that consistently work well:
- Dense layers / GEMMs
- Convolutions (2D, sometimes 3D)
- Standard activations (ReLU, GELU approximations, sigmoid)
- LayerNorm and RMSNorm (often, but not always)
Operators that frequently cause trouble:
- Custom attention variants (sliding window, sparse attention)
- Dynamic shapes (sequence length varying at runtime)
- Newer activation functions (SwiGLU, GeGLU) without explicit support
- Anything involving complex control flow
The lesson: architecture choice is constrained by operator support, not just by what's state-of-the-art on a research leaderboard. Picking a model with mainstream operators saves weeks of debugging.
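One way to make this check concrete before committing to a model is to compare the operators in the exported graph against the list your NPU compiler accepts. The sketch below is a minimal illustration, not a vendor tool: `SUPPORTED_OPS` is a hypothetical allow-list and `model_ops` is a made-up operator list standing in for whatever your exporter produces.

```python
from collections import Counter

# Hypothetical allow-list. A real one comes from your NPU vendor's compiler docs.
SUPPORTED_OPS = {
    "MatMul", "Gemm", "Conv", "Relu", "Sigmoid", "Add", "Mul", "LayerNormalization",
}

def coverage_report(op_types, supported=SUPPORTED_OPS):
    """Summarize how much of a model's operator list the target supports."""
    counts = Counter(op_types)
    unsupported = {op: n for op, n in counts.items() if op not in supported}
    total = sum(counts.values())
    covered = total - sum(unsupported.values())
    return {
        "coverage": covered / total if total else 1.0,
        "unsupported": unsupported,
    }

# Illustrative operator list (assumed, not taken from a real model).
model_ops = ["MatMul", "Add", "LayerNormalization", "MatMul", "Swish", "Mul", "MatMul", "Add"]
report = coverage_report(model_ops)
print(f"coverage: {report['coverage']:.1%}")
print(f"unsupported ops: {report['unsupported']}")
```

A high coverage percentage is not enough on its own: a single unsupported operator inside the attention block can force a CPU round trip on every layer, so the `unsupported` dictionary is the part worth reading closely.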
3. Numerical Precision
Most NPUs are integer machines. They want INT8 or INT4 weights and activations. Some support FP16 or BF16, but at reduced throughput.
This matters because:
- Not every operation quantizes cleanly (softmax, layernorm tail values, residual additions)
- Quantization-induced accuracy loss is workload-dependent — a code-generation model and a sentiment classifier degrade differently
- Mixed-precision execution introduces conversion overhead at the boundaries
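To see why tail values are a problem, consider a rough sketch of symmetric per-tensor quantization. This is one common scheme, assumed here for illustration; real NPU toolchains may use per-channel scales or other variants. A single outlier stretches the scale and wipes out resolution for every other value.

```python
def quantize_symmetric(values, bits=8):
    """Symmetric per-tensor quantization: one scale, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def mean_abs_error(values, bits):
    """Average round-trip error after quantizing and dequantizing."""
    q, scale = quantize_symmetric(values, bits)
    restored = [qi * scale for qi in q]
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

well_behaved = [0.1 * i for i in range(-10, 11)]   # values in [-1, 1]
with_outlier = well_behaved + [40.0]               # one tail value stretches the scale

for name, data in [("well-behaved", well_behaved), ("with outlier", with_outlier)]:
    print(f"{name}: INT8 error={mean_abs_error(data, 8):.4f}, "
          f"INT4 error={mean_abs_error(data, 4):.4f}")
```

Running this shows the error growing both when the bit width drops and when the outlier is present, which is the same effect that makes softmax inputs and normalization layers awkward to quantize.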
Quantization in Practice
There are two paths to a quantized model:
Post-Training Quantization (PTQ) is the fast path. You take a trained FP16 model, run a calibration dataset through it to gather activation statistics, and convert weights and (optionally) activations to integer format. It's often "good enough" for INT8, but degrades visibly at INT4.
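The calibration step can be sketched as follows, assuming simple min/max range tracking and unsigned 8-bit affine quantization. Production toolchains often use percentile or entropy-based calibration instead, and all function names and the sample batches here are illustrative.

```python
def calibrate_range(batches):
    """Track the running min/max of activations over calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi

def affine_params(lo, hi, bits=8):
    """Derive scale and zero-point for unsigned affine quantization."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)        # the range must contain zero
    span = hi - lo
    scale = span / qmax if span > 0 else 1.0
    zero_point = max(0, min(qmax, round(-lo / scale)))
    return scale, zero_point

def quantize(x, scale, zero_point, bits=8):
    """Map a float to an integer code, clamping to the representable range."""
    return max(0, min(2 ** bits - 1, round(x / scale) + zero_point))

def dequantize_affine(q, scale, zero_point):
    """Map an integer code back to an approximate float value."""
    return (q - zero_point) * scale

# Made-up activation samples standing in for a calibration dataset.
calibration_batches = [[-0.5, 0.2, 1.5], [0.0, 2.0, -1.0], [0.7, 3.0, 0.1]]
lo, hi = calibrate_range(calibration_batches)
scale, zero_point = affine_params(lo, hi)
restored = dequantize_affine(quantize(1.5, scale, zero_point), scale, zero_point)
print(f"range=({lo}, {hi}) scale={scale:.5f} zero_point={zero_point} 1.5 -> {restored:.5f}")
```

The quality of the result depends on how representative the calibration batches are: activations outside the observed range get clamped at inference time.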
Quantization-Aware Training (QAT) simulates quantization during training, letting the model adapt its weights to integer constraints. It produces better accuracy, especially at INT4 and lower, but costs significant compute and requires the training pipeline.
Practical guidance:
| Model Type | Recommendation |
|---|---|
| Encoder models for classification, retrieval | PTQ INT8 — usually fine |
| Small LLMs (≤3B) for on-device generation | QAT INT4 if available, PTQ INT4 with calibration otherwise |
| Vision models for detection/segmentation | PTQ INT8; watch out for last-layer accuracy |
| Speech models (ASR, TTS) | PTQ INT8 for ASR; TTS often needs FP16 fallback for vocoders |
Always evaluate the quantized model on your actual task distribution. Aggregate benchmarks (perplexity, MMLU) tell you almost nothing about whether your agent's tool-calling behavior survives quantization.
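A task-level check for an agent might look like the sketch below: run the same prompts through the baseline and the quantized model, then measure how often the quantized model emits the same tool call. The JSON shape (`"tool"` and `"args"` keys) and the sample outputs are assumptions made for illustration; adapt them to whatever format your agent actually uses.

```python
import json

def parse_tool_call(text):
    """Return (tool_name, arguments), or None if the output is not a valid call."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or "tool" not in obj:
        return None
    return obj["tool"], obj.get("args", {})

def tool_call_agreement(baseline_outputs, quantized_outputs):
    """Fraction of prompts where the quantized model matches the baseline call."""
    if not baseline_outputs:
        return 0.0
    matches = 0
    for base, quant in zip(baseline_outputs, quantized_outputs):
        expected = parse_tool_call(base)
        if expected is not None and expected == parse_tool_call(quant):
            matches += 1
    return matches / len(baseline_outputs)

# Made-up outputs: one exact match, one wrong argument, one truncated JSON.
baseline = [
    '{"tool": "search", "args": {"q": "weather"}}',
    '{"tool": "calendar", "args": {"day": "mon"}}',
    '{"tool": "search", "args": {"q": "news"}}',
]
quantized = [
    '{"tool": "search", "args": {"q": "weather"}}',
    '{"tool": "calendar", "args": {"day": "tue"}}',
    '{"tool": "search", "args": {"q": "news"',
]
rate = tool_call_agreement(baseline, quantized)
print(f"tool-call agreement: {rate:.1%}")
```

Exact-match agreement is deliberately strict. Malformed JSON and wrong arguments both count as failures, since either one breaks the agent loop in practice.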
Operator Fusion and Graph Optimization
Beyond quantization, the compiler does a lot of work on the model graph before it runs:
- Fusion: combining adjacent operators (e.g., Conv + BatchNorm + ReLU) into a single kernel that avoids writing intermediate results to memory
- Constant folding: precomputing operations on constant tensors at compile time
- Layout transformation: rearranging tensor memory layouts to match the NPU's preferred access pattern (NCHW vs NHWC, blocked layouts)
- Operator replacement: substituting unsupported ops with NPU-native equivalents
You don't write these passes yourself, but you do influence them. A model exported with messy tensor reshapes between every layer will fuse poorly. A model with clean, contiguous operations will run much closer to peak throughput.
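To make fusion and constant folding less abstract, here is a scalar sketch of the classic Conv + BatchNorm + ReLU case. A single multiply-add stands in for the convolution, and the parameter values are arbitrary; the point is only that the BatchNorm statistics are constants at inference time, so they can be folded into the preceding layer's weight and bias ahead of time.

```python
import math

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BatchNorm into the preceding layer's weight and bias."""
    k = gamma / math.sqrt(var + eps)
    return w * k, (b - mean) * k + beta

def unfused(x, w, b, gamma, beta, mean, var, eps=1e-5):
    """Three separate steps, each of which would write an intermediate result."""
    y = w * x + b                                           # conv/linear stand-in
    y = gamma * (y - mean) / math.sqrt(var + eps) + beta    # BatchNorm (inference)
    return max(0.0, y)                                      # ReLU

def fused(x, w_folded, b_folded):
    """One multiply-add plus clamp: the shape of a fused kernel."""
    return max(0.0, w_folded * x + b_folded)

# Arbitrary example parameters.
params = dict(w=0.8, b=0.1, gamma=1.5, beta=-0.2, mean=0.3, var=0.04)
w_folded, b_folded = fold_batchnorm(**params)

for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, unfused(x, **params), fused(x, w_folded, b_folded))
```

Both paths produce the same outputs up to floating-point rounding, but the fused form does one pass over the data instead of three.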
Practical tip: when you export a model to ONNX (or Core ML, or TFLite), inspect the resulting graph. If you see Reshape, Transpose, or Cast operations scattered everywhere, your compiler is going to have a bad time.
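A quick way to quantify "scattered everywhere" is to count how much of the graph consists of layout and cast operators. The sketch below works on a plain list of operator names, and the sample list is invented. With the `onnx` package installed, such a list can be collected with `[node.op_type for node in onnx.load(path).graph.node]`. What ratio counts as too high is a judgment call rather than a documented limit.

```python
from collections import Counter

# Operators named in the tip above as signs of a messy export.
LAYOUT_OPS = ("Reshape", "Transpose", "Cast")

def layout_op_summary(op_types):
    """Return the share of layout/cast ops in the graph and their counts."""
    counts = Counter(op_types)
    found = {op: counts[op] for op in LAYOUT_OPS if counts[op] > 0}
    ratio = sum(found.values()) / len(op_types) if op_types else 0.0
    return ratio, found

# Invented operator sequence standing in for an exported graph.
exported_ops = [
    "Conv", "Reshape", "Transpose", "Relu", "Cast",
    "MatMul", "Reshape", "Add", "Transpose", "Softmax",
]
ratio, found = layout_op_summary(exported_ops)
print(f"layout/cast ops: {ratio:.0%} of graph, breakdown: {found}")
```

Comparing this ratio before and after cleaning up the export script gives a cheap signal of whether the compiler has more contiguous compute to fuse.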
The Sizing Heuristic
When you're scoping a new agent, use this as a back-of-envelope check before committing to a model:
required_memory_MB ≈ (params_in_billions × 1024 × bytes_per_param) + activation_overhead
Where bytes_per_param is:
- 2.0 for FP16
- 1.0 for INT8
- 0.5 for INT4
And activation_overhead is roughly 200–800 MB depending on context length and batch size.
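For a 1.5B model at INT4: (1.5 × 1024 × 0.5) + 500 = 1,268 MB, or roughly 1.2 GB. That fits within the 1–3 GB phone budget given earlier. A 7B model at INT4 lands around 4 GB (3,584 MB of weights plus 500 MB of overhead), which exceeds that typical phone budget: feasible on a flagship phone with memory to spare, marginal on a mid-range one, comfortable on a laptop NPU with a higher RAM tier.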
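The heuristic translates directly into a few lines of code. This sketch uses the formula and bytes-per-parameter values from this section; the 500 MB default overhead and the budget figures in the usage lines are illustrative assumptions, and the `× 1024` factor is itself a rounding of roughly 954 MiB per billion bytes.

```python
# Bytes per parameter, as listed in this section.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def required_memory_mb(params_in_billions, precision, activation_overhead_mb=500):
    """Back-of-envelope memory estimate in MB: weights plus activation overhead."""
    weights_mb = params_in_billions * 1024 * BYTES_PER_PARAM[precision]
    return weights_mb + activation_overhead_mb

def fits(params_in_billions, precision, budget_mb, activation_overhead_mb=500):
    """True if the estimate is within the memory budget for model use."""
    return required_memory_mb(params_in_billions, precision, activation_overhead_mb) <= budget_mb

for size, precision in [(1.5, "int4"), (7, "int4"), (7, "fp16")]:
    need = required_memory_mb(size, precision)
    # 3072 MB is an assumed phone budget, taken from the upper end of the range above.
    print(f"{size}B @ {precision}: {need:.0f} MB, fits in 3 GB budget: {fits(size, precision, 3072)}")
```

Treat the result as a first filter only. Context length and batch size can push activation memory well outside the default used here.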
What to Take Away
The work of optimizing a model for an NPU isn't separate from agent design — it determines the agent's design envelope. Before you write a line of orchestration code, you should know:
- What model size your target hardware actually fits (after quantization, after activation overhead)
- Which operators the model uses, and whether they're supported on your NPU's compiler
- How much accuracy degradation quantization costs you on your real task distribution
- What falls back to CPU, and whether that fallback is on the critical path
These four answers shape everything downstream. The next section turns to latency and throughput — how to think about time on this hardware, which is the lens that matters most for agent responsiveness.