1.2 Computational Constraints & Model Optimization

If 1.1 was about how NPUs are built, this section is about what that buys you and what it costs. Every NPU deployment is a negotiation between three constraints: memory, operator coverage, and numerical precision. Get any one of them wrong and your agent either won't load, won't run on the NPU, or won't produce useful output.

The Three Hard Limits

1. Memory Budget

The headline number — "8 GB of unified memory" or "16 GB on the SoC" — is not what you have. You're sharing with the OS, the foreground app, other ML workloads, and the GPU. Practical model budgets at the edge:

  • Phones: 1–3 GB usable for model weights without aggressive eviction
  • Laptops with NPU: 2–6 GB, depending on RAM tier
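  • Dedicated edge accelerators (Coral, Jetson): varies widely, from a few MB of on-chip memory on a Coral Edge TPU to several GB of shared DRAM on a Jetson module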

A 7B-parameter LLM at FP16 needs ~14 GB just for weights. The same model at INT4 needs ~3.5 GB. This is why quantization isn't optional at the edge — it's the entry ticket.
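To see what a given budget buys, the same arithmetic can be run in both directions. The sketch below is weights-only and assumes decimal gigabytes; `weights_gb` and `max_params_billions` are illustrative names, and the small extra cost of per-group scales in real INT4 formats is ignored.

```python
# Bytes per parameter for the precisions discussed in this section.
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weights_gb(params_billions: float, precision: str) -> float:
    """Weight-only footprint in decimal GB (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billions * BYTES_PER_PARAM[precision]

def max_params_billions(budget_gb: float, precision: str) -> float:
    """Largest model, in billions of parameters, whose weights fit in budget_gb."""
    return budget_gb / BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"7B @ {precision}: {weights_gb(7, precision):.1f} GB of weights")
    print(f"3 GB budget @ {precision}: up to {max_params_billions(3, precision):.1f}B params")
```

With a 3 GB budget (the top of the phone range above), FP16 caps out at 1.5B parameters while INT4 reaches 6B, before any activation memory is counted.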

2. Operator Coverage

Every NPU has a list of operators it can execute natively. Anything outside that list either falls back to CPU (slow) or fails compilation (catastrophic).

Operators that consistently work well:

  • Dense layers / GEMMs
  • Convolutions (2D, sometimes 3D)
  • Standard activations (ReLU, GELU approximations, sigmoid)
  • LayerNorm and RMSNorm (often, but not always)

Operators that frequently cause trouble:

  • Custom attention variants (sliding window, sparse attention)
  • Dynamic shapes (sequence length varying at runtime)
  • Newer activation functions (SwiGLU, GeGLU) without explicit support
  • Anything involving complex control flow

The lesson: architecture choice is constrained by operator support, not just by what's state-of-the-art on a research leaderboard. Picking a model with mainstream operators saves weeks of debugging.
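A coverage check is easy to automate once you have the list of op types from your exported graph. A minimal sketch, assuming ONNX-style op names; `SUPPORTED_OPS` is a made-up set for illustration, and the real list has to come from your NPU vendor's compiler documentation.

```python
from collections import Counter

# Hypothetical supported-operator set (illustration only, not any vendor's list).
SUPPORTED_OPS = {
    "MatMul", "Gemm", "Conv", "Relu", "Sigmoid",
    "Add", "Mul", "Softmax", "LayerNormalization",
}

def coverage_report(model_ops, supported=SUPPORTED_OPS):
    """Return the fraction of graph nodes the NPU can run and the ops it cannot."""
    counts = Counter(model_ops)
    unsupported = {op: n for op, n in counts.items() if op not in supported}
    total = sum(counts.values())
    covered = total - sum(unsupported.values())
    return {
        "coverage": covered / total if total else 1.0,
        "unsupported": unsupported,
    }

# Op types as they might appear in a small attention block.
ops = ["MatMul", "Add", "Softmax", "MatMul", "Einsum", "Where", "Relu"]
report = coverage_report(ops)
print(f"coverage: {report['coverage']:.0%}, unsupported: {report['unsupported']}")
```

Node-count coverage is only a first filter. One unsupported op in the middle of the graph can split it into several NPU segments with CPU round trips between them, so where the gaps fall matters as much as how many there are.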

3. Numerical Precision

Most NPUs reach peak throughput on integer math. They want INT8 or INT4 weights and activations. Many also support FP16 or BF16, but usually at reduced throughput.

This matters because:

  • Not every operation quantizes cleanly (softmax, layernorm tail values, residual additions)
  • Quantization-induced accuracy loss is workload-dependent — a code-generation model and a sentiment classifier degrade differently
  • Mixed-precision execution introduces conversion overhead at the boundaries
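The "tail values" problem is easiest to see with numbers. The sketch below uses plain symmetric per-tensor INT8 quantization (one common scheme, chosen here as an assumption; your toolchain may use per-channel or asymmetric variants) and shows how a single outlier stretches the scale and wipes out resolution for everything else.

```python
def quantize_symmetric(values, bits=8):
    """Map floats to signed integers with one shared scale (per-tensor)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # step size between grid points
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [x * scale for x in q]

def max_abs_error(values, bits=8):
    """Worst round-trip error after quantize -> dequantize."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(q, scale)))

well_behaved = [-1.0, -0.5, 0.02, 0.3, 0.9]
with_outlier = well_behaved + [60.0]               # one heavy-tail value

print(f"worst error, no outlier:   {max_abs_error(well_behaved):.4f}")
print(f"worst error, with outlier: {max_abs_error(with_outlier):.4f}")
```

The outlier itself round-trips almost perfectly. The damage lands on the small values, which now share a grid roughly sixty times coarser.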

Quantization in Practice

There are two paths to a quantized model:

Post-Training Quantization (PTQ) is the fast path. You take a trained FP16 model, run a calibration dataset through it to gather activation statistics, and convert weights and (optionally) activations to integer format. It's often "good enough" for INT8, but degrades visibly at INT4.
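The calibration step can be sketched in a few lines. This version assumes the simplest possible statistic, the min and max seen across calibration batches, and unsigned 8-bit affine quantization; real toolchains often use percentile or histogram-based ranges instead. All function names are illustrative.

```python
def calibrate_range(batches):
    """Observed activation range over every calibration batch."""
    lo = min(min(b) for b in batches)
    hi = max(max(b) for b in batches)
    return lo, hi

def affine_params(lo, hi, bits=8):
    """Scale and zero-point for unsigned affine quantization over [lo, hi]."""
    qmax = 2 ** bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    scale = (hi - lo) / qmax or 1.0       # fall back to 1.0 for an all-zero range
    zero_point = round(-lo / scale)
    return scale, max(0, min(qmax, zero_point))

def quantize(x, scale, zero_point, bits=8):
    q = round(x / scale) + zero_point
    return max(0, min(2 ** bits - 1, q))  # values outside the range are clipped

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

# Stand-in for activations captured from a calibration dataset.
calibration_batches = [[-0.5, 0.1, 1.2], [0.0, 2.0, -1.0]]
lo, hi = calibrate_range(calibration_batches)
scale, zp = affine_params(lo, hi)
print(f"range=[{lo}, {hi}] scale={scale:.5f} zero_point={zp}")
print("5.0 clips to", dequantize(quantize(5.0, scale, zp), scale, zp))
```

The last line is the reason calibration data has to resemble production traffic: anything outside the observed range is silently clipped at inference time.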

Quantization-Aware Training (QAT) simulates quantization during training, letting the model adapt its weights to integer constraints. It produces better accuracy, especially at INT4 and lower, but costs significant compute and requires the training pipeline.
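What "simulates quantization during training" means mechanically: the forward pass uses a quantize-then-dequantize copy of each weight, while the gradient updates a full-precision latent weight. The toy below is a sketch under heavy assumptions (one scalar weight, a fixed scale of 0.25, hand-written gradients, a straight-through estimator with no clipping term); real QAT runs inside a training framework.

```python
def fake_quant(w: float, scale: float, bits: int = 4) -> float:
    """Quantize-dequantize: snap w to the signed integer grid, return it as a float."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # -8..7 for INT4
    q = max(qmin, min(qmax, round(w / scale)))
    return q * scale

def train_qat(xs, ys, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0                                    # latent full-precision weight
    for _ in range(steps):
        wq = fake_quant(w, scale)              # forward uses the quantized value
        # d(MSE)/d(wq). The straight-through estimator treats d(wq)/d(w) as 1,
        # so this gradient is applied to the latent weight unchanged.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w, fake_quant(w, scale)

xs = [1.0, 2.0, 3.0]
ys = [0.6 * x for x in xs]                     # true weight 0.6 is not on the 0.25 grid
latent, deployed = train_qat(xs, ys)
print(f"latent w = {latent:.3f}, deployed weight = {deployed}")
```

The latent weight ends up hovering around the boundary between the two nearest grid points (0.5 and 0.75), and the deployed weight is one of them. In a real network, the other weights adapt to compensate for that rounding, which is where QAT's accuracy advantage comes from.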

Practical guidance:

Model Type                                    | Recommendation
Encoder models for classification, retrieval  | PTQ INT8; usually fine
Small LLMs (≤3B) for on-device generation     | QAT INT4 if available, PTQ INT4 with calibration otherwise
Vision models for detection/segmentation      | PTQ INT8; watch out for last-layer accuracy
Speech models (ASR, TTS)                      | PTQ INT8 for ASR; TTS often needs FP16 fallback for vocoders

Always evaluate the quantized model on your actual task distribution. Aggregate benchmarks (perplexity, MMLU) tell you almost nothing about whether your agent's tool-calling behavior survives quantization.
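A task-level check does not need to be elaborate. One option, sketched here with hypothetical data, is to run the FP16 baseline and the quantized model over the same task examples and measure how often they emit the same tool call with the same arguments. The `(tool_name, arguments)` tuple format is an assumption about how your harness records calls.

```python
def tool_call_agreement(reference, candidate):
    """Return (agreement rate, indices of examples where the two runs disagree).

    Each item is a (tool_name, arguments_dict) pair for one task example.
    """
    if len(reference) != len(candidate):
        raise ValueError("both runs must cover the same examples")
    mismatches = [i for i, (r, c) in enumerate(zip(reference, candidate)) if r != c]
    rate = 1.0 - len(mismatches) / len(reference) if reference else 1.0
    return rate, mismatches

# Hypothetical outputs from the FP16 baseline and the quantized model.
fp16_calls = [
    ("search", {"query": "weather in Oslo"}),
    ("calendar.add", {"title": "standup", "hour": 9}),
    ("search", {"query": "npu operator list"}),
    ("none", {}),
]
int4_calls = [
    ("search", {"query": "weather in Oslo"}),
    ("calendar.add", {"title": "standup", "hour": 19}),   # argument drifted
    ("search", {"query": "npu operator list"}),
    ("search", {"query": ""}),                             # spurious call
]

rate, bad = tool_call_agreement(fp16_calls, int4_calls)
print(f"agreement: {rate:.2f}, mismatched examples: {bad}")
```

Exact match is deliberately strict. Reading the mismatched examples usually tells you more than the rate does, because a wrong argument and a spurious call fail in very different ways downstream.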

Operator Fusion and Graph Optimization

Beyond quantization, the compiler does a lot of work on the model graph before it runs:

  • Fusion: combining adjacent operators (e.g., Conv + BatchNorm + ReLU) into a single kernel that avoids writing intermediate results to memory
  • Constant folding: precomputing operations on constant tensors at compile time
  • Layout transformation: rearranging tensor memory layouts to match the NPU's preferred access pattern (NCHW vs NHWC, blocked layouts)
  • Operator replacement: substituting unsupported ops with NPU-native equivalents
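The Conv + BatchNorm + ReLU case shows why fusion is safe as well as fast: at inference time BatchNorm is just a per-channel scale and shift, so it can be folded into the convolution's weights and bias ahead of time. A minimal sketch for a single output channel at a single position (where the convolution reduces to a dot product); the function names are illustrative, not a compiler API.

```python
import math

def conv_bn_relu(x, weights, bias, mean, var, gamma, beta, eps=1e-5):
    """Unfused reference: three ops, two intermediate values."""
    y = sum(w * xi for w, xi in zip(weights, x)) + bias     # conv at one position
    z = gamma * (y - mean) / math.sqrt(var + eps) + beta     # BatchNorm, inference mode
    return max(0.0, z)                                       # ReLU

def fold_batchnorm(weights, bias, mean, var, gamma, beta, eps=1e-5):
    """Precompute conv weights and bias that already include the BatchNorm."""
    k = gamma / math.sqrt(var + eps)
    return [w * k for w in weights], (bias - mean) * k + beta

def fused_conv_relu(x, weights, bias):
    """Single kernel: dot product, bias, and ReLU with no intermediate stored."""
    return max(0.0, sum(w * xi for w, xi in zip(weights, x)) + bias)

x = [0.5, -1.0, 2.0]
weights, bias = [0.2, 0.4, -0.1], 0.3
bn = dict(mean=-0.5, var=4.0, gamma=1.5, beta=0.1)

folded_w, folded_b = fold_batchnorm(weights, bias, **bn)
print(conv_bn_relu(x, weights, bias, **bn), fused_conv_relu(x, folded_w, folded_b))
```

Both calls produce the same value up to floating-point rounding, but the fused version never materializes the pre-normalization tensor.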

You don't write these passes yourself, but you do influence them. A model exported with messy tensor reshapes between every layer will fuse poorly. A model with clean, contiguous operations will run much closer to peak throughput.

Practical tip: when you export a model to ONNX (or Core ML, or TFLite), inspect the resulting graph. If you see Reshape, Transpose, or Cast operations scattered everywhere, your compiler is going to have a bad time.
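For ONNX, that inspection can be a quick script rather than a visual scan. The counting logic below is plain Python; the optional loader uses the `onnx` package's `onnx.load` and the graph's `node` list. What share of layout ops counts as "too many" is a judgment call, and no threshold is implied here.

```python
from collections import Counter

LAYOUT_OPS = ("Reshape", "Transpose", "Cast")

def layout_op_share(op_types):
    """Count layout/cast ops and return (per-op counts, share of all nodes)."""
    counts = Counter(op_types)
    total = sum(counts.values())
    layout = {op: counts[op] for op in LAYOUT_OPS if counts[op]}
    share = sum(layout.values()) / total if total else 0.0
    return layout, share

def op_types_from_onnx(path):
    """Read op types from an ONNX file (requires the third-party onnx package)."""
    import onnx  # imported lazily so the counting logic runs without it
    model = onnx.load(path)
    return [node.op_type for node in model.graph.node]

# Stand-in op list; with a real export use op_types_from_onnx("model.onnx").
ops = ["MatMul", "Reshape", "Transpose", "MatMul", "Cast", "Add", "Reshape", "Relu"]
layout, share = layout_op_share(ops)
print(f"layout ops: {layout}, share of graph: {share:.0%}")
```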

The Sizing Heuristic

When you're scoping a new agent, use this as a back-of-envelope check before committing to a model:

required_memory_MB ≈ (params_in_billions × 1024 × bytes_per_param) + activation_overhead

Where bytes_per_param is:

  • 2.0 for FP16
  • 1.0 for INT8
  • 0.5 for INT4

And activation_overhead, which for LLMs is largely the KV cache, is roughly 200–800 MB depending on context length and batch size.

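For a 1.5B model at INT4: (1.5 × 1024 × 0.5) + 500 = 768 + 500 = 1,268 MB, or about 1.3 GB. That fits comfortably on most mobile NPUs. A 7B model at INT4 comes to 3,584 + 500 = 4,084 MB, around 4 GB. That is above the 1–3 GB phone range given under Memory Budget, so it is feasible on a flagship phone only with some eviction pressure, marginal on a mid-range one, and comfortable on a laptop NPU.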
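The heuristic translates directly into a helper you can keep next to your model shortlist. A minimal sketch: the 500 MB default is the midpoint used in the worked example, and `fits` with its budget argument is an illustrative addition, not part of the formula.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def required_memory_mb(params_billions, precision, activation_overhead_mb=500):
    """Back-of-envelope memory estimate from the sizing heuristic, in MB."""
    return params_billions * 1024 * BYTES_PER_PARAM[precision] + activation_overhead_mb

def fits(params_billions, precision, budget_mb, activation_overhead_mb=500):
    """True if the estimate is within the usable budget for the target device."""
    return required_memory_mb(params_billions, precision, activation_overhead_mb) <= budget_mb

phone_budget_mb = 3 * 1024   # top of the phone range from the Memory Budget list
for size in (1.5, 7):
    need = required_memory_mb(size, "INT4")
    print(f"{size}B @ INT4: {need:.0f} MB, fits 3 GB phone budget: "
          f"{fits(size, 'INT4', phone_budget_mb)}")
```

Rerun it with the overhead at both ends of the 200–800 MB range before committing: a model that only fits at the low end will not survive long contexts.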

What to Take Away

The work of optimizing a model for an NPU isn't separate from agent design — it determines the agent's design envelope. Before you write a line of orchestration code, you should know:

  1. What model size your target hardware actually fits (after quantization, after activation overhead)
  2. Which operators the model uses, and whether they're supported on your NPU's compiler
  3. How much accuracy degradation quantization costs you on your real task distribution
  4. What falls back to CPU, and whether that fallback is on the critical path

These four answers shape everything downstream. The next section turns to latency and throughput — how to think about time on this hardware, which is the lens that matters most for agent responsiveness.

