1.2 Computational Constraints & Model Optimization

The architecture from Chapter 1.1 sets the rules. This section is about playing inside them: what an Intel NPU will and won't accept, how to shape a model so it compiles, and how to quantize without quietly losing the quality you paid for in training. We anchor on M2M-100 throughout, partly because translation is a clean worked example, partly because M2M-100 is unforgiving enough about NPU constraints to be instructive.

The Static-Shape Mandate

The first rule of Intel NPU is that shapes are largely set at compile time, not run time. The compiler tiles your graph across the NCEs and SHAVE DSPs, computes the SRAM allocation, and generates a binary blob. Change the shapes and you compile a new blob, which takes seconds to tens of seconds, and which Windows Update or a driver upgrade may invalidate.

For non-LLM workloads this is an absolute constraint. The encoder of M2M-100 has to be reshaped to a fixed sequence length before compile:

# Pin both encoder inputs to a fixed [batch, sequence] shape before compiling.
encoder_model.reshape({"input_ids":      [1, 128],
                       "attention_mask": [1, 128]})
encoder_npu = core.compile_model(encoder_model, "NPU")

Any input shorter than 128 gets padded; anything longer either truncates or forces a new compile. Pick the sequence length once, pick it to cover your real workload's 95th percentile, and live with the padding waste on short inputs.
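The two decisions in that paragraph (choosing the length, then padding or truncating to it) can be sketched in a few lines of plain Python. The helper names are hypothetical, and the pad token id of 1 is an assumption for illustration; read the real value from your tokenizer's pad_token_id.

```python
def pick_sequence_length(token_counts, percentile=95):
    """Return the nearest-rank percentile of the observed token counts."""
    if not token_counts:
        raise ValueError("need at least one observed length")
    ordered = sorted(token_counts)
    # Integer ceiling of percentile * n / 100, clamped to at least rank 1.
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def pad_or_truncate(input_ids, seq_len, pad_id=1):
    """Fit one tokenized input to the compiled length.

    Returns (input_ids, attention_mask), both exactly seq_len long.
    Note that plain truncation drops any trailing end-of-sequence token.
    """
    ids = list(input_ids)[:seq_len]
    real = len(ids)
    padding = seq_len - real
    return ids + [pad_id] * padding, [1] * real + [0] * padding


# Observed token counts from a (made-up) sample of the workload.
observed = list(range(1, 101))
seq_len = pick_sequence_length(observed)
ids, mask = pad_or_truncate([5, 6, 7], 5)
print(seq_len, ids, mask)
```

In practice you would likely round the chosen length up to a convenient value such as 128 rather than compile for the raw percentile.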

For LLM workloads the story has loosened. OpenVINO 2025.3 introduced dynamic prompts on NPU by default through the LLMPipeline static-shape pipeline with PREFILL_HINT=DYNAMIC and NPUW_LLM_PREFILL_CHUNK_SIZE=1024. This isn't dynamic shape in the GPU sense; it's chunked static prefill, where the compiler emits a fixed-shape kernel and the runtime feeds chunks until the prompt is consumed. The illusion of dynamism, paid for by a fixed chunk granularity. There's no equivalent for OVModelForSeq2SeqLM, which is exactly why M2M-100's decoder doesn't get the same flexibility as a Llama-3 decoder.
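To make the "fixed chunk granularity" cost concrete, here is a toy model of chunked static prefill. It only counts chunks and wasted slots; it is not how the NPU runtime is implemented, and treating the final partial chunk as padded up to the full chunk size is an assumption of this sketch.

```python
def chunked_prefill_plan(prompt_len, chunk_size=1024):
    """Return (num_chunks, padded_tokens) for a prompt fed in fixed-size chunks."""
    if prompt_len <= 0 or chunk_size <= 0:
        raise ValueError("prompt_len and chunk_size must be positive")
    # Ceiling division: every started chunk costs one full kernel invocation.
    num_chunks = -(-prompt_len // chunk_size)
    # Slots in the last chunk that carry no real prompt tokens.
    padded_tokens = num_chunks * chunk_size - prompt_len
    return num_chunks, padded_tokens


for prompt_len in (1, 1024, 1500):
    chunks, waste = chunked_prefill_plan(prompt_len)
    print(f"{prompt_len:>5} tokens -> {chunks} chunk(s), {waste} padded slot(s)")
```

Under this model a one-token prompt costs the same single kernel invocation as a 1024-token prompt, which is the price of keeping the compiled shape fixed.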

Intel NPU Operator Coverage

The canonical list of supported operations lives at docs.openvino.ai/<version>/about-openvino/compatibility-and-support/supported-operations.html, version-stamped per release.

Encoder-friendly ops are mature. Transformer encoders compile reliably: MatMul, Add, Multiply, LayerNormalization (with decomposed fallback when the fused op isn't supported), Softmax, Gelu, Reshape, Transpose, Concat, Gather with static indices, ScaledDotProductAttention, Convert, and the FakeQuantize/FakeConvert ops for INT8/FP8 paths. OpenVINO 2025.2 explicitly added QKV-projection and Multi-Head Attention graph-level fusions for encoder-based LLMs, which is exactly the kind of optimization M2M-100's encoder benefits from.

Decoder pain points have specific names, and each is worth recognizing because they appear in real error messages:

- DetectionOutput still fails to compile on NPU and iGPU as of OpenVINO 2025.4 (Intel Community thread 1735991, Feb 2026).
- ScatterNDUpdate has been rejected by the VPU/NPU compiler historically (issue #13594).
- INT64 indices in Gather and ScatterND routinely cause silent CPU fallback.
- Variable-length Gather, dynamic Slice, and dynamic Reshape in autoregressive decoders are the structural reason the whole model historically had to be static.

When in doubt about whether your graph compiles, the answer is to try and read the compile log. The error messages are reasonably informative; the failures are usually localizable to a specific op.
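Before compiling, a cheap pre-flight pass over the IR can tell you whether any of the ops named above are present at all. The sketch below relies only on the fact that an OpenVINO IR .xml file stores each operation as a layer element with a type attribute. The function name, the watchlist, and the three-layer sample document are illustrative assumptions (a real IR also carries ports, data, and edges), and the compile log remains the authority.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Op types called out in the text as recurring NPU compile failures.
WATCHLIST = {"DetectionOutput", "ScatterNDUpdate"}


def scan_ir_ops(ir_xml_text):
    """Count layer types in an IR XML string and flag watchlisted ones."""
    root = ET.fromstring(ir_xml_text)
    counts = Counter(layer.get("type") for layer in root.iter("layer"))
    flagged = {op: n for op, n in counts.items() if op in WATCHLIST}
    return counts, flagged


# Simplified stand-in for a real IR file.
SAMPLE_IR = """<net name="toy" version="11">
  <layers>
    <layer id="0" name="x" type="Parameter" version="opset1"/>
    <layer id="1" name="mm" type="MatMul" version="opset1"/>
    <layer id="2" name="upd" type="ScatterNDUpdate" version="opset4"/>
  </layers>
</net>"""

counts, flagged = scan_ir_ops(SAMPLE_IR)
print(dict(counts))
print("needs attention:", flagged)
```

For a real model you would read the text of an exported file such as openvino_decoder_model.xml and pass it to scan_ir_ops.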

Quantization: PTQ, Not QAT

Post-training quantization (PTQ) is the default path on Intel NPU. Quantization-aware training (QAT) is technically supported by NNCF but rarely necessary — the PTQ recipes Intel has tuned for the validated NPU model list are good enough for most use cases, and they don't require retraining.

The path looks like this: export your PyTorch model to OpenVINO IR via Optimum-Intel, pick a quantization recipe (INT8 weight-only, INT4 channel-wise, INT4 group-wise, NF4 on Lunar Lake+, FP8 on Panther Lake+), and let NNCF do the work. The recipe matters because Intel NPU has strict constraints on which combinations work.

The NPU LLM quantization rule from Intel's GenAI-on-NPU guide is unambiguous: maximize the 4-bit weight ratio (--ratio 1.0), use --group-size 128 for models up to ~4–5 B parameters, use --group-size -1 (channel-wise) for larger models, and always use symmetric quantization (--sym). Asymmetric quantization is documented to crash the NPU LLM compile path.
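The rule is mechanical enough to encode. The helper below is a hypothetical convenience, not part of Optimum-Intel; it only assembles the flags quoted in the rule above. The 4.5-billion-parameter cutoff is an assumed midpoint of the "~4–5 B" guidance, so treat it as a tunable.

```python
def npu_int4_export_args(model_id, output_dir, num_params,
                         task=None, small_model_limit=4_500_000_000):
    """Build an optimum-cli argument list following the NPU INT4 rule."""
    # Group-wise (128) for small models, channel-wise (-1) for larger ones.
    group_size = 128 if num_params <= small_model_limit else -1
    args = ["optimum-cli", "export", "openvino", "--model", model_id]
    if task is not None:
        args += ["--task", task]
    args += ["--weight-format", "int4", "--sym", "--ratio", "1.0",
             "--group-size", str(group_size), output_dir]
    return args


args = npu_int4_export_args(
    "facebook/m2m100_418M", "m2m100_418M_ov_int4_npu",
    num_params=418_000_000, task="text2text-generation-with-past",
)
print(" ".join(args))
```

The resulting list can be handed to subprocess.run(args, check=True) once you are happy with the printed command.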

The precision matrix by NPU generation:

| Mode | NPU 3 (MTL) | NPU 4 (LNL) | NPU 5 (PTL) |
| --- | --- | --- | --- |
| INT8-sym weights | ✅ | ✅ | ✅ |
| INT4-sym, group-size 128 | ✅ | ✅ | ✅ |
| INT4-sym, channel-wise | ✅ | ✅ | ✅ |
| NF4 (channel-wise only) | ❌ | ✅ | ✅ |
| NF4 weights + FP16 KV | ❌ | ✅ (2025.3+) | ✅ |
| FP8 (E4M3/E5M2) | ❌ | ❌ | ✅ |

The NF4 Lunar Lake exclusivity comes verbatim from OpenVINO's GenAI-on-NPU docs: "The NF4 data type is only supported on Intel Core Ultra Processors Series 2 NPUs (formerly codenamed Lunar Lake) and beyond." The FP8 Panther Lake gating is documented in Intel's openvino-ai-plugins-gimp 3.2 release notes: "FP8 model installation is now gated to NPU5000 and newer architectures."
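If you ship one installer across several laptop generations, the precision matrix is easiest to consume as a lookup. This sketch simply transcribes the matrix into a minimum-generation table; the dictionary and function names are hypothetical, and the OpenVINO-version condition on NF4 weights with an FP16 KV cache (2025.3+ on NPU 4) is noted in a comment rather than enforced.

```python
# Minimum NPU generation per mode: 3 = Meteor Lake, 4 = Lunar Lake, 5 = Panther Lake.
MIN_NPU_GENERATION = {
    "INT8-sym weights": 3,
    "INT4-sym, group-size 128": 3,
    "INT4-sym, channel-wise": 3,
    "NF4 (channel-wise only)": 4,
    "NF4 weights + FP16 KV": 4,  # on NPU 4 this also needs OpenVINO 2025.3+
    "FP8 (E4M3/E5M2)": 5,
}


def is_supported(mode, npu_generation):
    """True if the matrix marks `mode` as available on this NPU generation."""
    return npu_generation >= MIN_NPU_GENERATION[mode]


def supported_modes(npu_generation):
    """All modes the matrix allows on this generation, in table order."""
    return [m for m in MIN_NPU_GENERATION if is_supported(m, npu_generation)]


print(supported_modes(3))
print(is_supported("FP8 (E4M3/E5M2)", 4))
```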

Exporting M2M-100

Here are the ~~quantized~~two optimum-cli invocations you'll actually use:

# INT8 weights, stateful with KV cache (the safe default)
optimum-cli export openvino \
  --model facebook/m2m100_418M \
  --task text2text-generation-with-past \
  --weight-format int8 \
  m2m100_418M_ov_int8

# INT4 group-wise, NPU-targeted
optimum-cli export openvino \
  --model facebook/m2m100_418M \
  --task text2text-generation-with-past \
  --weight-format int4 --sym --ratio 1.0 --group-size 128 \
  m2m100_418M_ov_int4_npu

Two pitfalls worth calling out before you spend an afternoon debugging them. --task translation does not exist in Optimum-Intel; it lives in optimum-neuron for AWS Neuron, which is a different toolkit. The correct task name for M2M-100 is text2text-generation-with-past. And the -with-past suffix on the task name is required for a stateful, KV-cached decoder; without it the export produces a stateless decoder that re-encodes the full target prefix on every step, which destroys decode throughput.

The output is a directory containing openvino_encoder_model.xml, openvino_decoder_model.xml, openvino_decoder_with_past_model.xml, and the tokenizer files. Three separate models, each independently compileable to a different device — which is exactly the lever we need for the hybrid execution pattern.

Why M2M-100 Is Architecturally Expensive

Three reasons M2M-100 is harder to deploy on Intel NPU than a comparably-sized decoder-only model:

Full multi-head attention with no GQA or MQA. Look at modeling_m2m_100.py in HuggingFace Transformers: self.k_proj and self.v_proj both project to full embed_dim, and num_heads == num_kv_heads. The HF config has no num_key_value_heads field at all. A 1.2B-parameter M2M-100 decoder has the same per-token KV bandwidth as a 3.8B-parameter Phi-3-mini, because Phi-3 uses GQA with one-quarter the KV heads. We'll do the math in Chapter 2.1. The implication for NPU deployment: M2M-100's decode is bandwidth-bound at smaller parameter counts than modern models. No retrofit; switching to GQA would require retraining from scratch.

Autoregressive decoder with dynamic sequence length. The decoder generates one token at a time, with the KV cache growing on every step. The 2025.3 chunked-prefill feature relaxes this for decoder-only LLMs via LLMPipeline, but no equivalent pipeline exists for OVModelForSeq2SeqLM. OpenVINO 2026.0's NPU GenAI guide lists Whisper, LLM, and VLM pipelines only. M2M-100's decoder is on its own.

Encoder-decoder cross-attention. The decoder reads its own self-attention KV state and the encoder output every step, doubling the per-layer attention overhead relative to a decoder-only model. M2M-100's cross-attention KV cache is the same size as its self-attention KV cache for any given encoder length. This is the price of being a translation model (you keep the source sentence accessible throughout decoding), and there's no way to optimize it away.
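The first and third reasons can be put into numbers with a back-of-envelope KV-cache calculation. The sketch assumes full multi-head attention (one key and one value vector of width d_model per layer per token) and FP16 cache entries. The example figures of 24 decoder layers and d_model = 1024 are assumptions believed to match the 1.2B checkpoint; verify them against the model's config.json before relying on them.

```python
def kv_cache_bytes(num_layers, d_model, seq_len, bytes_per_value=2):
    """Bytes of K plus V state for seq_len tokens under full multi-head attention."""
    return 2 * num_layers * d_model * seq_len * bytes_per_value


def decoder_kv_bytes(num_layers, d_model, source_len, target_len, bytes_per_value=2):
    """Return (self_attention_bytes, cross_attention_bytes) for one sequence."""
    # Self-attention state grows with every generated target token.
    self_attn = kv_cache_bytes(num_layers, d_model, target_len, bytes_per_value)
    # Cross-attention state is computed once from the encoder output.
    cross_attn = kv_cache_bytes(num_layers, d_model, source_len, bytes_per_value)
    return self_attn, cross_attn


# Assumed 1.2B-style decoder: 24 layers, d_model 1024, FP16 cache.
per_token = kv_cache_bytes(24, 1024, 1)
self_attn, cross_attn = decoder_kv_bytes(24, 1024, source_len=128, target_len=128)
print(f"per generated token: {per_token / 1024:.0f} KiB")
print(f"self-attention: {self_attn / 2**20:.0f} MiB, cross-attention: {cross_attn / 2**20:.0f} MiB")
```

With equal source and target lengths the two caches come out the same size, which is the doubling described above.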

The honest deployment recommendation that follows: encoder on NPU (single static prefill pass, ideal NPU fit), decoder on CPU or iGPU (dynamic autoregressive, where the runtime handles variable shapes well). Optimum-Intel does not expose per-component device_map, so this requires either subclassing OVModelForSeq2SeqLM or driving the IR files directly via core.compile_model(...). Chapter 3.1 shows the code.

The Sizing Heuristic, Specific to Intel

For Intel NPU specifically, the rough sizing budget is: a model whose post-quantization weight memory fits in roughly 4–8 GB will run comfortably on Lunar Lake NPU 4. The 16 GB Copilot+ minimum spec gives you the LPDDR5X room; the static-shape constraint sets compile complexity; the LPDDR5X bandwidth ceiling sets decode throughput.

In M2M-100 sizes:

| Variant | Params | FP16 weights | INT8 weights | INT4 weights | Fit |
| --- | --- | --- | --- | --- | --- |
| 418M | 418M | ~840 MB | ~420 MB | ~210 MB | Comfortable on NPU 3+ |
| 1.2B | 1.2B | ~2.4 GB | ~1.2 GB | ~600 MB | Comfortable on NPU 4+ |
| 12B | 12B | ~24 GB | ~12 GB | ~6 GB | Infeasible on consumer NPU at FP16; tight at INT4 |

The 12B variant essentially doesn't fit on consumer Lunar Lake outside of pathological configurations. The 418M and 1.2B variants are the realistic deployment targets.
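The weight figures for these variants follow from parameter count times bits per weight. A minimal sketch of that arithmetic is below; it ignores quantization scales and zero-points, any layers kept at higher precision, and activation and KV-cache memory, so read the results as lower bounds. The function name and the rounded parameter counts are illustrative.

```python
def weight_bytes(num_params, bits_per_weight):
    """Raw weight storage in bytes, ignoring quantization metadata."""
    return num_params * bits_per_weight // 8


VARIANTS = {"418M": 418_000_000, "1.2B": 1_200_000_000, "12B": 12_000_000_000}
PRECISIONS = {"FP16": 16, "INT8": 8, "INT4": 4}

for name, params in VARIANTS.items():
    # Report in decimal megabytes.
    sizes = ", ".join(
        f"{label} {weight_bytes(params, bits) / 1e6:,.0f} MB"
        for label, bits in PRECISIONS.items()
    )
    print(f"{name}: {sizes}")
```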

What This Section Bought You

You should now understand:

- Static shapes are mandatory for non-LLM workloads on Intel NPU; chunked prefill softens this for LLMs since 2025.3 but not for seq2seq.
- The Intel NPU operator coverage is encoder-friendly and decoder-fragile: DetectionOutput, ScatterNDUpdate, INT64 indices, and dynamic Slice/Gather are recurring landmines.
- PTQ is the default path; the NPU LLM quantization rule is --sym --ratio 1.0 with group-size 128 (small) or -1 (large).
- The precision matrix gates by generation: NF4 needs Lunar Lake, FP8 needs Panther Lake.
- M2M-100 export goes through Optimum-Intel with task text2text-generation-with-past; common mistakes are --task translation and a missing -with-past suffix.
- M2M-100 is architecturally expensive for three structural reasons (full MHA, dynamic decode, cross-attention), none of which is fixable in post-training.
- The hybrid pattern is encoder-on-NPU, decoder-on-CPU/iGPU, and the rest of the book builds on it.

The next section turns to performance: given a model that compiles cleanly, what does its latency profile actually look like on Intel hardware, and what does that imply for agent design patterns?
