References

These are the primary sources for the technical claims in the book. Where multiple sources existed for the same fact, the most authoritative was used: vendor docs first, then peer-reviewed papers, then independent measurement. Sources marked † are referenced but were not directly accessed at the time of writing; treat their specific details as load-bearing but unverified, and re-check before depending on them.

Entries are organized by source type.

Intel & OpenVINO Primary Sources

OpenVINO Documentation (docs.openvino.ai). The canonical reference for OpenVINO Runtime, OpenVINO GenAI, NPU plugin, and Optimum-Intel. Pages cited throughout the book:

about-openvino/compatibility-and-support/supported-operations.html — the operator coverage matrix per release. Used in Chapter 1.2.
openvino-workflow-generative/inference-with-genai-on-npu.html — the canonical "GenAI on NPU" guide. Source for the INT4-sym / --ratio 1.0 / group-size quantization rule, the NF4 Lunar-Lake-only constraint, and the LLMPipeline NPU property table in Chapters 1.2 and 2.2.
openvino-workflow/running-inference/inference-devices-and-modes/npu-device.html — the NPU plugin reference. Source for CACHE_DIR, MAX_PROMPT_LEN, NPUW_LLM_PREFILL_CHUNK_SIZE, PREFILL_HINT, GENERATE_HINT, NPUW_LLM_ENABLE_PREFIX_CACHING, and PERFORMANCE_HINT properties.

OpenVINO Release Notes. Per-version feature deltas:

2025.2 — encoder-side QKV projection and MHA graph-level fusions for transformer encoders.
2025.3 — chunked prefill on NPU (PREFILL_HINT=DYNAMIC, NPUW_LLM_PREFILL_CHUNK_SIZE=1024); NF4 + FP16 KV cache on Lunar Lake.
2025.4 — 8K context GA on NPU, prefix caching (NPUW_LLM_ENABLE_PREFIX_CACHING:YES), multinomial sampling on NPU; memory-mapped cached models.
2026.0 — NPU compiler decoupled from OEM driver; speculative decoding on NPU.
2026.1 — TextEmbeddingPipeline NPU support; current stable as of May 2026.
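
As a rough illustration of how these per-version deltas affect configuration, the sketch below builds an NPU property dictionary that only includes keys the release notes above associate with a given version. The function name, the default cache path, and the default prompt length are hypothetical; the property names and values are the ones quoted in the entries above, and the current LLMPipeline property table should be checked before relying on any of them.

```python
def npu_llm_properties(version, cache_dir="./npu_cache", max_prompt_len=1024):
    """Build an NPU property dict gated on an OpenVINO (year, minor) version.

    Thresholds mirror the release-note entries listed above; the defaults
    for cache_dir and max_prompt_len are illustrative assumptions.
    """
    props = {"CACHE_DIR": cache_dir, "MAX_PROMPT_LEN": max_prompt_len}
    if version >= (2025, 3):
        # Chunked prefill is listed as a 2025.3 feature.
        props["PREFILL_HINT"] = "DYNAMIC"
        props["NPUW_LLM_PREFILL_CHUNK_SIZE"] = 1024
    if version >= (2025, 4):
        # Prefix caching is listed as a 2025.4 feature.
        props["NPUW_LLM_ENABLE_PREFIX_CACHING"] = "YES"
    return props


if __name__ == "__main__":
    for v in [(2025, 2), (2025, 3), (2025, 4)]:
        print(v, sorted(npu_llm_properties(v)))
```

The dictionary is kept separate from any pipeline construction so the version gating can be inspected without an NPU or an OpenVINO install present.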

OpenVINO Model Hub (huggingface.co/OpenVINO). Source of the DeepSeek-R1-Distill-Llama-8B INT4 benchmark at 6.10 tok/s on Intel NUC 14 Pro (Lunar Lake) used as the ITL anchor in Chapter 1.3. Per-model benchmark pages list TTFT, ITL, and target device.

Intel linux-npu-driver (github.com/intel/linux-npu-driver). The in-tree Linux driver for Intel NPU. Apache 2.0. Source for the "OS support spans Windows and Linux" claim in Chapter 1.1.

Intel Lunar Lake Launch (Intel Newsroom, September 3, 2024). "Intel Core Ultra Series 2 Processors Deliver Unmatched Power-Efficient AI Performance and x86 Compatibility." Source for the 48 TOPS NPU 4 figure, the LPDDR5X-8533 / 136.5 GB/s spec, and the single-tile compute architecture.
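
The 136.5 GB/s figure can be sanity-checked from the LPDDR5X-8533 transfer rate. The sketch below assumes a 128-bit memory bus, which is not stated in the entry above and should be confirmed against Intel's specification; the function name is illustrative.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s, bus_width_bits):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    # Assumption: 128-bit bus. 8533 MT/s * 16 bytes per transfer.
    print(f"{peak_bandwidth_gb_s(8533, 128):.1f} GB/s")  # 136.5 GB/s
```

This is a theoretical peak; sustained bandwidth seen by the NPU will be lower.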

Intel Panther Lake CES 2026 Announcement. Source for the 50 TOPS NPU 5, native FP8 (E4M3/E5M2), programmable LUT for activations, and Intel 18A process claims. Press materials at intel.com/content/www/us/en/newsroom/news/. † The specific NCE count (3 NCEs, each ~2× wider) is from secondary press coverage and should be re-verified against Intel's formal whitepapers when available.

Intel openvino-ai-plugins-gimp 3.2 Release Notes (github.com/intel/openvino-ai-plugins-gimp/releases). Source for the verbatim "FP8 model installation is now gated to NPU5000 and newer architectures" quote in Chapter 1.2.

Intel Community Forums (community.intel.com). Thread 1735991 (February 2026) on DetectionOutput NPU/iGPU compile failures; GitHub openvinotoolkit/openvino issue #13594 on ScatterNDUpdate rejection. Background for the operator-coverage landmines in Chapter 1.2.

Microsoft & Windows Copilot+ Sources

"Phi Silica, small but mighty on-device SLM" (Windows Experience Blog, December 2024). The canonical reference for Phi Silica architecture: CPU tokenizer + embedding + LM-head, NPU transformer, CPU decode with N=64 KV sliding window. Source for the verbatim "Context processing involves intense parallel computation..." quote in Chapter 1.1 (the closest Microsoft analog to a "decode is bandwidth-bound" statement).

"DeepSeek-R1-Distill on Phi Silica stack" (Windows Developer Blog, 2026). Microsoft's extension of the Phi Silica architecture to 1.5B and 14B reasoning models. Source for the 1.5B at ~40 tok/s and 14B at ~8 tok/s figures on Snapdragon X NPU in Chapter 2.3. † Specific numbers vary across blog updates; re-check against the canonical post.

Click to Do documentation (learn.microsoft.com/windows/ai/apis/phi-silica). The Phi Silica frontend's prompt templates and single-turn execution model. Background for the single-shot positioning in Chapter 2.3.

Phi Silica Windows Update KBs. KB5079266, KB5084176, KB5089866 — the cumulative updates that progressively rolled Phi Silica out to Intel Copilot+ hardware. † Specific KB numbers are from third-party Windows news aggregators; verify against the Microsoft Update Catalog before quoting.

Hugging Face & Optimum-Intel

Hugging Face Optimum-Intel Documentation (huggingface.co/docs/optimum/intel). Source for the optimum-cli export openvino command syntax, the task-name conventions (including the text2text-generation-with-past vs. --task translation distinction in Chapter 1.2), and the OVModelForSeq2SeqLM / OVModelForCausalLM class hierarchy.

M2M-100 Model Cards. facebook/m2m100_418M, facebook/m2m100_1.2B, facebook/m2m100-12B-avg-5-ckpt. Source for: model architectures (24 encoder + 24 decoder layers on 1.2B, 16 heads, 64 head_dim), MIT license, the 128,112 vocabulary size, the forced_bos_token_id requirement, and the decoder_start_token_id = eos_token_id = 2 convention. Also: the transformers library's modeling_m2m_100.py source for the no-GQA architectural claim.

NLLB-200 Model Card (facebook/nllb-200-distilled-600M, etc.). Source for the CC-BY-NC 4.0 license and the shared M2M100ForConditionalGeneration class implementation.

Phi-3-mini-3.8B Model Card (microsoft/Phi-3-mini-4k-instruct). Source for the 32 layers / 32 heads / 8 KV heads / GQA architecture used in the KV-cache comparison in Chapter 2.1.

DeepSeek-R1-Distill-Llama-8B Model Card (deepseek-ai/DeepSeek-R1-Distill-Llama-8B, and its OpenVINO Model Hub quantized variant). The 8B parameter count and Llama architecture lineage are referenced in Chapters 1.3 and 2.3.

Hugging Face × Intel "Build an Agent with Qwen3-8B on Intel iGPU" Blog (huggingface.co/blog). Closest existing analog to a published multi-step agent on Intel hardware. Runs on iGPU, not NPU — used in Chapter 2.3 as the "negative result" reference for the absence of NPU-targeted agent guidance.

Independent Benchmarks & Analysis

MLPerf Client v0.6 (mlcommons.org/benchmarks/client). The industry-standard client-side ML benchmark suite. Intel's Core Ultra Series 1 Meteor Lake numbers (Llama 2 7B: TTFT 1.09 s, 18.55 tok/s sustained) used as the TTFT/ITL anchor in Chapter 1.3 come from MLPerf Client v0.6 submitter data.

Chips and Cheese, "Intel Meteor Lake's NPU" (chipsandcheese.com/p/intel-meteor-lakes-npu). The independent measurement of 9.5 TOPS at 1.16 GHz for NPU 3720, against Intel's marketing claim of ~11.5 TOPS. Source for the "TOPS is the marketing number" framing in Chapter 1.1.

IPEX-LLM Quickstart Documentation. Source for the "30 s to several minutes cold start for 3B–8B LLM INT4 on NPU" anchor in Chapter 1.3.

Markaicode and Audacity OpenVINO Documentation. Source for warm-start LLM load times (<3 s) and the 10–30 s cold / 1–3 s warm range for Whisper/MusicGen/Demucs. † These are secondary developer-blog sources; treat the specific numbers as illustrative ranges, not validated SLAs.

Foundational Papers

Fan, A. et al. (2020). "Beyond English-Centric Multilingual Machine Translation." arXiv:2010.11125. The M2M-100 paper. Background for the architecture and training data choices.

Yao, S. et al. (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629. The foundational ReAct paper. Background for the reasoning-architecture discussion in Chapter 2.3.

Ainslie, J. et al. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245. The GQA paper. Background for the MHA-vs-GQA KV-cache comparison in Chapter 2.1.
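
For orientation, the MHA-vs-GQA comparison reduces to a per-token KV-cache size formula. The sketch below applies it to the M2M-100 1.2B decoder figures listed under the model cards above (24 layers, 16 heads, 64 head_dim) and to a hypothetical GQA variant with 4 KV heads, which is not a real checkpoint. It counts decoder self-attention only and assumes FP16 cache entries.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Self-attention KV-cache bytes added per generated token.

    The leading 2 accounts for one key and one value vector per layer.
    bytes_per_value=2 assumes an FP16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    mha = kv_bytes_per_token(layers=24, kv_heads=16, head_dim=64)
    gqa = kv_bytes_per_token(layers=24, kv_heads=4, head_dim=64)  # hypothetical
    print(mha, gqa, mha // gqa)  # 98304 24576 4
```

The ratio equals the number of query heads per KV head, which is why GQA shrinks the cache without changing layer count or head dimension.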

Williams, S., Waterman, A., Patterson, D. (2009). "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM 52(4). The original roofline-model paper. Background for the bandwidth-vs-compute analysis in Chapter 1.3.
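
A minimal roofline-style estimate of the decode bandwidth ceiling, assuming every generated token requires reading all weights once from DRAM. It ignores KV-cache reads, quantization scale overhead, and activations, so it is an upper bound rather than a prediction. The inputs are the 136.5 GB/s Lunar Lake figure and an 8B-parameter INT4 model, both taken from entries above; the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, params_billions, bits_per_weight):
    """Upper bound on decode tokens/s when weight reads dominate."""
    weight_gb = params_billions * bits_per_weight / 8  # decimal GB of weights
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # 8B parameters at 4 bits per weight is about 4 GB read per token.
    print(f"{decode_ceiling_tok_s(136.5, 8, 4):.1f} tok/s")  # 34.1 tok/s
```

Measured throughput sits below this ceiling, which is the gap the Chapter 1.3 analysis is concerned with.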

NLLB Team (2022). "No Language Left Behind: Scaling Human-Centered Machine Translation." arXiv:2207.04672. The NLLB-200 paper. Background for M2M-100's relationship to its successor and the architectural-class continuity.

Standards Bodies & Competitor Vendor Pages

Apple Neural Engine — Apple Developer documentation (developer.apple.com/machine-learning). Source for the M4 family 16-core ANE / 38 TOPS figure. Core ML is the only access path. Background in Chapter 1.1.

Qualcomm Snapdragon X Elite Product Page (qualcomm.com). Source for the 45 TOPS Hexagon NPU figure. Background for the Phi Silica-on-Snapdragon-X numbers (TTFT 230 ms, 20 tok/s) referenced throughout. † Phi Silica's published numbers are on Snapdragon X, not Intel — this is called out explicitly in Chapters 1.3 and 2.3 to prevent cross-platform extrapolation errors.

AMD Ryzen AI 300 / XDNA 2 Documentation (amd.com). Source for XDNA 2's 50 TOPS INT8 + 50 TOPS Block FP16 specs and the separate-IP-block integration model contrast.

Google Edge TPU (coral.ai). Background for the systolic-array architecture comparison in Chapter 1.1.

On Verification and Recency

This book was written in May 2026. NPU silicon, OpenVINO releases, and Phi Silica documentation are all moving targets — features cited as "2025.3+" or "2026.0" will be displaced by newer releases within months of publication. When in doubt, re-check the canonical Intel and Microsoft sources for the current state of:

The LLMPipeline NPU property table (configuration knobs change frequently)
The validated NPU model list (grows with each release)
The precision matrix by generation (NF4, FP8 support evolves)
Phi Silica's deployment surface on Intel hardware (specific KB numbers and rollout coverage)
Lunar Lake and Panther Lake performance numbers (Intel publishes refreshed datapoints regularly)

The technical reasoning in the book — bandwidth ceilings, encoder/decoder partition, single-shot-vs-ReAct tradeoff — outlasts any specific version. The numbers don't.
