Production Deployment & Observability

Model serving architectures (ONNX, TensorRT, TVM). Monitoring latency, throughput, and reliability. A/B testing and progressive rollout strategies. Cost optimization and resource allocation.

4.1 Serving NPU Models with OVMS

A development-time compile_model(...) call is not a production deployment. Once your agent is rea...

4.2 Telemetry: What Works, What Doesn't, and What's Missing

You can't operate what you can't observe. NPU agents have a harder observability story than CPU- ...

4.3 A/B Testing, Canaries, and Hotswaps

Models drift. Drivers update. Quantization schemes change. The NPU you tested against in February...

4.4 Security and Privacy on the Edge

"It runs on the device, so it's private" is the marketing line. It's also a half-truth that has c...

1.1 Understanding NPU Architecture

1.2 Computational Constraints & Model Optimization

1.3 Latency, Throughput, and Hardware-Aware Patterns

1.4 The Accuracy Cost of Quantization

1.5 Speculative Decoding

2.1 Context Windows and the Memory Wall

2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

2.3 Reasoning Loops Under Constraint

3.1 Designing Tools for NPU-Bound Agents

3.2 Local-NPU vs Cloud Tools: A Real Trade-Off Table

3.3 Multi-Device Orchestration on a Single SoC

3.4 Structured Outputs and Constrained Decoding

4.1 Serving NPU Models with OVMS

4.2 Telemetry: What Works, What Doesn't, and What's Missing

4.3 A/B Testing, Canaries, and Hotswaps

4.4 Security and Privacy on the Edge

5.1 What's Actually Shipping on Intel NPUs

5.2 A Worked Agentic Translation Assistant

5.3 Anti-Patterns and Lessons

Glossary

References

Production Deployment & Observability

4.1 Serving NPU Models with OVMS

4.2 Telemetry: What Works, What Doesn't, and What's Missing

4.3 A/B Testing, Canaries, and Hotswaps

4.4 Security and Privacy on the Edge

Search Results