Agent State & Decision-Making on Constrained Hardware

This chapter covers managing agent context and memory within NPU limits, efficient reasoning loops for low-latency inference, token budget strategies and context windowing, and caching and KV optimization for repeated queries.

2.1 Context Windows and the Memory Wall

The agent's state — what it remembers from past steps and what it uses to make the next decision ...

2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing

The distinction between KV cache (what you keep in memory) and KV cache bandwidth (what you strea...

2.3 Reasoning Loops Under Constraint

Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, ...
