Agent State & Decision-Making on Constrained Hardware
Managing agent context and memory within NPU limits. Efficient reasoning loops for low-latency inference. Token budget strategies and context windowing. Caching and KV optimization for repeated queries.
2.1 Context Windows and the Memory Wall
The agent's state — what it remembers from past steps and what it uses to make the next decision ...
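To make the context-windowing idea concrete, here is a minimal sketch of one possible policy: keep the system turn and then as many of the most recent turns as fit a fixed token budget. The names `Turn`, `count_tokens`, and `window_context` are hypothetical, and the whitespace token counter is only a stand-in for a real tokenizer. Nothing here is taken from the chapter's own implementation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    text: str


def count_tokens(text: str) -> int:
    # Hypothetical stand-in: a real system would call the model's tokenizer.
    return len(text.split())


def window_context(system: Turn, history: list[Turn], budget: int) -> list[Turn]:
    """Keep the system turn, then the newest turns that fit within `budget` tokens."""
    remaining = budget - count_tokens(system.text)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the token budget")

    kept: list[Turn] = []
    # Walk from newest to oldest and stop at the first turn that does not fit,
    # so the kept window is always a contiguous suffix of the history.
    for turn in reversed(history):
        cost = count_tokens(turn.text)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost

    kept.reverse()  # restore chronological order
    return [system] + kept
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, keeps the window contiguous. That is a design choice of this sketch. Other policies, such as summarizing the dropped turns, are equally plausible.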
2.2 KV Cache Engineering: Reuse, Eviction, and Prefix Sharing
The distinction between KV cache (what you keep in memory) and KV cache bandwidth (what you strea...
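The difference between cache footprint and cache traffic can be estimated with a few lines of arithmetic. The sketch below assumes a standard decoder-only transformer with full attention, where each decode step reads every cached key and value once. The function names, the model shape, and the 50 GB/s bandwidth figure are illustrative assumptions, not values from this chapter or from any specific NPU.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Resident KV cache size in bytes for one sequence.

    The leading factor of 2 covers keys and values; bytes_per_elem=2
    corresponds to fp16/bf16 storage.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def kv_read_time_ms(cache_bytes: int, bandwidth_gb_per_s: float) -> float:
    """Lower bound on the time to stream the whole cache once, in milliseconds.

    Assumes the decode step is purely bandwidth-bound and reads every cached
    key and value exactly once (full attention, no sparsity or eviction).
    """
    return cache_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Illustrative shape only: 32 layers, 8 KV heads, head_dim 128, 4096 tokens, fp16.
footprint = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
per_token_ms = kv_read_time_ms(footprint, bandwidth_gb_per_s=50.0)

print(f"footprint: {footprint / 2**20:.0f} MiB")
print(f"KV read time per decoded token: {per_token_ms:.2f} ms")
```

Under these assumed numbers the cache occupies 512 MiB, and streaming it once per decoded token costs roughly 10.7 ms before any compute. The footprint is paid once in capacity, while the bandwidth cost is paid again on every token.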
2.3 Reasoning Loops Under Constraint
Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, ...