Foundations of NPU-Optimized Agents
NPU architecture and computational constraints. Model quantization and optimization for NPU deployment. Latency profiles and throughput optimization. Hardware-aware agent design patterns.
Agent State & Decision-Making on Constrained Hardware
Managing agent context and memory within NPU limits. Efficient reasoning loops for low-latency inference. Token budget strategies and context windowing. Caching and KV optimization for repeated queries.
Tool Use & Integration Patterns
Designing lightweight tools for NPU-based agents. Async I/O and non-blocking integrations. Local vs. remote tool execution trade-offs. Building tool abstractions that respect hardware constraints.
Production Deployment & Observability
Model serving architectures (ONNX, TensorRT, TVM). Monitoring latency, throughput, and reliability. A/B testing and progressive rollout strategies. Cost optimization and resource allocation.
Real-World Case Studies & Best Practices
Building customer-facing NPU agents (chatbots, assistants). Batch vs. streaming inference strategies. Handling fallbacks and graceful degradation. Lessons learned and anti-patterns to avoid.
Appendices
Glossary of terms and consolidated source references for the book.