
2.3 Reasoning Loops Under Constraint

So far this chapter has been about memory: the cache, the context, what fits and what doesn't. This section is about how decisions get made inside those limits. Cloud agents can afford to think out loud at length; NPU agents have to think efficiently. That changes the architecture of the loop itself.

The Cost of Thinking Out Loud

The dominant agent pattern over the last few years has been some flavor of ReAct: the model alternates "Thought" steps with "Action" steps, narrating its reasoning before each tool call. It's powerful, well-studied, and largely the right idea — but the costs are different on an NPU than in the cloud.

Each turn of a reasoning loop costs you:

  • Decode tokens for the thought (often 50–200 tokens per step, generated at INT4)
  • A tool call round-trip to the CPU and back
  • Prefill of the tool result into the next reasoning step (frequently the dominant cost — tool outputs are often longer than the thought that produced them)
  • Cache growth that brings you closer to eviction

A 5-step ReAct loop with verbose narration can easily run 30 seconds end-to-end on a mobile NPU. The cloud version of the same agent runs in about 3 seconds. The gap does not come from the NPU being 10x slower at any single step; it comes from the loop accumulating many small costs that the cloud absorbs invisibly.
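To see how the per-step costs listed above add up, here is a rough back-of-the-envelope model. It is a sketch, not a measurement: `StepCost`, `loop_latency_ms`, and every number in it (token counts, decode and prefill throughput, tool round-trip time) are illustrative assumptions, and it ignores the slowdown that cache growth adds to later steps.

```python
from dataclasses import dataclass


@dataclass
class StepCost:
    thought_tokens: int       # tokens decoded for the "Thought"
    observation_tokens: int   # tool-output tokens prefilled into the next step
    tool_roundtrip_ms: float  # CPU tool call and return


def loop_latency_ms(steps, decode_tok_per_s, prefill_tok_per_s):
    """Sum decode, prefill, and tool time over every step of a loop."""
    decode = sum(s.thought_tokens for s in steps) / decode_tok_per_s * 1000
    prefill = sum(s.observation_tokens for s in steps) / prefill_tok_per_s * 1000
    tools = sum(s.tool_roundtrip_ms for s in steps)
    return {
        "decode_ms": decode,
        "prefill_ms": prefill,
        "tool_ms": tools,
        "total_ms": decode + prefill + tools,
    }


# Assumed numbers, for illustration only: 5 identical steps,
# 20 tok/s decode and 400 tok/s prefill on the device.
steps = [StepCost(thought_tokens=60, observation_tokens=1200, tool_roundtrip_ms=200)] * 5
estimate = loop_latency_ms(steps, decode_tok_per_s=20, prefill_tok_per_s=400)
print(estimate["total_ms"] / 1000, "seconds")
```

With these assumed inputs the loop lands at roughly half a minute, and prefill of the tool results costs as much as decoding the thoughts. Plugging in your own measured throughput is the quickest way to see which term dominates on your hardware.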

This isn't an argument against reasoning loops. It's an argument for being deliberate about what each step buys you.

Three Loop Architectures, From Cheap to Expensive

You have a small set of patterns for reasoning loops, and they trade off latency against capability.

Single-Shot

The model receives the prompt and produces a complete response in one generation, with no intermediate tool calls or reasoning steps. Tools, if any, are called in a separate non-reasoning pass beforehand to gather context.

[gather context with deterministic logic]
    ↓
[single prompt + context]
    ↓
[model generates full response]

This is the fastest pattern. Use it when the task fits in one shot: classification, short answers, templated transformations. It's also the right starting point for any agent — if you can do the job in single-shot, the rest of this section is overhead.

Plan-Then-Execute

The model first generates a plan (a short sequence of intended tool calls), then a deterministic executor runs the plan and returns results, then the model formats the final response. Reasoning happens twice: once to plan, once to summarize.

[prompt]
    ↓
[model generates plan]
    ↓
[executor runs tools in order — no model in the loop]
    ↓
[model generates final response from results]

This is significantly cheaper than ReAct because the executor doesn't need to wake the model between tools. The trade-off is reduced adaptivity — the plan can't respond to surprising tool outputs. For workflows with predictable structure (search → retrieve → summarize, lookup → calculate → format), plan-then-execute hits a sweet spot.
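A minimal sketch of the pattern follows. It assumes the model emits its plan as a JSON list of `{"tool": ..., "args": ...}` objects and that `generate` is some callable wrapping your model; both are hypothetical conventions chosen for illustration, not a fixed format. Steps here are independent of each other, so passing one tool's output into the next would need an extra convention.

```python
import json


def execute_plan(plan_json, tools):
    """Run each planned call in order. The model is not invoked here."""
    results = []
    for step in json.loads(plan_json):
        name = step["tool"]
        if name not in tools:
            results.append({"tool": name, "error": "unknown tool"})
            break  # stop early: the plan cannot adapt, so do not guess
        try:
            output = tools[name](**step.get("args", {}))
            results.append({"tool": name, "output": output})
        except Exception as exc:
            results.append({"tool": name, "error": str(exc)})
            break
    return results


def plan_then_execute(prompt, generate, tools):
    """Two model calls in total: one to plan, one to write the answer."""
    plan_json = generate(f"Plan the tool calls, as a JSON list, for: {prompt}")
    results = execute_plan(plan_json, tools)
    return generate(f"Answer {prompt!r} using these results: {json.dumps(results)}")
```

Stopping at the first failed step is one reasonable choice among several; it reflects the trade-off described above, where the executor has no way to replan around a surprising result.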

ReAct / Interleaved Reasoning

The model alternates between reasoning and tool calls, deciding each next step based on the result of the previous one. Maximum adaptivity, maximum cost.

[prompt]
    ↓
[thought] → [tool] → [observation]
    ↓
[thought] → [tool] → [observation]
    ↓
... (continue until done)
    ↓
[final response]

Use this when steps genuinely depend on prior results in ways you can't predict. Don't use it as a default — most "agentic" tasks decompose into plan-then-execute or even single-shot if you look at them carefully.

Bounding the Loop

When you do need ReAct-style reasoning, the practical question becomes: how do you stop the loop before it runs forever?

The naive bound is a step count, but step count alone is a blunt instrument. Better bounds combine several signals:

  • Step count with a hard maximum (typically 5–10 on an NPU agent)
  • Token budget for the entire loop, summed across thoughts and observations
  • Latency budget with wall-clock timeout, after which the model is asked to summarize whatever it has
  • Confidence signal from the model itself ("I have enough information to answer now")
  • Tool-call repetition detector — if the model calls the same tool with the same arguments twice, it's stuck

These bounds should be visible to the model in the prompt, so it can self-regulate. A model that knows it has at most 3 more steps allocates them differently than one that thinks it has unlimited time.
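The bounds above can be combined in one small object that the loop consults after every step. This is a sketch under stated assumptions: `LoopBudget`, its default limits, and the stop-reason strings are hypothetical names and values picked for illustration. The model's own confidence signal is not modeled here; the caller would check for it in the model output before calling `record`.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class LoopBudget:
    max_steps: int = 6          # assumed defaults; tune for your device
    max_tokens: int = 2000
    max_seconds: float = 10.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)
    seen_calls: set = field(default_factory=set)

    def record(self, tool, args, tokens_used):
        """Record one finished step. Returns a stop reason, or None to continue."""
        self.steps += 1
        self.tokens += tokens_used
        call = (tool, json.dumps(args, sort_keys=True))
        if call in self.seen_calls:
            return "repeated_tool_call"  # same tool, same arguments: stuck
        self.seen_calls.add(call)
        if self.steps >= self.max_steps:
            return "step_limit"
        if self.tokens >= self.max_tokens:
            return "token_budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "latency_budget"
        return None

    def prompt_hint(self):
        """One line to append to the prompt so the model can self-regulate."""
        steps_left = max(self.max_steps - self.steps, 0)
        tokens_left = max(self.max_tokens - self.tokens, 0)
        return f"You have at most {steps_left} more steps and about {tokens_left} tokens left."
```

When `record` returns a reason, the loop would typically make one final model call asking for a summary of whatever has been gathered, rather than failing outright.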

The Reasoning-Compression Trade-off

Long reasoning traces are expensive to keep in the cache. The natural reflex is to compress them — summarize older reasoning into a few sentences before the next step. This works, but compression is itself a model call, with its own latency and risk of dropping important state.

The pragmatic patterns:

Don't compress within a turn. Within a single user interaction, keep the reasoning trace verbatim. Compression overhead per step usually exceeds savings.

Do compress between turns. When a user's task completes and a new one begins, summarize the previous task into a compact memory entry and evict the verbose trace. The summary becomes part of long-term memory; the original tokens leave the cache.

Separate working memory from long-term memory. Working memory is the active cache for the current task. Long-term memory is a separate store — vector DB, structured records, or just plain text — that the agent retrieves into context only when relevant. The NPU never tries to hold the user's entire history in attention.

This separation maps cleanly onto how humans operate: you don't hold every conversation you've ever had in active recall, you store summaries and retrieve them on demand.
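The split between working memory and long-term memory can be sketched in a few lines. `AgentMemory` and its methods are hypothetical names; `summarize` stands in for a model call, and the keyword-overlap retrieval is a deliberately naive placeholder for a vector DB or structured store.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentMemory:
    summarize: Callable[[str], str]                      # assumed: wraps a model call
    working: List[str] = field(default_factory=list)     # verbatim trace, current task
    long_term: List[str] = field(default_factory=list)   # compact summaries

    def add_step(self, text):
        """Within a turn, keep the trace verbatim; no compression."""
        self.working.append(text)

    def end_turn(self):
        """Between turns, summarize once and evict the verbose trace."""
        if self.working:
            self.long_term.append(self.summarize("\n".join(self.working)))
            self.working.clear()

    def retrieve(self, query, k=2):
        """Return up to k summaries sharing words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(m.lower().split())), m) for m in self.long_term]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in scored[:k]]
```

Only the output of `retrieve` is placed back into context for a new task, so the cache holds the active trace plus a handful of relevant summaries rather than the full history.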

Tool Selection as a Decision, Not a Search

A common waste pattern on NPUs is listing every available tool in every prompt. If your agent has 30 tools, at roughly 50 tokens per definition that's 1,500+ tokens of tool definitions in the cache for every single decision, when most decisions need only one or two tools.

Better patterns:

Pre-filter tools to the relevant subset. Use a small classifier or simple keyword matching to narrow 30 tools to 3–5 before sending to the model. The model never sees tools it shouldn't be considering.

Hierarchical tool catalogs. Group tools into categories. The model first picks a category (with brief descriptions of ~5 categories), then sees the tools in that category. Two cheap decisions instead of one expensive one.

Implicit defaults. If a tool is overwhelmingly the right choice for a category of input, route to it deterministically rather than asking the model. "Calculate" → calculator; "What time is it in Tokyo?" → time tool. Save the model's attention for ambiguous cases.

These patterns aren't sophisticated, but they're surprisingly absent from many agent implementations because they require deliberate engineering rather than relying on the model. On an NPU, they're the difference between a snappy assistant and a slow one.
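The three patterns can be chained into one selection step: try a deterministic route first, then a keyword pre-filter, and fall back to the two-stage category choice only when nothing matched. The sketch below assumes a hypothetical `CATALOG`, hypothetical tool names and keywords, and a `choose(query, options)` callable that wraps the model and returns one key from `options`.

```python
import re

# Hypothetical catalog: category -> description and tools (tool -> keywords).
CATALOG = {
    "math": {
        "description": "arithmetic and unit conversion",
        "tools": {"calculator": {"calculate", "sum", "percent"},
                  "unit_converter": {"convert", "miles", "celsius"}},
    },
    "time": {
        "description": "clocks, time zones, and calendars",
        "tools": {"world_clock": {"time", "timezone"},
                  "calendar": {"meeting", "schedule", "appointment"}},
    },
    "weather": {
        "description": "forecasts and current conditions",
        "tools": {"forecast": {"weather", "rain", "forecast"}},
    },
}

# Implicit defaults: inputs where one tool is overwhelmingly the right choice.
DEFAULT_ROUTES = [
    (re.compile(r"^\s*calculate\b", re.IGNORECASE), "calculator"),
    (re.compile(r"\bwhat time is it\b", re.IGNORECASE), "world_clock"),
]


def route_by_default(query):
    for pattern, tool in DEFAULT_ROUTES:
        if pattern.search(query):
            return tool
    return None


def prefilter_tools(query, catalog, max_tools=3):
    """Keyword match over all tools; returns the best few names."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scored = []
    for category in catalog.values():
        for name, keywords in category["tools"].items():
            hits = len(words & keywords)
            if hits:
                scored.append((-hits, name))
    return [name for _, name in sorted(scored)[:max_tools]]


def pick_hierarchically(query, catalog, choose):
    """Two cheap model decisions: a category, then a tool inside it."""
    category = choose(query, {n: c["description"] for n, c in catalog.items()})
    if category not in catalog:
        return None
    tools = catalog[category]["tools"]
    tool = choose(query, {n: ", ".join(sorted(k)) for n, k in tools.items()})
    return tool if tool in tools else None


def select_tools(query, catalog, choose):
    tool = route_by_default(query)
    if tool:
        return [tool]                 # no model call at all
    shortlist = prefilter_tools(query, catalog)
    if shortlist:
        return shortlist              # model sees only these definitions
    tool = pick_hierarchically(query, catalog, choose)
    return [tool] if tool else []
```

The ordering is a design choice rather than a rule: it spends model calls only on queries that the cheap deterministic stages could not narrow down.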

A Worked Example: Reasoning Budget for a Voice Assistant

To make this concrete, here's a budget for a hypothetical NPU voice assistant targeting <2 second response time:

Component                                 Budget
ASR (speech to text)                      300 ms
Intent classification (tiny model)         50 ms
Tool selection + pre-filter                50 ms
Main model prefill (with prefix cache)    200 ms
Main model decode (~30 tokens)            600 ms
Tool execution (if needed)                200 ms
TTS (text to speech)                      400 ms
Orchestration overhead                    200 ms
Total                                    2000 ms

That budget allows essentially zero room for a multi-step reasoning loop. Voice assistants on NPUs are necessarily single-shot or plan-then-execute. Each additional ReAct step repeats a prefill, a decode, and a tool execution, which at the figures above adds about a full second per step and breaks the conversational rhythm users expect.

The lesson generalizes: your latency budget dictates your loop architecture. Pick the architecture from the budget, not the other way around.
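The same reasoning can be written as a quick check. The component figures are the ones from the budget above; the function names, the assumption that one extra reasoning step costs one more prefill, decode, and tool execution, and the wording of the returned labels are illustrative.

```python
# Component budget from the worked example, in milliseconds.
BUDGET_MS = {
    "asr": 300,
    "intent_classification": 50,
    "tool_selection": 50,
    "prefill": 200,
    "decode": 600,
    "tool_execution": 200,
    "tts": 400,
    "orchestration": 200,
}


def extra_steps_affordable(target_ms, budget=BUDGET_MS):
    """How many additional model+tool steps fit in the remaining headroom."""
    # Assumption: one extra step = one more prefill + decode + tool execution.
    step_ms = budget["prefill"] + budget["decode"] + budget["tool_execution"]
    headroom = target_ms - sum(budget.values())
    return max(headroom // step_ms, 0)


def pick_architecture(target_ms, budget=BUDGET_MS):
    if target_ms < sum(budget.values()):
        return "over budget: trim the pipeline first"
    extra = extra_steps_affordable(target_ms, budget)
    if extra == 0:
        return "single-shot or plan-then-execute"
    return f"ReAct with at most {extra} extra steps"


print(pick_architecture(2000))
print(pick_architecture(5000))
```

Starting from the latency target and working backwards like this makes the loop architecture an output of the budget rather than an input to it.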

Closing Chapter 2

You came into this chapter with weights, operators, and TOPS. You leave it with a coherent picture of how an agent actually operates within an NPU's limits:

  • Context length translates directly into memory cost via the KV cache, often exceeding the model weights themselves
  • Cache reuse — within sessions and across them — is the highest-leverage latency optimization available
  • Reasoning loops have a real per-step cost that compounds quickly on NPUs and forces architectural restraint
  • Working memory and long-term memory should be separated, with the NPU holding only what's active and retrieving the rest on demand
  • Tool selection is a decision problem in its own right, not something to delegate to a model staring at 30 options at once

Chapter 3 turns to the other side of that last point: how to design the tools themselves, where they should run, and how to integrate them efficiently with an NPU-bound reasoning core.

