2.3 Reasoning Loops Under Constraint

Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, and decode at roughly 6–20 tok/s. The question this section answers: given that decode budget, what reasoning architectures actually work? The naive answer (bolt a ReAct loop on top and let the agent think) collides with the latency ceiling in a way worth being specific about.

Three Reasoning Architectures

Three patterns dominate agent design, and they sort cleanly by NPU compatibility:

Single-shot. One prompt, one response. No loop. The agent reads the input, produces the output, done. Translation is the canonical single-shot task: source sentence in, target sentence out. The cost is one prefill plus one decode. Phi Silica's Click to Do affordances are single-shot. This is the NPU-native pattern.

Plan-then-execute. The model produces a plan once, then executes the plan deterministically (often without further model calls, or with a small number of pre-determined model calls).

For a translation assistant: "rewrite this paragraph for a teenage audience and translate to French" decomposes to (1) rewrite via Phi-3.5-mini, (2) translate via M2M-100. The plan is one LLM call; the execution is a fixed pipeline. Two model calls total, predictable latency.

ReAct (Reason + Act). The model alternates between thinking and tool-calling in a loop, with each iteration informed by the last. The hallmark is that the number of iterations is not known in advance.

This is the dominant pattern for cloud agents and the one developers reach for by default. It's also the pattern that NPU latency budgets cannot afford.

Three Loop Architectures, From Cheap to Expensive


Single-Shot

The model receives the prompt and produces a complete response in one generation, with no intermediate tool calls or reasoning steps. Tools, if any, are called in a separate non-reasoning pass beforehand to gather context.

[gather context with deterministic logic]
    ↓
[single prompt + context]
    ↓
[model generates full response]

This is the fastest pattern. Use it when the task fits in one shot: classification, short answers, templated transformations. It's also the right starting point for any agent — if you can do the job in single-shot, the rest of this section is overhead.

Plan-Then-Execute


[prompt]
    ↓
[model generates plan]
    ↓
[executor runs tools in order — no model in the loop]
    ↓
[model generates final response from results]

This is significantly cheaper than ReAct because the executor doesn't need to wake the model between tools. The trade-off is reduced adaptivity — the plan can't respond to surprising tool outputs. For workflows with predictable structure (search → retrieve → summarize, lookup → calculate → format), plan-then-execute hits a sweet spot.
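The plan-then-execute flow above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `fake_planner`, the `TOOLS` registry, and the JSON plan shape are all hypothetical stand-ins, and a real agent would replace `fake_planner` with its single planning call to the model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical tool registry; real tools would wrap search, retrieval, and so on.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda arg: f"price of {arg} is 40",
    "calculate": lambda arg: str(sum(int(tok) for tok in arg.split("+"))),
}


def fake_planner(prompt: str) -> str:
    """Stand-in for the one planning call; returns the plan as JSON text."""
    return json.dumps([
        {"tool": "lookup", "arg": "widget"},
        {"tool": "calculate", "arg": "40+2"},
    ])


def execute_plan(plan_json: str) -> List[str]:
    """Run every step in order, with no model call between tools."""
    results = []
    for step in json.loads(plan_json):
        tool = TOOLS[step["tool"]]  # a KeyError means the plan named an unknown tool
        results.append(tool(step["arg"]))
    return results


plan = fake_planner("How much is a widget plus 2 for shipping?")
results = execute_plan(plan)
print(results)
```

The executor never consults the model, which is where the latency saving comes from; the final formatting call would take `results` as its context.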

ReAct / Interleaved Reasoning


[prompt]
    ↓
[thought] → [tool] → [observation]
    ↓
[thought] → [tool] → [observation]
    ↓
... (continue until done)
    ↓
[final response]

Use this when steps genuinely depend on prior results in ways you can't predict. Don't use it as a default — most "agentic" tasks decompose into plan-then-execute or even single-shot if you look at them carefully.

Bounding the Loop

Whatever bounds you put on a ReAct loop (a hard step maximum, a token budget for the whole loop, a wall-clock timeout) should be visible to the model in the prompt, so it can self-regulate. A model that knows it has at most 3 more steps allocates them differently than one that thinks it has unlimited time.
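One way to keep loop bounds both enforced and visible to the model is a small budget object that is charged after every step and rendered into the prompt. The sketch below is an assumption about how that could look; `LoopBudget`, its default limits, and the 80-token step cost are illustrative, not taken from any library.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoopBudget:
    """Tracks step, token, and wall-clock limits for one reasoning loop."""
    max_steps: int = 5
    max_tokens: int = 600
    max_seconds: float = 30.0
    steps_used: int = 0
    tokens_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, tokens: int) -> None:
        """Record one completed step and the tokens it consumed."""
        self.steps_used += 1
        self.tokens_used += tokens

    def exhausted(self) -> bool:
        """True once any one of the three limits has been reached."""
        return (
            self.steps_used >= self.max_steps
            or self.tokens_used >= self.max_tokens
            or time.monotonic() - self.started_at >= self.max_seconds
        )

    def prompt_line(self) -> str:
        """Text to prepend to the next step so the model can self-regulate."""
        steps_left = max(self.max_steps - self.steps_used, 0)
        tokens_left = max(self.max_tokens - self.tokens_used, 0)
        return f"You have {steps_left} steps and {tokens_left} tokens left."


budget = LoopBudget(max_steps=3, max_tokens=200)
while not budget.exhausted():
    print(budget.prompt_line())
    budget.charge(tokens=80)  # hypothetical cost of one thought plus observation
```

When `exhausted()` returns true, the caller would ask the model to summarize whatever it has rather than take another step.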

The Reasoning-Compression Trade-off

Long reasoning traces are expensive to keep in the cache. The natural reflex is to compress them — summarize older reasoning into a few sentences before the next step. This works, but compression is itself a model call, with its own latency and risk of dropping important state.

One way to limit how much there is to compress is to keep working memory and long-term memory separate. Working memory is the active cache for the current task. Long-term memory is a separate store (a vector DB, structured records, or just plain text) that the agent retrieves into context only when relevant. The NPU never tries to hold the user's entire history in attention.

Tool Selection as a Decision, Not a Search

A common waste pattern on NPUs is listing every available tool in every prompt. If your agent has 30 tools, that's likely 1500+ tokens of tool definitions in the cache for every single decision, when most decisions need only one or two tools.

If a tool is overwhelmingly the right choice for a category of input, route to it deterministically rather than asking the model. "Calculate" → calculator; "What time is it in Tokyo?" → time tool. Save the model's attention for ambiguous cases.

Deterministic routing of this kind isn't sophisticated, but it is surprisingly absent from many agent implementations because it requires deliberate engineering rather than relying on the model. On an NPU, it's the difference between a snappy assistant and a slow one.
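Deterministic routing can be as small as a list of patterns checked before any model call. The rules and tool names below are hypothetical examples built from the two cases mentioned above; a real agent would tune them to its own tool set.

```python
import re
from typing import Optional

# Hypothetical routing rules: (pattern, tool name). First match wins.
RULES = [
    (re.compile(r"\bwhat time is it\b", re.IGNORECASE), "time"),
    (re.compile(r"^\s*calculate\b", re.IGNORECASE), "calculator"),
]


def route(request: str) -> Optional[str]:
    """Return a tool name for unambiguous requests, or None to defer to the model."""
    for pattern, tool in RULES:
        if pattern.search(request):
            return tool
    return None


print(route("What time is it in Tokyo?"))
print(route("Calculate 17 * 23"))
print(route("Plan my weekend"))
```

A `None` result is the signal to fall back to model-driven tool selection, so the model is only consulted for requests the rules cannot place.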

The ReAct Latency Budget

Let's price out a 5-step ReAct loop on Intel Core Ultra NPU, anchored on Chapter 1.3's two published benchmarks.

Assumptions: 512-token context per step (the prompt grows as the loop accumulates), 64-token decode per step (the agent's "Thought / Action / Observation" turn). Using Llama 2 7B at MLPerf's 1.09 s TTFT for a 128-token prompt and DeepSeek-Distill-Llama-8B's 163 ms/token decode as the conservative anchors:

Component	Value	Source
TTFT, ~128-token prompt	1.09 s	MLPerf Client v0.6
TTFT extrapolated to 512-token prompt	~4 s	linear-ish
ITL per decode token (8B INT4)	163 ms	OpenVINO Model Hub
Decode 64 tokens	10.4 s	computed
One ReAct iteration	~14–15 s	extrapolated
5 iterations	~70–75 s	extrapolated

On the same SoC's iGPU (12.8 tok/s, ~78 ms/token): one iteration ≈ 7 s, five iterations ≈ 35 s.

A 5-step ReAct agent at this context size on Intel NPU sits in the 60–90 second range: usable for offline summarization, marginal for chat, infeasible for interactive autocomplete. Stretching the loop to 10 steps at least doubles it. ReAct grows the context monotonically with each step, so the loop gets slower over time, not faster: every iteration's prefill takes longer than the last.

These numbers are extrapolations from published single-call benchmarks, not measurements of ReAct loops. We flagged in Chapter 1.3 that Intel and Microsoft have published almost nothing about multi-step agents on NPU. Treat the table as the right order of magnitude, not as a precise SLA.
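The NPU figures can be reproduced with a few lines of arithmetic. The constants are the anchors quoted in this section; the one modeling assumption, flagged in the code, is that TTFT scales linearly with prompt length.

```python
TTFT_128_S = 1.09      # TTFT for a ~128-token prompt (MLPerf Client v0.6 anchor)
ITL_S = 0.163          # per-token decode latency, 8B INT4 (OpenVINO Model Hub anchor)
CONTEXT_TOKENS = 512   # assumed context per step
DECODE_TOKENS = 64     # assumed decode per step
STEPS = 5

# Assumption: TTFT grows linearly with prompt length.
ttft_512 = TTFT_128_S * CONTEXT_TOKENS / 128
decode = ITL_S * DECODE_TOKENS
iteration = ttft_512 + decode
loop = STEPS * iteration

print(f"TTFT at 512 tokens: {ttft_512:.2f} s")
print(f"Decode 64 tokens:   {decode:.2f} s")
print(f"One iteration:      {iteration:.2f} s")
print(f"Five iterations:    {loop:.2f} s")
```

Under these assumptions one iteration comes to roughly 14.8 s and five iterations to roughly 74 s, matching the ~14–15 s and ~70–75 s estimates.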

Why Single-Shot Wins on NPU

The structural reasons single-shot translates to NPU and ReAct doesn't:

Each ReAct step pays full TTFT. The prefill is the compute-bound, MAC-array-heavy phase; on NPU it's relatively fast per-prompt, but you do it N times per loop instead of once. A 5-step ReAct burns 5× the TTFT of an equivalent single-shot.

Context grows monotonically. Step 1's prefill is short. Step 5's prefill includes everything that came before. The TTFT cost rises through the loop. Chunked prefill on NPU helps, but doesn't fix the issue: each chunk costs constant time, and step 5 has more chunks.

Cold-cache pressure increases. The KV state from step 1 has to be valid at step 5 — which works fine within LLMPipeline.start_chat() but means the state-variable allocation must accommodate the full final context. You commit to the worst-case footprint up front.

Greedy-only hurts most here. The NPU's static pipeline offers no beam search. ReAct's "Thought" outputs are exactly the kind of free-form text that benefits from the diversity of a 4-beam search. Greedy ReAct tends to fall into repetitive loops.

The cumulative effect: ReAct on Intel NPU magnifies the very constraints that NPUs are worst at. It's the wrong architecture for the hardware.
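The effect of a growing context can be made concrete with a toy cost model. This is a sketch under stated assumptions: TTFT is linear in prompt length, every step pays a full prefill with no cache reuse, and the 128-token growth per step is a hypothetical figure chosen for illustration.

```python
TTFT_PER_TOKEN_S = 1.09 / 128   # linear TTFT assumption from the 128-token anchor
ITL_S = 0.163                   # per-token decode latency, 8B INT4
DECODE_TOKENS = 64


def loop_seconds(steps: int, first_prompt_tokens: int, growth_per_step: int) -> float:
    """Total latency of a loop whose prompt grows by a fixed amount each step."""
    total = 0.0
    prompt = first_prompt_tokens
    for _ in range(steps):
        total += prompt * TTFT_PER_TOKEN_S   # full prefill, no cache reuse assumed
        total += DECODE_TOKENS * ITL_S       # decode for this step
        prompt += growth_per_step            # thought and observation join the context
    return total


flat = loop_seconds(steps=5, first_prompt_tokens=512, growth_per_step=0)
growing = loop_seconds(steps=5, first_prompt_tokens=512, growth_per_step=128)
print(f"constant 512-token context: {flat:.1f} s")
print(f"context growing by 128/step: {growing:.1f} s")
```

With these inputs the constant-context loop costs about 74 s and the growing one about 85 s, which is why the estimate is quoted as a 60–90 second range instead of a single number.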

What to Do Instead

Prefer single-shot. If your task can be reduced to one prompt and one response, do that. Translation is single-shot. Summarization is single-shot. Tone-rewrite is single-shot. "Explain this code" is single-shot. The cloud-agent culture's enthusiasm for ReAct has obscured how many useful tasks don't actually need a loop.

Use plan-then-execute when you need composition. A planning call decides the structure; deterministic code runs the plan. The planning model needs to produce structured output (JSON, XML), which works fine in single-shot. The execution is fixed-cost, and any individual sub-call can hit its own device — the plan can route one sub-task to NPU, another to iGPU.

Use the cascade pattern for triage. A tiny model on NPU decides whether the request needs the heavy model. The cheap path is sub-second; the expensive path is the budget you'd already pay for a single-shot. Worst-case latency is the heavy-model latency, not the heavy-model latency times the number of ReAct iterations.

When you genuinely need ReAct, run it on iGPU. The 2.1× speedup from Chapter 1.3 turns 75-second NPU ReAct into 35-second iGPU ReAct. Still slow by cloud standards; in budget for offline workflows like document analysis. The NPU's role becomes drafting and triage; the iGPU does the reasoning loop.

Tighten context aggressively. Every byte you can prune from the running prompt is bandwidth you don't pay for at every step. The Phi Silica architecture's N=64 sliding window over context is an aggressive version of this: most of the time you don't need everything in scope.
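Of the options above, the cascade is the easiest to sketch. Everything here is a stand-in: `tiny_triage` uses a crude heuristic where a real system would call a small classifier on the NPU, and `cheap_model` and `heavy_model` are hypothetical callables representing the two paths.

```python
from typing import Callable, List

calls: List[str] = []  # records which path ran, for illustration only


def tiny_triage(request: str) -> bool:
    """Hypothetical stand-in for a small NPU classifier: True if the heavy model is needed."""
    return len(request.split()) > 12 or "explain" in request.lower()


def cheap_model(request: str) -> str:
    calls.append("cheap")
    return f"[cheap] {request}"


def heavy_model(request: str) -> str:
    calls.append("heavy")
    return f"[heavy] {request}"


def cascade(request: str,
            cheap: Callable[[str], str],
            heavy: Callable[[str], str]) -> str:
    """One triage decision, then exactly one model call on the chosen path."""
    return heavy(request) if tiny_triage(request) else cheap(request)


print(cascade("Translate hello", cheap_model, heavy_model))
print(cascade("Explain this code", cheap_model, heavy_model))
```

Each request triggers one triage decision and one model call, so the worst case is bounded by the heavy path and does not multiply by an iteration count.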

Working vs Long-Term Memory

The reasoning loop's state — what the agent remembers across steps — splits into two regimes.

Working memory is what's in the prompt this turn. On NPU it's bounded by MAX_PROMPT_LEN. Generous on chunked-prefill-capable models (up to 8K validated on Lunar Lake); tighter on encoder-decoder seq2seq like M2M-100. Working memory is fast (it's in the model's attention window) and ephemeral (it doesn't persist across sessions).

Long-term memory lives outside the model: in a SQLite database, a vector store, a key-value cache, a local filesystem. It's persistent and unbounded in size, but accessing it costs an explicit retrieval step before the next prompt. For NPU agents, long-term memory needs to be local, which means it's a few milliseconds away and orders of magnitude cheaper than another NPU forward pass.

The pattern that works well on NPU: aggressive working-memory pruning (small context, small TTFT), with retrieval from a local vector store between turns. The vector store is on CPU; the embedding model can be on NPU (which is exactly the kind of single-shot, batch-friendly workload NPU is great at; see Chapter 3.3 for the OpenVINO 2026.1 TextEmbeddingPipeline NPU support). The reasoning model gets short, dense context; the agent stays responsive.
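The split between a small prompt and a local store can be illustrated with the standard library alone. In this sketch an in-memory SQLite table plays the long-term store and a word-overlap score stands in for embedding similarity; both are assumptions made to keep the example self-contained, and the stored notes are invented.

```python
import sqlite3
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Crude stand-in for an embedding: the set of lowercase words."""
    return set(text.lower().split())


# Long-term memory: a local store outside the model (in-memory here for the demo).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE memory (id INTEGER PRIMARY KEY, note TEXT)")
db.executemany("INSERT INTO memory (note) VALUES (?)", [
    ("user prefers formal French translations",),
    ("user's project is written in Rust",),
    ("user asked about Tokyo time zones last week",),
])


def retrieve(query: str, k: int = 1) -> List[str]:
    """Return the k stored notes sharing the most words with the query."""
    q = tokens(query)
    rows = db.execute("SELECT note FROM memory ORDER BY id").fetchall()
    ranked = sorted(rows, key=lambda row: len(q & tokens(row[0])), reverse=True)
    return [note for (note,) in ranked[:k]]


def build_prompt(user_turn: str, k: int = 1) -> str:
    """Working memory: only the retrieved notes plus the current turn."""
    lines = ["Relevant memory:"] + retrieve(user_turn, k) + ["User: " + user_turn]
    return "\n".join(lines)


print(build_prompt("translate this into formal French"))
```

Only the top-ranked note enters the prompt, so the context the reasoning model sees stays short regardless of how large the store grows.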

Where Intel and Microsoft Have Been Quiet

Honest gaps to flag, because this is the section most likely to invite extrapolation:

No Intel-published guidance on multi-step LLM agents on NPU. The Hugging Face × Intel Qwen3-8B Agent blog is the closest analog, and it explicitly runs on iGPU, not NPU.

Phi Silica is documented as single-turn. Microsoft routes it through Click to Do prompt templates with no learned router and no documented multi-step loop. The Windows Developer Blog extends the Phi Silica stack to DeepSeek-R1-Distill (1.5B at ~40 tok/s, 14B at ~8 tok/s on Snapdragon X NPU) — a reasoning model on NPU — but does not describe an agent architecture around it.

No published ReAct-loop measurements on Intel NPU exist. The 60–90 second budget in the table above is extrapolation from single-call benchmarks. If you build a real ReAct agent on NPU, the data points you collect will be original contributions to the public record.

The chapter's recommendation — prefer single-shot, fall back to plan-then-execute, treat ReAct as the iGPU pattern — reflects the absence of evidence for ReAct working well on NPU as much as it reflects the math. When more data appears the calculus might shift. As of May 2026 it hasn't.

What This Section Bought You

You should now understand:

Three reasoning architectures: single-shot (NPU-native), plan-then-execute (decomposable), ReAct (iGPU pattern, not NPU)
A 5-step ReAct loop costs ~70–75 seconds on NPU vs ~35 seconds on iGPU for an 8B INT4 model (extrapolated, not measured)
ReAct magnifies the constraints NPUs are worst at: repeated TTFT, growing context, greedy-only sampling, accumulating KV state
Single-shot tasks are more common than the cloud-agent literature suggests: translation, summarization, tone-rewrite, code explanation all fit
Cascade triage is the NPU-native multi-step pattern: a tiny model decides whether the heavy model needs to run
Working memory (prompt) is bounded by MAX_PROMPT_LEN; long-term memory lives in local stores, with embedding-model retrieval between turns
Intel and Microsoft have published almost nothing on multi-step NPU agents; be honest about the gap when designing for production

Chapter 2 ends here. The reader now has a working mental model of the constraints, the state, and the decision-making patterns. Chapter 3 turns to tools: how does an NPU-bound agent reach the world, what tool designs survive the latency budget, and where does the cloud fit?
