2.3 Reasoning Loops Under Constraint

So far this chapter has been about memory: the cache, the context, what fits and what doesn't. This section is about how decisions get made inside those limits. Cloud agents can afford to think out loud at length; NPU agents have to think efficiently. That changes the architecture of the loop itself.

Chapter 2 closes here. We have a model that fits, weights we can stream, KV state we can manage, and decode at roughly 6–20 tok/s. The question this section answers: given that decode budget, what reasoning architectures actually work? The naive answer (bolt a ReAct loop on top and let the agent think) collides with the latency ceiling in a way worth being specific about.

Three Reasoning Architectures

Three patterns dominate agent design, and they sort cleanly by NPU compatibility:

Single-shot. One prompt, one response. No loop. The agent reads the input, produces the output, done. Translation is the canonical single-shot task: source sentence in, target sentence out. The cost is one prefill plus one decode. Phi Silica's Click to Do affordances are single-shot. This is the NPU-native pattern.

Plan-then-execute. The model produces a plan once, then executes the plan deterministically (often without further model calls, or with a small number of pre-determined model calls).

For a translation assistant: "rewrite this paragraph for a teenage audience and translate to French" decomposes to (1) rewrite via Phi-3.5-mini, (2) translate via M2M-100. The plan is one LLM call; the execution is a fixed pipeline. Two model calls total, predictable latency.

ReAct (Reason + Act). The model alternates between thinking and tool-calling in a loop, with each iteration informed by the last. The hallmark is that the number of iterations is not known in advance.

This is the dominant pattern for cloud agents and the one developers reach for by default. It's also the pattern NPU latency budgets cannot afford.

The Cost of Thinking Out Loud

The dominant agent pattern over the last few years has been some flavor of ReAct: the model alternates "Thought" steps and "Action" steps, narrating its reasoning before each tool call. It's powerful, well-studied, and largely the right idea, but the costs are different on an NPU than in the cloud.

Each turn of a reasoning loop costs you:

• Decode tokens for the thought (often 50–200 tokens per step at INT4)
• A tool call round-trip to the CPU and back
• Prefill of the tool result into the next reasoning step (frequently the dominant cost, since tool outputs are often longer than the thought that produced them)
• Cache growth that brings you closer to eviction

A 5-step ReAct loop with verbose narration can easily run 30 seconds end-to-end on a mobile NPU. The cloud version of the same agent runs in 3. The difference isn't because the NPU is 10x slower at any one step; it's because the loop accumulates many small costs the cloud absorbs invisibly.

    This isn't an argument against reasoning loops. It's an argument for being deliberate about what each step buys you.

    Three Loop Architectures, From Cheap to Expensive

    You have a small set of patterns for reasoning loops, and they trade off latency against capability.

    Single-Shot

    The model receives the prompt and produces a complete response in one generation, with no intermediate tool calls or reasoning steps. Tools, if any, are called in a separate non-reasoning pass beforehand to gather context.

    [gather context with deterministic logic]
        ↓
    [single prompt + context]
        ↓
    [model generates full response]
    

    This is the fastest pattern. Use it when the task fits in one shot: classification, short answers, templated transformations. It's also the right starting point for any agent — if you can do the job in single-shot, the rest of this section is overhead.

    Plan-Then-Execute

    The model first generates a plan (a short sequence of intended tool calls), then a deterministic executor runs the plan and returns results, then the model formats the final response. Reasoning happens twice: once to plan, once to summarize.

    [prompt]
        ↓
    [model generates plan]
        ↓
    [executor runs tools in order — no model in the loop]
        ↓
    [model generates final response from results]
    

    This is significantly cheaper than ReAct because the executor doesn't need to wake the model between tools. The trade-off is reduced adaptivity — the plan can't respond to surprising tool outputs. For workflows with predictable structure (search → retrieve → summarize, lookup → calculate → format), plan-then-execute hits a sweet spot.
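As a rough illustration of the executor stage, here is a minimal sketch in Python. The tool names, the plan schema, and the "$prev" placeholder are all assumptions made up for this example; the point is only that once the plan exists, no model call sits between tools.

```python
import json
from typing import Callable, Dict

# Hypothetical tool registry: names and behaviors are illustrative only.
TOOLS: Dict[str, Callable[[str], str]] = {
    "lookup": lambda arg: {"price_eur": "40"}.get(arg, "unknown"),
    "calculate": lambda arg: str(float(arg) * 1.2),
    "format": lambda arg: f"Total with tax: {float(arg):.2f} EUR",
}

def run_plan(plan_json: str) -> str:
    """Execute a model-produced plan with no model call between steps.

    Each step names a tool and an argument; the literal "$prev" is
    replaced by the previous step's result.
    """
    steps = json.loads(plan_json)
    prev = ""
    for step in steps:
        tool = TOOLS[step["tool"]]  # KeyError means the plan named an unknown tool
        arg = prev if step["arg"] == "$prev" else step["arg"]
        prev = tool(arg)
    return prev

# A plan as the planning call might emit it (hand-written here).
plan = json.dumps([
    {"tool": "lookup", "arg": "price_eur"},
    {"tool": "calculate", "arg": "$prev"},
    {"tool": "format", "arg": "$prev"},
])
result = run_plan(plan)
print(result)
```

A fixed plan like this cannot react to a surprising tool result, which is the adaptivity trade-off described above.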

    ReAct / Interleaved Reasoning

    The model alternates between reasoning and tool calls, deciding each next step based on the result of the previous one. Maximum adaptivity, maximum cost.

    [prompt]
        ↓
    [thought] → [tool] → [observation]
        ↓
    [thought] → [tool] → [observation]
        ↓
    ... (continue until done)
        ↓
    [final response]
    

    Use this when steps genuinely depend on prior results in ways you can't predict. Don't use it as a default — most "agentic" tasks decompose into plan-then-execute or even single-shot if you look at them carefully.

    Bounding the Loop

    When you do need ReAct-style reasoning, the practical question becomes: how do you stop the loop before it runs forever?

    The naive bound is a step count, but step count alone is a blunt instrument. Better bounds combine several signals:

      • Step count with a hard maximum (typically 5–10 on an NPU agent)
      • Token budget for the entire loop, summed across thoughts and observations
      • Latency budget with wall-clock timeout, after which the model is asked to summarize whatever it has
      • Confidence signal from the model itself ("I have enough information to answer now")
      • Tool-call repetition detector: if the model calls the same tool with the same arguments twice, it's stuck

      These bounds should be visible to the model in the prompt, so it can self-regulate. A model that knows it has at most 3 more steps allocates them differently than one that thinks it has unlimited time.
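One way these bounds might be combined is sketched below. The class name, default limits, and stop-reason strings are assumptions for illustration, and the model's own confidence signal is left out because it depends on how a given agent parses model output.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LoopBudget:
    """Tracks step, token, and wall-clock limits plus repeated tool calls."""
    max_steps: int = 5
    max_tokens: int = 1500
    max_seconds: float = 20.0
    steps: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)
    seen_calls: set = field(default_factory=set)

    def record(self, tool: str, args: str, tokens_used: int) -> Optional[str]:
        """Record one step; return a stop reason, or None to continue."""
        self.steps += 1
        self.tokens += tokens_used
        call = (tool, args)
        if call in self.seen_calls:
            return "repeated tool call"
        self.seen_calls.add(call)
        if self.steps >= self.max_steps:
            return "step limit"
        if self.tokens >= self.max_tokens:
            return "token budget"
        if time.monotonic() - self.started >= self.max_seconds:
            return "latency budget"
        return None

    def prompt_hint(self) -> str:
        """Line to place in the prompt so the model can self-regulate."""
        return (f"You have {self.max_steps - self.steps} steps and "
                f"{self.max_tokens - self.tokens} tokens remaining.")
```

In use, the agent loop would call `record(...)` after every tool call, stop and ask for a summary when it returns a reason, and otherwise append `prompt_hint()` to the next reasoning prompt.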

      The Reasoning-Compression Trade-off

      Long reasoning traces are expensive to keep in the cache. The natural reflex is to compress them — summarize older reasoning into a few sentences before the next step. This works, but compression is itself a model call, with its own latency and risk of dropping important state.

      The pragmatic patterns:

      Don't compress within a turn. Within a single user interaction, keep the reasoning trace verbatim. Compression overhead usually exceeds savings.

      Do compress between turns. When a user's task completes and a new one begins, summarize the previous task into a compact memory entry and evict the verbose trace. The summary becomes part of long-term memory; the original tokens leave the cache.

      Separate working memory from long-term memory. Working memory is the active cache for the current task. Long-term memory is a separate store — vector DB, structured records, or just plain text — that the agent retrieves into context only when relevant. The NPU never tries to hold the user's entire history in attention.

      This separation maps cleanly onto how humans operate: you don't hold every conversation you've ever had in active recall, you store summaries and retrieve them on demand.

      Tool Selection as a Decision, Not a Search

      A common waste pattern on NPUs is listing every available tool in every prompt. If your agent has 30 tools, that's likely 1500+ tokens of tool definitions in the cache for every single decision, when most decisions need only one or two tools.

      Better patterns:

      Pre-filter tools to the relevant subset. Use a small classifier or simple keyword matching to narrow 30 tools to 3–5 before sending to the model. The model never sees tools it shouldn't be considering.

      Hierarchical tool catalogs. Group tools into categories. The model first picks a category (with brief descriptions of ~5 categories), then sees the tools in that category. Two cheap decisions instead of one expensive one.

      Implicit defaults. If a tool is overwhelmingly the right choice for a category of input, route to it deterministically rather than asking the model. "Calculate" → calculator; "What time is it in Tokyo?" → time tool. Save the model's attention for ambiguous cases.

      These patterns aren't sophisticated, but they're surprisingly absent from many agent implementations because they require deliberate engineering rather than relying on the model. On an NPU, they're the difference between a snappy assistant and a slow one.
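The pre-filter and implicit-default patterns can be sketched in a few lines. The tool catalog, keyword sets, and routing patterns below are hypothetical; a real agent would tune them to its own tools, or swap the keyword match for a small classifier.

```python
import re

# Hypothetical catalog: tool name -> keywords that suggest it.
TOOL_KEYWORDS = {
    "calculator": {"calculate", "sum", "multiply", "percent"},
    "time": {"time", "timezone", "clock"},
    "weather": {"weather", "forecast", "rain"},
    "calendar": {"meeting", "schedule", "calendar"},
}

# Inputs matching these patterns are routed without asking the model.
DIRECT_ROUTES = [
    (re.compile(r"^\s*calculate\b", re.I), "calculator"),
    (re.compile(r"\bwhat time is it\b", re.I), "time"),
]

def select_tools(user_text: str, max_tools: int = 3):
    """Return (direct_tool, candidates).

    direct_tool is set when a deterministic route applies; otherwise
    candidates holds the short list of tools to show the model.
    """
    for pattern, tool in DIRECT_ROUTES:
        if pattern.search(user_text):
            return tool, []
    words = set(re.findall(r"[a-z]+", user_text.lower()))
    scored = [(len(words & kws), name) for name, kws in TOOL_KEYWORDS.items()]
    candidates = [name for score, name in sorted(scored, reverse=True) if score > 0]
    return None, candidates[:max_tools]

print(select_tools("What time is it in Tokyo?"))
print(select_tools("Will it rain during my meeting?"))
```

Only the definitions of the returned candidates would then be serialized into the prompt, rather than the whole catalog.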

      A Worked Example: Reasoning Budget for a Voice Assistant

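To make this concrete, here's a budget for a hypothetical NPU voice assistant targeting <2 second response time:

| Component | Budget |
| --- | --- |
| ASR (speech to text) | 300 ms |
| Intent classification (tiny model) | 50 ms |
| Tool selection + pre-filter | 50 ms |
| Main model prefill (with prefix cache) | 200 ms |
| Main model decode (~30 tokens) | 600 ms |
| Tool execution (if needed) | 200 ms |
| TTS (text to speech) | 400 ms |
| Orchestration overhead | 200 ms |
| Total | 2000 ms |

That budget allows essentially zero room for a multi-step reasoning loop. Voice assistants on NPUs are necessarily single-shot or plan-then-execute. ReAct loops add a full second per step and break the conversational rhythm users expect.

The lesson generalizes: your latency budget dictates your loop architecture. Pick the architecture from the budget, not the other way around.

The ReAct Latency Budget

Let's price out a 5-step ReAct loop on Intel Core Ultra NPU, anchored on Chapter 1.3's two published benchmarks.

Assumptions: 512-token context per step (prompt grows as the loop accumulates), 64-token decode per step (the agent's "Thought / Action / Observation" turn). Using Llama 2 7B at MLPerf's TTFT of 1.09 s on a ~128-token prompt and DeepSeek-Distill-Llama-8B's 163 ms/token decode as the conservative anchors:

| Component | Value | Source |
| --- | --- | --- |
| TTFT, ~128-token prompt | 1.09 s | MLPerf Client v0.6 |
| TTFT extrapolated to 512-token prompt | ~4 s | linear-ish |
| ITL per decode token (8B INT4) | 163 ms | OpenVINO Model Hub |
| Decode 64 tokens | 10.4 s | computed |
| One ReAct iteration | ~14–15 s | extrapolated |
| 5 iterations | ~70–75 s | extrapolated |

On the same SoC's iGPU (12.8 tok/s, ~78 ms/token): one iteration ≈ 7 s, five iterations ≈ 35 s.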

      A 5-step ReAct agent at this context size on Intel NPU sits in the 60–90 second range — usable for offline summarization, marginal for chat, infeasible for interactive autocomplete. Stretching the loop to 10 steps doubles it. ReAct's behavior of growing the context monotonically with each step makes it worse over time, not better, because every iteration's prefill takes longer than the last.

      These numbers are extrapolations from published single-call benchmarks, not measurements of ReAct loops. We flagged in Chapter 1.3 that Intel and Microsoft have published almost nothing about multi-step agents on NPU. Treat the table as the right order of magnitude, not as a precise SLA.
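The arithmetic behind the NPU estimate is short enough to write out. This sketch assumes TTFT scales strictly linearly with prompt length from the 128-token anchor, which gives about 4.4 s at 512 tokens where the section rounds to ~4 s; the function names are made up for illustration.

```python
# Anchors quoted in this section.
TTFT_128_S = 1.09  # time to first token at a ~128-token prompt
ITL_S = 0.163      # per-token decode latency, 8B INT4

def ttft_linear(prompt_tokens: int) -> float:
    """Scale TTFT linearly from the 128-token anchor (an assumption)."""
    return TTFT_128_S * prompt_tokens / 128

def react_loop_seconds(steps: int, prompt_tokens: int = 512,
                       decode_tokens: int = 64) -> float:
    """Each step pays one full prefill plus one decode of fixed length."""
    per_iteration = ttft_linear(prompt_tokens) + decode_tokens * ITL_S
    return steps * per_iteration

one = react_loop_seconds(1)
five = react_loop_seconds(5)
print(f"one iteration: {one:.1f} s, five iterations: {five:.1f} s")
```

Like the figures it reproduces, this is an extrapolation from single-call anchors, not a measurement of a running loop.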

      Why Single-Shot Wins on NPU

      The structural reasons single-shot translates to NPU and ReAct doesn't:

      Each ReAct step pays full TTFT. The prefill is the compute-bound, MAC-array-heavy phase; on NPU it's relatively fast per-prompt, but you do it N times per loop instead of once. A 5-step ReAct burns 5× the TTFT of an equivalent single-shot.

      Context grows monotonically. Step 1's prefill is short. Step 5's prefill includes everything that came before. The TTFT cost rises through the loop. Chunked prefill on NPU helps, but doesn't fix the issue: each chunk costs constant time, and step 5 has more chunks.

      Cold-cache pressure increases. The KV state from step 1 has to be valid at step 5 — which works fine within LLMPipeline.start_chat() but means the state-variable allocation must accommodate the full final context. You commit to the worst-case footprint up front.

      Greedy-only hurts most here. On NPU's static pipeline, no beam search. ReAct's "Thought" outputs are exactly the kind of free-form text that benefits from beam-4 sampling diversity. Greedy ReAct tends to fall into repetitive loops.

      The cumulative effect: ReAct on Intel NPU magnifies the very constraints that NPUs are worst at. It's the wrong architecture for the hardware.
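To see how a growing context compounds, the sketch below models chunked prefill with a constant cost per chunk. Every number in it is an assumption: the chunk size, the reuse of the 1.09 s anchor as a per-chunk cost, the 256-token starting prompt, and the 128 tokens added per step. It also assumes each step re-prefills the whole prompt.

```python
import math
from typing import List

CHUNK_TOKENS = 128        # assumed prefill chunk size
SECONDS_PER_CHUNK = 1.09  # assumed: the 128-token TTFT anchor reused per chunk

def prefill_seconds(context_tokens: int) -> float:
    """Constant time per chunk, so cost grows with the number of chunks."""
    return math.ceil(context_tokens / CHUNK_TOKENS) * SECONDS_PER_CHUNK

def per_step_prefill(steps: int, base_tokens: int, growth_per_step: int) -> List[float]:
    """Prefill cost at each step when the prompt grows by a fixed amount."""
    return [prefill_seconds(base_tokens + k * growth_per_step) for k in range(steps)]

costs = per_step_prefill(steps=5, base_tokens=256, growth_per_step=128)
for step, cost in enumerate(costs, start=1):
    print(f"step {step}: prefill about {cost:.2f} s")
print(f"total prefill: about {sum(costs):.1f} s")
```

Under these assumptions the last step's prefill costs three times the first, which is the monotonic growth described above.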

      What to Do Instead

      Prefer single-shot. If your task can be reduced to one prompt and one response, do that. Translation is single-shot. Summarization is single-shot. Tone-rewrite is single-shot. "Explain this code" is single-shot. The cloud-agent culture's enthusiasm for ReAct has obscured how many useful tasks don't actually need a loop.

      Use plan-then-execute when you need composition. A planning call decides the structure; deterministic code runs the plan. The planning model needs to produce structured output (JSON, XML), which works fine in single-shot. The execution is fixed-cost, and any individual sub-call can hit its own device — the plan can route one sub-task to NPU, another to iGPU.

      Use the cascade pattern for triage. A tiny model on NPU decides whether the request needs the heavy model. The cheap path is sub-second; the expensive path is the budget you'd already pay for single-shot. Worst-case latency is the heavy-model latency, not the heavy-model latency times the number of ReAct iterations.

      When you genuinely need ReAct, run it on iGPU. The 2.1× speedup from Chapter 1.3 turns 75-second NPU ReAct into 35-second iGPU ReAct. Still slow by cloud standards; in budget for offline workflows like document analysis. The NPU's role becomes drafting and triage; the iGPU does the reasoning loop.

      Tighten context aggressively. Every byte you can prune from the running prompt is bandwidth you don't pay for at every step. The Phi Silica architecture's N=64 sliding window over context is an aggressive version of this: most of the time you don't need everything in scope.

      Working vs Long-Term Memory

      The reasoning loop's state — what the agent remembers across steps — splits into two regimes.

      Working memory is what's in the prompt this turn. On NPU it's bounded by MAX_PROMPT_LEN. Generous on chunked-prefill-capable models (up to 8K validated on Lunar Lake); tighter on encoder-decoder seq2seq like M2M-100. Working memory is fast (it's in the model's attention window) and ephemeral (it doesn't persist across sessions).

      Long-term memory lives outside the model: in a SQLite database, a vector store, a key-value cache, a local filesystem. It's persistent and unbounded in size, but accessing it costs an explicit retrieval step before the next prompt. For NPU agents, long-term memory needs to be local, which means it's a few milliseconds away and orders of magnitude cheaper than another NPU forward pass.

      The pattern that works well on NPU: aggressive working-memory pruning (small context, small TTFT), with retrieval from a local vector store between turns. The vector store is on CPU; the embedding model can be on NPU (which is exactly the kind of single-shot, batch-friendly workload NPU is great at; see Chapter 3.3 for the OpenVINO 2026.1 TextEmbeddingPipeline NPU support). The reasoning model gets short, dense context; the agent stays responsive.
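A minimal sketch of the split, using Python's standard sqlite3 module as the local long-term store. The class and function names are hypothetical. A substring match stands in for embedding similarity, and a character cap stands in for a token limit such as MAX_PROMPT_LEN; both are simplifications.

```python
import sqlite3
from typing import List

class LongTermMemory:
    """Local, persistent store; retrieval is an explicit step before the prompt."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, text TEXT)"
        )

    def remember(self, text: str) -> None:
        self.db.execute("INSERT INTO notes (text) VALUES (?)", (text,))
        self.db.commit()

    def retrieve(self, keyword: str, limit: int = 2) -> List[str]:
        # Substring match stands in for vector similarity in this sketch.
        rows = self.db.execute(
            "SELECT text FROM notes WHERE text LIKE ? ORDER BY id DESC LIMIT ?",
            (f"%{keyword}%", limit),
        )
        return [row[0] for row in rows]

def build_prompt(user_text: str, recalled: List[str], max_chars: int = 400) -> str:
    """Working memory: recalled notes plus the current request, hard-capped.

    Slicing from the end keeps the user's request if the cap is hit.
    """
    context = "\n".join(recalled)
    return f"{context}\n\nUser: {user_text}"[-max_chars:]

memory = LongTermMemory()
memory.remember("User prefers French translations.")
memory.remember("User's manager is named Dana.")
prompt = build_prompt("Translate this memo.", memory.retrieve("French"))
print(prompt)
```

Only the retrieved notes enter the prompt; everything else stays in the database until a later turn asks for it.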

      Where Intel and Microsoft Have Been Quiet

      Honest gaps to flag, because this is the section most likely to invite extrapolation:

      No Intel-published guidance on multi-step LLM agents on NPU. The Hugging Face × Intel Qwen3-8B Agent blog is the closest analog, and it explicitly runs on iGPU, not NPU.

      Phi Silica is documented as single-turn. Microsoft routes it through Click to Do prompt templates with no learned router and no documented multi-step loop. The Windows Developer Blog extends the Phi Silica stack to DeepSeek-R1-Distill (1.5B at ~40 tok/s, 14B at ~8 tok/s on Snapdragon X NPU) — a reasoning model on NPU — but does not describe an agent architecture around it.

      No published ReAct-loop measurements on Intel NPU exist. The 60–90 second budget in the table above is extrapolation from single-call benchmarks. If you build a real ReAct agent on NPU, the data points you collect will be original contributions to the public record.

      The chapter's recommendation — prefer single-shot, fall back to plan-then-execute, treat ReAct as the iGPU pattern — reflects the absence of evidence for ReAct working well on NPU as much as it reflects the math. When more data appears the calculus might shift. As of May 2026 it hasn't.

What This Section Bought You

You should now understand:

• Three reasoning architectures: single-shot (NPU-native), plan-then-execute (decomposable), ReAct (iGPU pattern, not NPU)
• A 5-step ReAct loop costs ~70–75 seconds on NPU vs ~35 seconds on iGPU for an 8B INT4 model (extrapolated, not measured)
• ReAct magnifies the constraints NPUs are worst at: repeated TTFT, growing context, greedy-only sampling, accumulating KV state
• Single-shot tasks are more common than the cloud-agent literature suggests: translation, summarization, tone-rewrite, code explanation all fit
• Cascade triage is the NPU-native multi-step pattern: a tiny model decides whether the heavy model needs to run
• Working memory (prompt) is bounded by MAX_PROMPT_LEN; long-term memory lives in local stores, with embedding-model retrieval between turns
• Intel and Microsoft have published almost nothing on multi-step NPU agents; be honest about the gap when designing for production

Closing Chapter 2

You came into this chapter with weights, operators, and TOPS. You leave it with a coherent picture of how an agent actually operates within an NPU's limits:

• Context length translates directly into memory cost via the KV cache, often exceeding the model weights themselves
• Cache reuse, within sessions and across them, is the highest-leverage latency optimization available
• Reasoning loops have a real per-step cost that compounds quickly on NPUs and forces architectural restraint
• Working memory and long-term memory should be separated, with the NPU holding only what's active and retrieving the rest on demand
• Tool selection is a decision problem in its own right, not something to delegate to a model staring at 30 options at once

Chapter 2 ends here. The reader now has a working mental model of the constraints, the state, and the decision-making patterns. Chapter 3 turns to the other side of that last point, tools: how to design the tools themselves, where they should run, and how to integrate them efficiently with an NPU-bound reasoning core. How does an NPU-bound agent reach the world, what tool designs survive the latency budget, and where does the cloud fit?

