interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification
Pith reviewed 2026-05-16 07:24 UTC · model grok-4.3
The pith
Interwhen monitors reasoning traces in real time and steers models by verifying intermediate states against synthesized policy rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Interwhen is a plug-and-play system that periodically polls the reasoning trace, forks the model to obtain verifiable intermediate states, runs code-based verifiers asynchronously, and corrects the trajectory only when a violation occurs. Verifiers are synthesized directly from policy documents, including provably correct ones in Lean and z3. On policy-constrained reasoning benchmarks this yields near-perfect accuracy with a fraction of baseline tokens; on agentic benchmarks it raises task completion for small models, such as lifting Qwen3-30B from 32 percent to 87 percent on the telecom domain of tau2-bench.
What carries the argument
The monitoring system that polls the trace and forks inference to recover intermediate states, paired with automatic synthesis of verifiers from natural-language policies.
If this is right
- Near-perfect accuracy on mathematical and logical reasoning benchmarks while consuming far fewer tokens than existing methods.
- Large gains in task completion for small language models on agentic benchmarks without any fine-tuning.
- Plug-and-play applicability to any reasoning model because no model changes or training are required.
- Negligible overhead on correct executions because verifiers run asynchronously and only intervene on violations.
Where Pith is reading between the lines
- The approach could support runtime policy enforcement in regulated agent deployments where post-hoc checks are too late.
- Extending verifier synthesis to bootstrap policies from few examples might reduce reliance on detailed natural-language documents.
- Applying the same polling-and-fork mechanism to non-policy tasks such as code generation safety checks is a direct next test.
Load-bearing premise
The system can reliably recover accurate intermediate states by forking and that the synthesized verifiers are both correct and complete enough to catch relevant policy violations.
What would settle it
A concrete case where the forking step produces an incorrect intermediate state or the synthesized verifier misses a clear policy violation, causing the model to continue with an erroneous trajectory.
Figures
read the original abstract
Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers are run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent. On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy for reasoning models using a fraction of the tokens of baselines. On agentic benchmarks with policy-based verifier generation, it enables improvements in task quality for SLMs without any finetuning, e.g., task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain in tau2-bench. Code at https://github.com/microsoft/interwhen.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces interwhen, a single-trajectory test-time verification framework for reasoning models. It features a monitoring system that polls reasoning traces and forks inference to obtain intermediate states for verification, along with automatic synthesis of code-based verifiers (including in Lean and z3) from natural-language policy documents. The framework is claimed to achieve near-perfect accuracy on reasoning benchmarks with mathematical/logical constraints using fewer tokens than baselines, and to boost task completion rates on agentic benchmarks, such as increasing Qwen3-30B's performance from 32% to 87% on the telecom domain of tau2-bench without finetuning.
Significance. If the empirical results hold, this work offers a practical, generalizable approach to improving the reliability of reasoning agents at test time without requiring model modifications or extensive finetuning. The provision of open-source code and the ability to generate provably correct verifiers for some cases are notable strengths that could facilitate adoption in safety-critical applications involving policy compliance.
major comments (3)
- Abstract: the claims of near-perfect accuracy on reasoning benchmarks and the specific 32% to 87% jump on tau2-bench telecom lack any mention of baselines, statistical tests, ablation results, or experimental details, preventing assessment of whether the gains are attributable to the framework.
- §3 (Verifier Synthesis): automatic synthesis from natural-language policies is presented as producing correct and complete verifiers, but no completeness argument, coverage metric, or soundness proof is given for the general case (beyond Lean/z3 examples); this is load-bearing because incomplete verifiers would mean the monitoring loop never intervenes on missed violations, undermining the steering claims.
- §4 (Monitoring and Forking): the mechanism for periodically polling traces and forking inference to recover intermediate states is described at a high level with no formal specification or error analysis; without evidence that it reliably extracts accurate states from arbitrary traces, the single-trajectory intervention cannot be shown to work as claimed.
minor comments (2)
- The GitHub link is provided but the manuscript does not specify which components (monitoring code, synthesis prompts, benchmark scripts) are released, reducing reproducibility.
- Figure captions for the system overview could include more detail on the asynchronous verifier execution to clarify overhead claims.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and rigor.
read point-by-point responses
-
Referee: Abstract: the claims of near-perfect accuracy on reasoning benchmarks and the specific 32% to 87% jump on tau2-bench telecom lack any mention of baselines, statistical tests, ablation results, or experimental details, preventing assessment of whether the gains are attributable to the framework.
Authors: We agree the abstract should contextualize the results. In revision we will expand it to name the primary baselines (vanilla chain-of-thought and branch-and-verify), note that gains are averaged over multiple seeds with reported standard deviations, and cross-reference the ablation studies in §5. This keeps the abstract concise while enabling immediate evaluation of the claims. revision: yes
-
Referee: §3 (Verifier Synthesis): automatic synthesis from natural-language policies is presented as producing correct and complete verifiers, but no completeness argument, coverage metric, or soundness proof is given for the general case (beyond Lean/z3 examples); this is load-bearing because incomplete verifiers would mean the monitoring loop never intervenes on missed violations, undermining the steering claims.
Authors: We acknowledge that a general soundness or completeness proof is not provided, as natural-language policies can be underspecified and synthesis depends on LLM translation. We will add a limitations subsection with empirical coverage metrics (fraction of policy clauses successfully encoded and validated via unit tests) and describe our post-synthesis verification procedure. Formal guarantees remain limited to the Lean/z3 cases; the revision will make this explicit. revision: partial
-
Referee: §4 (Monitoring and Forking): the mechanism for periodically polling traces and forking inference to recover intermediate states is described at a high level with no formal specification or error analysis; without evidence that it reliably extracts accurate states from arbitrary traces, the single-trajectory intervention cannot be shown to work as claimed.
Authors: We will insert pseudocode for the polling-and-fork algorithm and add an error-analysis subsection reporting measured state-recovery accuracy (percentage of correctly reconstructed intermediate states) across models and tasks. These empirical results, drawn from our existing logs, will quantify reliability and support the single-trajectory claim. revision: yes
Circularity Check
No significant circularity; empirical engineering framework
full rationale
The paper presents an engineering system for single-trajectory test-time verification via a polling/forking monitor and automatic code-based verifier synthesis from natural-language policies. All central claims (near-perfect accuracy on reasoning benchmarks, task-completion gains on tau2-bench) are supported solely by direct empirical measurement against external benchmarks rather than any derivation, equation, or fitted parameter that reduces to the framework's own definitions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the results; the system is described as plug-and-play and evaluated on independent task metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Natural language policy documents can be automatically synthesized into correct code-based verifiers, including provably correct ones in Lean and z3.
invented entities (1)
-
interwhen monitoring and synthesis system
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, IndisputableMonolith/Foundation/ArithmeticFromLogic.leanreality_from_one_distinction, absolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
we propose a monitoring system that periodically polls the reasoning trace and forks inference... Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and z3.
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
On reasoning benchmarks where policies encode mathematical or logical constraints, interwhen achieves near-perfect accuracy... task completion rate of Qwen3-30B jumps from 32% to 87% on the telecom domain
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Early Stopping Chain-of-thoughts in Large Language Models
URLhttps://openreview.net/forum?id=v8L0pN6EOi. B. Y. Lin, R. Le Bras, K. Richardson, A. Sabharwal, R. Poovendran, P. Clark, and Y. Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu, editors,Proceedings of the 42nd International Confere...
work page internal anchor Pith review arXiv 2025
-
[2]
- A set of features (e.g., color, name, pet, book genre)
Features and Domains - A fixed number of houses, indexed sequentially (e.g., House 1, House 2, ...) from left to right. - A set of features (e.g., color, name, pet, book genre). - Each feature has a finite domain of possible values
-
[6]
Use any feedback to guide your reasoning until a complete solution is reached
You may receive feedback from the user if anything is wrong. Use any feedback to guide your reasoning until a complete solution is reached
-
[7]
Do not stop responding until you’ve assigned each and every variable. # Final Answer Reporting Format ‘‘‘json { "House 1": { "feature1": "value1", "feature2": "value2", ... }, "House 2": { "feature1": "value1", "feature2": "value2", ... }, ... } ‘‘‘ Make sure to use the exact feature/value names as given in the problem description. Make sure the JSON is v...
-
[8]
- A set of features (e.g., color, name, pet, book genre)
Features and Domains 25 - A fixed number of houses, indexed sequentially (e.g., House 1, House 2, ...) from left to right. - A set of features (e.g., color, name, pet, book genre). - Each feature has a finite domain of possible values
-
[9]
- No two houses share the same value for the same feature
Constraint: - Each house has exactly one value per feature. - No two houses share the same value for the same feature
- [10]
-
[11]
Reason about the problem in text
-
[12]
After every inference, no matter how minor or tentative, immediately report the updated partial assignments. - Always output partial assignments frequently as the reasoning progresses, not only at major steps or when confident. - If an inference adds, removes, or narrows even a single possibility, report it. # House/Feature Partial Assignment Reporting Fo...
work page 2025
-
[13]
**REPETITION**: If 3 steps contain only 1 unique idea→score based on that 1 idea only
-
[14]
**WRONG OPERATORS**: ‘^‘ for XOR, Python’s ‘max()‘, etc.→unlikely (3) at best
-
[15]
**IMPERATIVE PATTERNS**: mutable state, imperative iteration in pure function bodies→ unlikely (3) at best
- [16]
-
[17]
What NEW concrete information does each step add? (or note repetition)
-
[18]
Are the Lean 4 constructs mentioned correct?
-
[19]
Does the reasoning lead toward compilable code or toward errors? A.2.6 Verina-Spec Despite multiple attempts to induce diversity in the scores provided using the value prompt, we notice that the critic collapses almost entirely to a single score, with 1–2 other scores appearing infrequently. Evaluate a Lean 4 specification-generation reasoning trajectory....
-
[20]
**ENGLISH PROSE**: "The X must be Y"→unlikely (3) unless followed by Lean syntax
-
[21]
**MADE-UP FUNCTIONS**: Anything not in Lean stdlib without defining it→unlikely (3)
-
[22]
**PYTHON SYNTAX**: ‘==‘, ‘len()‘, ‘max()‘, ‘.count()‘ without module→unlikely (3)
-
[23]
**REPETITION**: Same property phrased multiple ways→score based on unique content only
-
[24]
**NO SOUNDNESS/COMPLETENESS**: Specs without reasoning about what they accept/reject→ likely (7) max ## YOUR EVALUATION Respond with "Confidence: <level>" followed by:
-
[25]
Does the reasoning produce actual Lean Prop syntax or just English?
-
[26]
Are the Lean constructs mentioned real (List.count,∧,∀) or made-up?
-
[27]
Is there soundness/completeness reasoning?
-
[28]
Is there repetition? A.3 Meta Prompts In this section, we provide the exact meta prompts used in our ablation experiments. A.3.1 System Prompt for Maze <|im_start|>system You are a maze-solving AI. Given a maze in ASCII format, analyze it step by step. COORDINATE SYSTEM: - Rows are numbered from top (row 0) to bottom - Columns are numbered from left (col ...
work page 2041
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.