R2V Agent: Teaching SLMs When to Ask for Help
Pith reviewed 2026-05-20 20:20 UTC · model grok-4.3
The pith
A calibrated router lets small language models run interactive agents and escalates to large models only on steps where failure is likely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
R2V-Agent is a risk-calibrated SLM-LLM routing framework for interactive agents. It first trains a stable small language model policy through behavioral cloning on teacher trajectories followed by verifier-guided direct preference optimization. A lightweight process verifier then scores candidate actions at each step, and a step-level router is trained on the fixed policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective. This produces escalation decisions that improve the reliability-cost frontier across HumanEval+, TextWorld, and TerminalBench.
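The two quantities named above can be made concrete with a small sketch. It assumes binary step-failure labels and per-seed failure rates as the losses; the function names, the toy numbers, and the penalty form in `penalized_objective` are illustrative assumptions, not the paper's actual objective.

```python
import math
from typing import Sequence


def brier_score(probs: Sequence[float], failed: Sequence[int]) -> float:
    """Mean squared gap between predicted failure probability and the 0/1 outcome."""
    if not probs or len(probs) != len(failed):
        raise ValueError("probs and failed must be non-empty and equally long")
    return sum((p - y) ** 2 for p, y in zip(probs, failed)) / len(probs)


def empirical_cvar(losses: Sequence[float], alpha: float) -> float:
    """Average of the worst (1 - alpha) fraction of losses (at least one loss)."""
    if not losses or not 0.0 <= alpha < 1.0:
        raise ValueError("need non-empty losses and 0 <= alpha < 1")
    # round() guards against float noise (e.g. 2.0000000001) before ceil.
    k = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    worst = sorted(losses, reverse=True)[:k]
    return sum(worst) / k


def penalized_objective(brier: float, tail_risk: float, budget: float, lam: float) -> float:
    """Hypothetical penalty form: Brier loss plus a charge for exceeding a CVaR budget."""
    return brier + lam * max(0.0, tail_risk - budget)


# Made-up per-seed failure rates for one routing policy.
seed_failure_rates = [0.05, 0.10, 0.02, 0.40, 0.08]
print(empirical_cvar(seed_failure_rates, alpha=0.8))  # averages only the worst seed
print(brier_score([0.9, 0.1, 0.8], [1, 0, 0]))
```

The point of the tail average is that a router with a good mean failure rate can still be penalized if one perturbation seed goes badly, which the plain mean would hide.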
What carries the argument
The calibrated step-level router, which estimates the fixed small policy's residual failure risk at each step and escalates to the teacher LLM only when that risk warrants intervention. Its risk estimates are trained with a Brier calibration loss and a CVaR constraint.
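A single routing step might look like the sketch below. The threshold rule, the callables, and the toy verifier are all hypothetical stand-ins; the source only states that the verifier scores candidate actions and the router decides whether to escalate.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StepDecision:
    action: str
    escalated: bool
    risk: float


def route_step(
    candidates: Sequence[str],
    verifier_score: Callable[[str], float],
    estimate_risk: Callable[[Sequence[float]], float],
    ask_teacher: Callable[[], str],
    threshold: float = 0.5,  # assumed decision rule, not taken from the paper
) -> StepDecision:
    """Pick the best-scored SLM candidate unless estimated risk exceeds the threshold."""
    scores = [verifier_score(a) for a in candidates]
    risk = estimate_risk(scores)
    if risk > threshold:
        # The teacher is queried only on this branch, so low-risk steps stay cheap.
        return StepDecision(ask_teacher(), True, risk)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return StepDecision(candidates[best], False, risk)


# Toy stand-ins: a lookup-table verifier and a crude risk proxy.
toy_scores = {"ls": 0.9, "cat notes.txt": 0.3, "rm -rf build": 0.2}
decision = route_step(
    candidates=["ls", "cat notes.txt"],
    verifier_score=toy_scores.__getitem__,
    estimate_risk=lambda s: 1.0 - max(s),
    ask_teacher=lambda: "teacher-action",
)
print(decision)
```

In a real system `estimate_risk` would be the trained router applied to a feature vector rather than a function of the verifier scores alone.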
Load-bearing premise
The lightweight process verifier can accurately score how likely the small model is to fail on candidate actions so the router produces reliable escalation decisions that generalize beyond the training perturbations.
What would settle it
Test the complete R2V system on a new interactive task whose failure modes differ from those generated by the perturbation seeds used to train the router, and check whether success rates and escalation fractions remain close to the reported values.
Original abstract
Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory - after flaky tool calls, truncated observations, or compounding local errors - making pre-execution routing brittle. We introduce R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM-only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic-router cost.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. After training a stable SLM policy via behavioral cloning followed by verifier-guided DPO with consistency regularization, a lightweight process verifier scores candidate actions and a step-level router is trained on the fixed SLM policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective over perturbation seeds. The central claim is that this produces reliable escalation decisions, yielding improved reliability-cost frontiers: 94.3% success on HumanEval+ at 0.60% LLM escalation, recovery of TextWorld to 98.2% success at 41.7% escalation, and 93.3% success on TerminalBench at 33.9% LLM calls (roughly half heuristic-router cost) across four SLM backbones.
Significance. If the empirical results and generalization hold, the work provides a concrete mechanism for dynamic, mid-trajectory routing that addresses limitations of static query-level cascades. The combination of a frozen SLM policy, process verifier, and CVaR-regularized router training offers a reproducible template for cost-efficient agent deployment; the reported quantitative gains on three distinct benchmarks constitute a falsifiable prediction that can be directly tested by other groups.
major comments (1)
- [Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.
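One way to probe this concern is to split logged deployment steps by whether an escalation has already occurred earlier in the same trajectory, then compare calibration across the two groups. The sketch below assumes a hypothetical log format and a per-step failure label; neither is specified in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (trajectory_id, step_index, escalated_this_step, predicted_risk, step_failed)
Step = Tuple[str, int, bool, float, int]


def brier_by_history(steps: Iterable[Step]) -> Dict[str, Optional[float]]:
    """Brier score split by whether an escalation happened earlier in the trajectory."""
    seen_escalation = defaultdict(bool)
    totals = {"slm_only": [0.0, 0], "post_escalation": [0.0, 0]}
    for traj, _, escalated, risk, failed in sorted(steps, key=lambda s: (s[0], s[1])):
        group = "post_escalation" if seen_escalation[traj] else "slm_only"
        totals[group][0] += (risk - failed) ** 2
        totals[group][1] += 1
        if escalated:
            seen_escalation[traj] = True  # later steps now follow a teacher action
    return {g: (s / n if n else None) for g, (s, n) in totals.items()}


# Made-up log: ep1 escalates at step 1, so its step 2 is a "mixed" state.
log = [
    ("ep1", 0, False, 0.1, 0),
    ("ep1", 1, True, 0.8, 1),
    ("ep1", 2, False, 0.1, 1),
    ("ep2", 0, False, 0.2, 0),
]
print(brier_by_history(log))
```

A markedly worse score in the `post_escalation` group would be evidence for the covariate shift the referee describes; similar scores would suggest the calibration carries over.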
minor comments (2)
- [Abstract] The abstract states the heuristic-router comparison yields 'roughly half' the cost but does not define the heuristic or report the exact baseline escalation percentages and success rates for direct comparison.
- [Evaluation] No error bars, standard deviations, or number of random seeds are mentioned for the success and escalation figures; adding these would clarify whether the frontier improvements are statistically distinguishable from the SLM-only and heuristic baselines.
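As an illustration of the kind of uncertainty estimate requested here, the following sketch computes a percentile bootstrap interval for a success rate from 0/1 episode outcomes. The resample count, seed, and toy data are assumptions for the example, not the paper's evaluation protocol.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    outcomes: Sequence[int],
    n_resamples: int = 2000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of 0/1 episode outcomes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample episodes with replacement and record each resample's success rate.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((1.0 - level) / 2.0 * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 + level) / 2.0 * n_resamples))]
    return lo, hi


# Made-up outcomes: 90 successes out of 100 episodes.
episodes = [1] * 90 + [0] * 10
print(bootstrap_ci(episodes))
```

The same function applies to per-step escalation indicators, which would give an interval for the escalation fraction as well.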
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript. We address the major comment below regarding the router training procedure and have incorporated revisions to clarify and strengthen the relevant sections.
Point-by-point responses
Referee: [Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.
Authors: We acknowledge the referee's observation about the router's training distribution. The CVaR objective and perturbations are indeed derived from SLM-only trajectories, to capture variability in observations and action outcomes. However, all reported performance metrics (success rates, escalation percentages, and cost tradeoffs on HumanEval+, TextWorld, and TerminalBench) are measured in full end-to-end deployment, where LLM escalations occur naturally and generate mixed trajectories. These empirical results therefore already reflect the router's behavior under the actual deployment distribution. To further address the concern, we have revised the manuscript to include an explicit discussion of this training-deployment mismatch in Section 4.3 and added an ablation study that incorporates a small number of mixed trajectories into the router training data; the ablation shows that the reported calibration and escalation rates remain stable. We believe these changes strengthen the presentation without changing the core claims or methodology.
Revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
Full rationale
The paper first trains and freezes a stable SLM policy via behavioral cloning followed by verifier-guided DPO. The router is then trained separately on that fixed policy's residual failures using Brier calibration and a CVaR objective over perturbation seeds. Reported metrics (e.g., 94.3% success at 0.60% escalation) are measured outcomes on evaluation trajectories, not quantities that reduce by construction to the training fit itself. No equations equate a prediction directly to its input parameters, and no load-bearing self-citations or uniqueness theorems are invoked in the described chain. The central claim therefore rests on empirical evaluation rather than definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the SLM policy remains fixed after its BC+DPO training while the router is trained on its residual failures.
invented entities (1)
- Calibrated step-level router (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "R2V factorizes execution into four components: an efficient SLM policy π_θ, a stronger teacher LLM π_T, a lightweight process verifier V_ϕ(x_t, a_t) that scores candidate actions, and a router r_ψ(f_t) that estimates whether the current step should be escalated."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Algorithms for CVaR Optimization in MDPs
URL https://arxiv.org/abs/1406.3339. Cognition. Introducing devin, the first ai software engineer. Cognition Blogs,
-
[2]
Hybrid llm: Cost-efficient and quality-aware query routing
URL https://openreview.net/forum?id=qe8BfREMrb. Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024a. URL https://arxiv.org/abs/2404.14618. Dujian Ding, Ankur Mallick, Chi Wan...
-
[3]
URL https://arxiv.org/abs/2407.21783. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks,
-
[4]
On Calibration of Modern Neural Networks
URL https://arxiv.org/abs/1706.04599. Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system,
-
[5]
RouterBench: A Benchmark for Multi-LLM Routing System
URL https://arxiv.org/abs/2403.12031. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,
-
[6]
URL https://arxiv.org/abs/2302.09664. Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct vi...
-
[7]
Training Language Models to Self-Correct via Reinforcement Learning
URL https://arxiv.org/abs/2409.12917. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention,
-
[8]
Efficient Memory Management for Large Language Model Serving with PagedAttention
URL https://arxiv.org/abs/2309.06180. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step,
-
[9]
URL https://arxiv.org/abs/2305.20050. J. Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151,
-
[10]
doi: 10.1109/18.61115. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems,
-
[11]
Decoupled Weight Decay Regularization
URL https://arxiv.org/abs/1711.05101. Mike A. Merrill, Alexander G. Shaw, and Nicholas Carlini et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces,
-
[12]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
URL https://arxiv.org/abs/2601.11868. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In International Conference on Learning Representations,
-
[13]
GAIA: a benchmark for General AI Assistants
URL https://arxiv.org/abs/2311.12983. Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations,
-
[14]
URL https://openreview.net/forum?id=8sSqNntaMr. Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan. Prompt perturbation consistency learning for robust language models. In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1357–1370. Association for Computational Linguist...
-
[15]
URL https://arxiv.org/abs/2412.15115. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741,
-
[16]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021a. URL https://arxiv.org/abs/2010.03768. Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trisc...
-
[17]
TDD for Embedded Systems: A Basic Approach and Toolset
URL https://arxiv.org/abs/1507.07969. Gemma Team. Gemma 2: Improving open language models at a practical size,
-
[18]
Gemma 2: Improving Open Language Models at a Practical Size
URL https://arxiv.org/abs/2408.00118. Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations,
-
[19]
Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
URL https://arxiv.org/abs/2312.08935. Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, and Huaxiu Yao. CREAM: Consistency regularized self-rewarding language models. In International Conference on Learning Representations, 2025a. URL https://arxiv.org/abs/2410.12735. Zhaoyang Wang, Weilei He, Zhiyuan Liang,...
-
[20]
URL https://arxiv.org/abs/2410.02223.
A Algorithms
Algorithm 1: R2V-Agent Training Pipeline
Require: Teacher LLM π_T, initial SLM π_θ, verifier V_ϕ, perturbation operators {P_k}
1: Collect expert trajectories D_exp using π_T on training tasks
2: Apply {P_k} across seeds z ∈ Z to obtain perturbed trajectories D_pert and form offline trajectory pool D_traj ← D_exp ∪...
-
[21]
Each clean trajectory is replayed under 5 independently sampled perturbation seeds to produce the noisy training and evaluation distributions. HumanEval+. We use the full EvalPlus benchmark (Liu et al., 2023), comprising 164 Python programming problems. Each problem is presented as a function signature with a docstring. The agent interacts with a three-act...
-
[22]
The router has approximately 10,000 parameters and runs entirely on CPU
is applied to the output logits. The router has approximately 10,000 parameters and runs entirely on CPU. At each step t, the distilled SLM samples K = 5 candidate actions a_t^(1), ..., a_t^(K) with vLLM (Kwon et al., 2023). The verifier scores all candidates, and the resulting 15-dimensional feature vector f_t contains token-level entropy and log-probabi...
-
[23]
Table 5: R2V-Agent per-model results under noisy evaluation with 95% bootstrap confidence intervals
The router is trained for 20 epochs with cosine learning-rate annealing on batches of 4,096 steps.
Table 5: R2V-Agent per-model results under noisy evaluation with 95% bootstrap confidence intervals. Oracle is shown as a non-deployable hindsight reference.
Benchmark    Model          SR (%)   95% CI          LLM%
HumanEval+   Gemma-9B       91.9     [89.6, 93.8]    0.50%
             LLaMA-3.1-8B   95.8     [94.3,...