R2V Agent: Teaching SLMs When to Ask for Help

Humaira Firdowse Mohammed; Raghu Vamshi Hemadri; Rishabh Maheshwary; Sagar Davasam; Sai Rajeswar; Srinivas Sunkara; Srivatsava Daruru; Vikas Yadav

arxiv: 2605.16604 · v1 · pith:PNF2PRQ2new · submitted 2026-05-15 · 💻 cs.LG

Pith reviewed 2026-05-20 20:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords: R2V-Agent · SLM-LLM routing · risk calibration · process verifier · interactive agents · reliability-cost frontier · step-level router · agentic systems

The pith

A calibrated router lets small language models run interactive agents and escalates to large models only on steps where failure is likely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build agents that run mostly on cheap small language models yet still reach high reliability by calling a large model at risky moments during execution. Because difficulty shifts mid-task (after a flaky tool call or an accumulated error, for example), the system first trains a small policy and then learns a router that flags that policy's residual failures using a verifier and risk-aware training. Experiments on coding, text-adventure, and terminal tasks report higher success rates at lower escalation cost than previous routing methods. This matters because it makes capable agents practical without calling an expensive model at every step.

Core claim

R2V-Agent is a risk-calibrated SLM-LLM routing framework for interactive agents. It first trains a stable small language model policy through behavioral cloning on teacher trajectories followed by verifier-guided direct preference optimization. A lightweight process verifier then scores candidate actions at each step, and a step-level router is trained on the fixed policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective. This produces escalation decisions that improve the reliability-cost frontier across HumanEval+, TextWorld, and TerminalBench.
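The two quantities named in the router objective can be illustrated with a minimal NumPy sketch. This is not the paper's loss: it only computes an empirical Brier score for predicted failure probabilities and an empirical CVaR over per-seed losses. The function names, the `alpha` level, and the sample numbers are assumptions made for illustration.

```python
import numpy as np

def brier_score(pred_fail_prob, failed):
    """Mean squared gap between predicted failure probabilities and 0/1 outcomes."""
    p = np.asarray(pred_fail_prob, dtype=float)
    y = np.asarray(failed, dtype=float)
    return float(np.mean((p - y) ** 2))

def cvar(losses, alpha=0.8):
    """Empirical CVaR: mean of the losses at or above the alpha-quantile (the worst tail)."""
    x = np.asarray(losses, dtype=float)
    threshold = np.quantile(x, alpha)  # value-at-risk at level alpha
    tail = x[x >= threshold]           # never empty: the maximum always qualifies
    return float(tail.mean())

# Hypothetical failure rates of one task replayed under 5 perturbation seeds.
per_seed_failure = [0.05, 0.10, 0.08, 0.40, 0.12]

print("Brier:", brier_score([0.9, 0.2, 0.1], [1, 0, 0]))
print("mean loss:", float(np.mean(per_seed_failure)))
print("CVaR(0.8):", cvar(per_seed_failure, alpha=0.8))
```

The CVaR term looks only at the worst seeds, so a single badly perturbed replay dominates it even when the average loss is small. That is the sense in which such an objective penalizes worst-case failures.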

What carries the argument

A calibrated step-level router, trained with Brier-calibrated probability estimation and a CVaR constraint, that estimates the fixed small policy's residual failure risk at each step and escalates to the teacher LLM only when that estimate exceeds a routing threshold.
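A step-level escalation rule of this kind can be sketched as follows. This is a toy under stated assumptions, not the authors' code: `route_step`, `StepDecision`, the stand-in policies, and the threshold value `tau=0.5` are all hypothetical, and the risk function here is a keyword check rather than a trained router.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepDecision:
    action: str      # action that is actually executed
    risk: float      # estimated residual failure risk of the SLM proposal
    escalated: bool  # True if the teacher LLM produced the action

def route_step(obs: str,
               slm: Callable[[str], str],
               teacher: Callable[[str], str],
               risk_fn: Callable[[str, str], float],
               tau: float) -> StepDecision:
    """The SLM acts by default; the teacher is called only if risk exceeds tau."""
    proposal = slm(obs)
    risk = risk_fn(obs, proposal)
    if risk > tau:
        return StepDecision(teacher(obs), risk, True)
    return StepDecision(proposal, risk, False)

def escalation_fraction(decisions: List[StepDecision]) -> float:
    """Share of steps on which the teacher LLM was invoked."""
    return sum(d.escalated for d in decisions) / len(decisions)

# Toy stand-ins (assumptions): real policies would be model calls.
def toy_slm(obs: str) -> str:
    return "retry" if "error" in obs else "continue"

def toy_teacher(obs: str) -> str:
    return "diagnose-and-fix"

def toy_risk(obs: str, action: str) -> float:
    return 0.9 if "error" in obs else 0.1

observations = ["ok", "tool error: timeout", "ok", "ok"]
decisions = [route_step(o, toy_slm, toy_teacher, toy_risk, tau=0.5)
             for o in observations]
print([d.action for d in decisions], escalation_fraction(decisions))
```

Because the decision is made per step rather than per query, a trajectory that starts easy and then hits a tool error escalates only at the step where the risk estimate rises.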

Load-bearing premise

The lightweight process verifier can accurately score how likely the small model is to fail on candidate actions so the router produces reliable escalation decisions that generalize beyond the training perturbations.

What would settle it

Test the complete R2V system on a new interactive task whose failure modes differ from those generated by the perturbation seeds used to train the router, and check whether success rates and escalation fractions remain close to the reported values.

Figures

Figures reproduced from arXiv: 2605.16604 by Humaira Firdowse Mohammed, Raghu Vamshi Hemadri, Rishabh Maheshwary, Sagar Davasam, Sai Rajeswar, Srinivas Sunkara, Srivatsava Daruru, Vikas Yadav.

**Figure 2.** R2V-Agent pipeline. Phase I: Teacher trajectories are perturbed to train a BC-initialized SLM with verifier-guided DPO and consistency regularization; verifier and policy features then train a Brier-calibrated and CVaR-calibrated router. Phase II: At inference, the SLM acts by default, while the teacher LLM is invoked only when the router’s residual-risk estimate exceeds $\tau^{*}_{\text{route}}$.

**Figure 3.** Cost-performance Pareto frontier. Each R2V point corresponds to one SLM backbone with 95% bootstrap confidence intervals. R2V gives near-free gains on HumanEval+, closely tracks the oracle on TextWorld, and recovers substantial SR for weaker TerminalBench backbones while remaining below heuristic-router cost.

read the original abstract

Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory (after flaky tool calls, truncated observations, or compounding local errors), making pre-execution routing brittle. We introduce **R2V-Agent**, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves $94.3\%$ HumanEval+ success with $0.60\%$ LLM escalation, recovers TextWorld from $64.6\%$ SLM-only success to $98.2\%$ at $41.7\%$ escalation, and reaches $93.3\%$ TerminalBench success at $33.9\%$ LLM calls, roughly half the heuristic-router cost.
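To make the escalation percentages concrete, here is a back-of-the-envelope cost sketch. It assumes the SLM runs at every step and the teacher LLM is charged only on escalated steps. The per-step costs (`slm_cost=0.05`, `llm_cost=1.0`) are hypothetical placeholders, not figures from the paper; only the three escalation rates come from the abstract.

```python
def relative_cost(escalation_rate: float,
                  slm_cost: float = 0.05,
                  llm_cost: float = 1.0) -> float:
    """Expected per-step cost: SLM always runs, LLM runs on the escalated fraction."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must lie in [0, 1]")
    return slm_cost + escalation_rate * llm_cost

# Escalation rates reported in the abstract; the cost units are assumed.
reported = {"HumanEval+": 0.006, "TextWorld": 0.417, "TerminalBench": 0.339}
for name, rate in reported.items():
    print(f"{name}: {relative_cost(rate):.3f} x one LLM step")
```

Under this simple model the per-step cost is linear in the escalation rate, so halving the fraction of LLM calls roughly halves the LLM share of the bill regardless of the assumed cost ratio.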

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

R2V trains a post-hoc router on a frozen SLM's residual failures using CVaR and a verifier, delivering concrete reliability-cost gains on three benchmarks, but the router never sees the mixed trajectories that arise once escalations occur.

The main thing to know is that this paper packages a four-part agent system: an SLM policy trained offline, a teacher LLM, a lightweight process verifier, and a separate step-level router that escalates only on estimated residual risk. The router is trained after the SLM is frozen, using Brier calibration and a CVaR objective over perturbation seeds drawn from the SLM's own trajectories. That produces the headline numbers: 94.3% success on HumanEval+ at 0.6% escalation, recovery of TextWorld to 98.2% at 41.7% calls, and 93.3% on TerminalBench at roughly half the heuristic-router cost across four SLM backbones.

The empirical results are the clearest strength; they give a practical picture of the frontier on these tasks and show the method beats simple baselines in the reported setups. The offline training pipeline (BC followed by verifier-guided DPO with consistency regularization) is standard but executed cleanly enough to support the later routing claims.

The soft spot is the distribution shift the stress-test note flags. The router learns risk only from fixed-SLM rollouts. When it escalates and the LLM acts, the next state for the SLM is no longer drawn from the training distribution, yet no experiment in the abstract or described results checks whether the CVaR calibration still holds under those mixed paths. That assumption is load-bearing for the deployment story. The verifier's own accuracy and training are also under-specified relative to how central it appears to the risk estimates.

This paper is for people building or deploying hybrid SLM-LLM agents who need step-level rather than query-level routing. A reader working on cost control in interactive settings will find usable architecture details and benchmark comparisons. It deserves a serious referee because the core method is reproducible in outline, the quantitative claims are specific, and the cost-saving angle is worth testing even if the generalization gap needs work.

I would send it to review and ask reviewers to focus on whether the router remains calibrated once LLM interventions are present.

Referee Report

1 major / 2 minor

Summary. The paper introduces R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. After training a stable SLM policy via behavioral cloning followed by verifier-guided DPO with consistency regularization, a lightweight process verifier scores candidate actions and a step-level router is trained on the fixed SLM policy's residual failures using Brier-calibrated probability estimation and a CVaR-constrained objective over perturbation seeds. The central claim is that this produces reliable escalation decisions, yielding improved reliability-cost frontiers: 94.3% success on HumanEval+ at 0.60% LLM escalation, recovery of TextWorld to 98.2% success at 41.7% escalation, and 93.3% success on TerminalBench at 33.9% LLM calls (roughly half heuristic-router cost) across four SLM backbones.

Significance. If the empirical results and generalization hold, the work provides a concrete mechanism for dynamic, mid-trajectory routing that addresses limitations of static query-level cascades. The combination of a frozen SLM policy, process verifier, and CVaR-regularized router training offers a reproducible template for cost-efficient agent deployment; the reported quantitative gains on three distinct benchmarks constitute a falsifiable prediction that can be directly tested by other groups.

major comments (1)

[Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.
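One way to probe this concern is to compare the router's calibration on steps taken before any escalation against steps that follow one. The sketch below computes a binned expected calibration error (ECE) for each group. The function names, the grouping flag, and the toy data are assumptions for illustration, not an experiment from the paper.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Binned ECE: weighted mean |predicted failure rate - observed failure rate|."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    if p.size == 0:
        return float("nan")  # no steps in this group
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return float(ece)

def calibration_by_context(probs, outcomes, after_escalation):
    """Split steps by whether an LLM escalation already occurred in the trajectory."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    after = np.asarray(after_escalation, dtype=bool)
    return {
        "slm_only": expected_calibration_error(p[~after], y[~after]),
        "post_escalation": expected_calibration_error(p[after], y[after]),
    }

# Toy data (hypothetical): the router always predicts 20% failure risk.
probs = [0.2] * 10
outcomes = [1, 0, 0, 0, 0,   # SLM-only steps: 20% actually fail
            1, 1, 1, 0, 0]   # post-escalation steps: 60% actually fail
after_escalation = [False] * 5 + [True] * 5
print(calibration_by_context(probs, outcomes, after_escalation))
```

A large gap between the two numbers on real rollouts would indicate that risk estimates learned from SLM-only trajectories do not transfer to mixed trajectories. Similar values would support the assumption.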

minor comments (2)

[Abstract] The abstract states the heuristic-router comparison yields 'roughly half' the cost but does not define the heuristic or report the exact baseline escalation percentages and success rates for direct comparison.
[Evaluation] No error bars, standard deviations, or number of random seeds are mentioned for the success and escalation figures; adding these would clarify whether the frontier improvements are statistically distinguishable from the SLM-only and heuristic baselines.
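For reference, the Figure 3 caption mentions 95% bootstrap confidence intervals. A percentile bootstrap for a success rate can be sketched as below. The resample count, seed, and the synthetic outcome vector are assumptions; the paper's exact bootstrap procedure is not described on this page.

```python
import numpy as np

def bootstrap_ci(successes, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for a mean success rate."""
    rng = np.random.default_rng(seed)
    x = np.asarray(successes, dtype=float)
    # Resample episodes with replacement and recompute the success rate each time.
    idx = rng.integers(0, len(x), size=(n_resamples, len(x)))
    means = x[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(x.mean()), float(lo), float(hi)

# Synthetic outcomes (hypothetical): 90 successes out of 100 episodes.
outcomes = [1] * 90 + [0] * 10
rate, lo, hi = bootstrap_ci(outcomes)
print(f"success rate {rate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside each success and escalation figure would let a reader judge whether two points on the frontier are distinguishable.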

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We address the major comment below regarding the router training procedure and have incorporated revisions to clarify and strengthen the relevant sections.

Referee: [Router training and deployment procedure] Router training (described after the SLM policy is frozen): the CVaR objective and perturbation seeds are drawn exclusively from SLM-only trajectories. In deployment, an LLM escalation produces a new observation that becomes input to the next SLM step, creating mixed trajectories absent from the training distribution. This covariate shift is not controlled by the reported perturbation procedure, so the calibration of escalation decisions (and thus the reported escalation rates such as 41.7% on TextWorld) rests on an untested assumption.

Authors: We acknowledge the referee's observation about the training distribution for the router. The CVaR objective and perturbations are indeed derived from SLM-only trajectories to capture variability in observations and action outcomes. However, all reported performance metrics (including success rates, escalation percentages, and cost tradeoffs on HumanEval+, TextWorld, and TerminalBench) are measured in full end-to-end deployment, where LLM escalations naturally occur and generate mixed trajectories. These empirical results therefore already reflect the router's behavior under the actual deployment distribution. To further address the concern, we have revised the manuscript to include an explicit discussion of this training-deployment mismatch in Section 4.3 and added an ablation study that incorporates a small number of mixed trajectories into router training data; the ablation shows that the reported calibration and escalation rates remain stable. We believe these changes strengthen the presentation without changing the core claims or methodology.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper first trains and freezes a stable SLM policy via behavioral cloning followed by verifier-guided DPO. The router is then trained separately on that fixed policy's residual failures using Brier calibration and a CVaR objective over perturbation seeds. Reported metrics (e.g., 94.3% success at 0.60% escalation) are measured outcomes on evaluation trajectories, not quantities that reduce by construction to the training fit itself. No equations equate a prediction directly to its input parameters, and no load-bearing self-citations or uniqueness theorems are invoked in the described chain. The central claim therefore rests on empirical evaluation rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on standard supervised and preference-optimization pipelines plus two new modeling choices: a lightweight verifier and a CVaR objective over perturbation seeds. Because only the abstract was reviewed, this ledger cannot list every background assumption.

axioms (1)

domain assumption: The SLM policy remains fixed after its BC+DPO training when the router is trained on its residual failures.
The abstract states 'after the SLM is trained, it estimates residual failure risk' and 'the router is then trained on this fixed policy's residual failures'.

invented entities (1)

Calibrated step-level router · no independent evidence
purpose: Estimates per-step residual failure risk to decide LLM escalation.
Presented as the central contribution; no external validation or prior equivalent is cited in the abstract.

pith-pipeline@v0.9.0 · 5897 in / 1452 out tokens · 98994 ms · 2026-05-20T20:20:53.982863+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The router is then trained on this fixed policy’s residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "R2V factorizes execution into four components: an efficient SLM policy $\pi_\theta$, a stronger teacher LLM $\pi_T$, a lightweight process verifier $V_\phi(x_t, a_t)$ that scores candidate actions, and a router $r_\psi(f_t)$ that estimates whether the current step should be escalated."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 16 internal anchors

[1] Algorithms for CVaR Optimization in MDPs
URL https://arxiv.org/abs/1406.3339. Cognition. Introducing devin, the first ai software engineer. Cognition Blogs,

[2] Hybrid llm: Cost-efficient and quality-aware query routing
URL https://openreview.net/forum?id=qe8BfREMrb. Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024a. URL https://arxiv.org/abs/2404.14618. Dujian Ding, Ankur Mallick, Chi Wan...

[3] The Llama 3 Herd of Models
URL https://arxiv.org/abs/2407.21783. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks,

[4] On Calibration of Modern Neural Networks
URL https://arxiv.org/abs/1706.04599. Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. Routerbench: A benchmark for multi-llm routing system,

[5] RouterBench: A Benchmark for Multi-LLM Routing System
URL https://arxiv.org/abs/2403.12031. Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation,

[6] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation
URL https://arxiv.org/abs/2302.09664. Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D. Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M. Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct vi...

[7] Training Language Models to Self-Correct via Reinforcement Learning
URL https://arxiv.org/abs/2409.12917. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention,

[8] Efficient Memory Management for Large Language Model Serving with PagedAttention
URL https://arxiv.org/abs/2309.06180. Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step,

[9]
URL https://arxiv.org/abs/2305.20050. J. Lin. Divergence measures based on the shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151,

[10] 1991 , publisher =
doi: 10.1109/18.61115. Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems,

[11] Decoupled Weight Decay Regularization
URL https://arxiv.org/abs/1711.05101. Mike A. Merrill, Alexander G. Shaw, and Nicholas Carlini et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces,

[12] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
URL https://arxiv.org/abs/2601.11868. Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. In International Conference on Learning Representations,

[13] GAIA: a benchmark for General AI Assistants
URL https://arxiv.org/abs/2311.12983. Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs from preference data. In The Thirteenth International Conference on Learning Representations,

[14] Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan
URL https://openreview.net/forum?id=8sSqNntaMr. Yao Qiang, Subhrangshu Nandi, Ninareh Mehrabi, Greg Ver Steeg, Anoop Kumar, Anna Rumshisky, and Aram Galstyan. Prompt perturbation consistency learning for robust language models. In Findings of the Association for Computational Linguistics: EACL 2024, pp. 1357–1370. Association for Computational Linguist...
doi:10.18653/v1/2024.findings-eacl.91

[15] Qwen2.5 Technical Report
URL https://arxiv.org/abs/2412.15115. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741,

[16] ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2021a. URL https://arxiv.org/abs/2010.03768. Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trisc...

[17] TDD for Embedded Systems: A Basic Approach and Toolset
URL https://arxiv.org/abs/1507.07969. Gemma Team. Gemma 2: Improving open language models at a practical size,

[18] Gemma 2: Improving Open Language Models at a Practical Size
URL https://arxiv.org/abs/2408.00118. Peiyi Wang, Lei Li, Zhihong Shao, R. X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations,

[19] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations
URL https://arxiv.org/abs/2312.08935. Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, and Huaxiu Yao. CREAM: Consistency regularized self-rewarding language models. In International Conference on Learning Representations, 2025a. URL https://arxiv.org/abs/2410.12735. Zhaoyang Wang, Weilei He, Zhiyuan Liang,...

[20]
URL https://arxiv.org/abs/2410.02223.
A Algorithms
Algorithm 1: R2V-Agent Training Pipeline
Require: Teacher LLM $\pi_T$, initial SLM $\pi_\theta$, verifier $V_\phi$, perturbation operators $\{P_k\}$
1: Collect expert trajectories $D_{\text{exp}}$ using $\pi_T$ on training tasks
2: Apply $\{P_k\}$ across seeds $z \in Z$ to obtain perturbed trajectories $D_{\text{pert}}$ and form offline trajectory pool $D_{\text{traj}} \leftarrow D_{\text{exp}} \cup$...

[21] New Terminal Output:
Each clean trajectory is replayed under 5 independently sampled perturbation seeds to produce the noisy training and evaluation distributions. HumanEval+. We use the full EvalPlus benchmark (Liu et al., 2023), comprising 164 Python programming problems. Each problem is presented as a function signature with a docstring. The agent interacts with a three-act...

[22] The router has approximately 10,000 parameters and runs entirely on CPU
is applied to the output logits. The router has approximately 10,000 parameters and runs entirely on CPU. At each step $t$, the distilled SLM samples $K = 5$ candidate actions $a_t^{(1)}, \ldots, a_t^{(K)}$ with vLLM (Kwon et al., 2023). The verifier scores all candidates, and the resulting 15-dimensional feature vector $f_t$ contains token-level entropy and log-probabi...

[23] Table 5: R2V-Agent per-model results under noisy evaluation with 95% bootstrap confidence intervals
The router is trained for 20 epochs with cosine learning-rate annealing on batches of 4,096 steps. Table 5: R2V-Agent per-model results under noisy evaluation with 95% bootstrap confidence intervals. Oracle is shown as a non-deployable hindsight reference.
Columns: Benchmark · Model · SR (%) · 95% CI · LLM%
HumanEval+ · Gemma-9B · 91.9 · [89.6, 93.8] · 0.50%
LLaMA-3.1-8B · 95.8 · [94.3,...
