RL with Learnable Textual Feedback: A Bilevel Approach

Amrit Singh Bedi; Sidhaarth Sredharan; Souradip Chakraborty; Utsav Singh

arxiv: 2605.24547 · v1 · pith:WLW2XT2Unew · submitted 2026-05-23 · 💻 cs.LG

Pith reviewed 2026-06-30 14:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords: reinforcement learning, textual feedback, bilevel optimization, large language models, reasoning, actor-critic

The pith

Formalizing textual feedback as a learnable output in a Stackelberg bilevel program lets a jointly trained critic improve an actor's reasoning more efficiently than fixed feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that feedback in RL for LLMs must be optimized for the specific actor policy that will use it, rather than treated as fixed or independently correct. It sets up the interaction as a bilevel optimization where the critic leads by producing feedback that maximizes the actor's subsequent improvement. Bi-NAC solves this joint problem and yields models that reach higher accuracy on reasoning benchmarks while using fewer parameters and samples than standard RL or fixed-critic baselines. The approach rests on the idea that the value of feedback is defined by how well the actor exploits it in context.

Core claim

Bi-NAC solves a Stackelberg bilevel program in which the critic generates natural-language feedback chosen to maximize the actor's performance improvement when that feedback is supplied in context, while the actor learns to better exploit such feedback; the resulting pair outperforms both standard RL and fixed-critic baselines on MATH-500, MBPP, and GPQA, with a 2B model exceeding a 3B GRPO baseline and a 6B model exceeding a 7B GRPO baseline.

What carries the argument

The Stackelberg bilevel program that couples the critic's generation of reward-improving feedback with the actor's policy update.
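
As a reading aid, a generic Stackelberg bilevel program can be sketched as follows. The notation is an assumption introduced here, not taken from the paper: $\phi$ denotes the critic's parameters, $\theta$ the actor's parameters, $F$ the leader (critic) objective, and $G$ the follower (actor) objective. The paper's actual objectives are said to be derived in its Section 3 and are not reproduced here. The second equation is the standard total derivative, which holds only when the follower's best response $\theta^{*}(\phi)$ is unique and differentiable.

```latex
\begin{aligned}
\max_{\phi}\quad & F\bigl(\phi,\, \theta^{*}(\phi)\bigr)
  && \text{(leader: critic)} \\
\text{s.t.}\quad & \theta^{*}(\phi) \in \arg\max_{\theta}\; G(\theta, \phi)
  && \text{(follower: actor)}
\end{aligned}

\frac{\mathrm{d}F}{\mathrm{d}\phi}
  = \frac{\partial F}{\partial \phi}
  + \left(\frac{\partial \theta^{*}}{\partial \phi}\right)^{\!\top}
    \left.\frac{\partial F}{\partial \theta}\right|_{\theta = \theta^{*}(\phi)}
```

The second term of the derivative is what separates a leader-follower update from a plain joint update: the leader accounts for how the follower's solution shifts when the feedback changes.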

If this is right

The 2B Bi-NAC model reaches 46.6 percent on MATH-500 while the 3B GRPO baseline reaches 41.4 percent.
The 6B Bi-NAC model reaches 49.3 percent on GPQA while the 7B GRPO baseline reaches 43.6 percent.
Joint optimization reduces both sample and parameter requirements relative to RL with fixed textual feedback.
The same bilevel structure applies across math, coding, and general-knowledge reasoning benchmarks.
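
The size of the reported gaps can be checked with a few lines of arithmetic. The sketch below uses only the accuracies and model sizes quoted above; the dictionary layout and the function name `summarize` are illustrative choices, not anything from the paper.

```python
# Reported accuracies (percent) and model sizes (billions of parameters),
# copied from the abstract. The data layout is an illustrative assumption.
reported = {
    "MATH-500": {"binac": (2, 46.6), "grpo": (3, 41.4)},
    "GPQA": {"binac": (6, 49.3), "grpo": (7, 43.6)},
}


def summarize(entry):
    """Return the absolute gain, relative gain, and parameter ratio."""
    (p_small, acc_small), (p_big, acc_big) = entry["binac"], entry["grpo"]
    gain = acc_small - acc_big
    return {
        "abs_gain_points": round(gain, 1),
        "rel_gain_percent": round(100 * gain / acc_big, 1),
        "param_ratio": round(p_small / p_big, 2),
    }


summary = {name: summarize(entry) for name, entry in reported.items()}
for name, row in summary.items():
    print(name, row)
```

On these numbers the Bi-NAC models are ahead by 5.2 and 5.7 percentage points while using about 0.67 and 0.86 times the parameters of the respective GRPO baselines.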

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The bilevel framing could extend to other settings where one model generates context for another, such as tool-use or multi-agent planning.
If the actor-critic coupling is the dominant factor, then scaling the critic independently may yield diminishing returns once the bilevel objective is close to its optimum.
The method leaves open whether the learned feedback remains effective when transferred to a different actor architecture or task distribution.

Load-bearing premise

Feedback is useful only to the extent that the current actor policy can exploit it when the feedback is placed in context, so the correct model of their coupling is a Stackelberg bilevel program.

What would settle it

A controlled comparison in which a separately trained critic produces feedback that yields the same actor improvement as the jointly trained Bi-NAC critic on identical tasks and model sizes would falsify the necessity of the bilevel coupling.
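
One way such a comparison could be scored is a paired bootstrap over per-problem correctness, since both critics would be evaluated on the same tasks. The sketch below is an assumption about how one might run that analysis, not a procedure from the paper; `paired_bootstrap_diff` and the example vectors are hypothetical.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Approximate 95% bootstrap interval for accuracy(a) - accuracy(b).

    correct_a and correct_b are 0/1 lists over the same problems, in the
    same order, so each resample keeps the pairing intact.
    """
    assert len(correct_a) == len(correct_b)
    n = len(correct_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical per-problem outcomes (1 = solved) for two critics.
joint_critic = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
separate_critic = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
print(paired_bootstrap_diff(joint_critic, separate_critic))
```

An interval that comfortably contains zero would be consistent with the separately trained critic matching the joint one; an interval well above zero would support the bilevel coupling.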

Original abstract

Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The bilevel Stackelberg framing for learnable textual feedback is the core idea, but the paper needs to show why that structure drives the reported efficiency gains rather than joint training in general.

The paper's main move is to treat textual feedback as something that should be optimized jointly with the actor policy because its usefulness depends on how well the policy can use it. They formalize this as a Stackelberg bilevel program and introduce Bi-NAC to train the critic and actor together.

On the results side, they show their 2B model beating a 3B GRPO baseline on MATH-500 (46.6% vs 41.4%) and their 6B model beating a 7B baseline on GPQA (49.3% vs 43.6%). That kind of parameter efficiency is the practical hook, and the numbers are presented as direct outcomes across three benchmarks.

The new part is the explicit bilevel coupling instead of treating the critic as fixed or auxiliary. Prior work on textual feedback in RL is cited as the baseline, so the joint training objective looks like a step beyond that.

The soft spot is the justification for the Stackelberg structure itself. The abstract claims the problem is inherently bilevel because feedback usefulness depends on the actor's ability to exploit it, but it does not show why other joint objectives or simple alternation would fail to capture the same dependence. Without that, or without ablations that isolate the bilevel solver, it is hard to tie the efficiency gains specifically to the formalization rather than to training both models at once.

No solver details, error bars, or ablations appear in the abstract, which leaves the empirical claims harder to evaluate.

This is for people working on RL for LLM reasoning who want to make feedback adaptive. A reader already thinking about bilevel methods or critic-actor coupling could pull useful ideas from the setup.

It has a clear proposal and concrete numbers, so it deserves a serious referee even if the motivation for the exact bilevel choice needs tightening.

Referee Report

2 major / 1 minor

Summary. The paper claims that RL with learnable textual feedback is inherently a bilevel optimization problem, because the usefulness of feedback depends on the actor policy's ability to exploit it. It formalizes this as a Stackelberg program and derives Bi-NAC, which jointly trains a critic (leader) that generates reward-improving feedback and an actor (follower) that exploits it. It reports better sample and parameter efficiency than RL and fixed-critic baselines: a 2B Bi-NAC model reaches 46.6% vs. 41.4% for a 3B GRPO baseline on MATH-500, and a 6B Bi-NAC model reaches 49.3% vs. 43.6% for a 7B GRPO baseline on GPQA.

Significance. If the bilevel formalization is shown to be necessary and the gains are reproducible with ablations and error bars, the work could offer a structured approach to making textual feedback adaptive in LLM reasoning RL, with notable implications for efficiency when smaller models outperform larger fixed baselines. The empirical claims are the primary strength, but their attribution to the Stackelberg structure rather than joint training in general remains unverified from the supplied abstract.

major comments (2)

[Abstract] Abstract, paragraph 3: the assertion that 'the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem' is presented without a derivation, comparison to simultaneous gradient methods or shared-loss alternatives, or argument why non-Stackelberg joint objectives would fail to capture the same dependence; this is load-bearing for attributing the reported efficiency gains (e.g., 2B vs 3B on MATH-500) specifically to Bi-NAC rather than extra parameters or alternation.
[Abstract] Abstract: no description of the bilevel solver, no error bars on the reported accuracies, and no ablation of the Stackelberg assumption or fixed-critic baselines are supplied, preventing verification that the central efficiency claims arise from the proposed program.
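
To give a rough sense of why error bars matter here, the sketch below computes a normal-approximation interval for the two MATH-500 accuracies. It assumes each number comes from a single evaluation over all 500 problems of MATH-500, which the abstract does not state, and it captures only sampling noise over problems, not seed-to-seed variation.

```python
import math


def binomial_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n problems."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se


# Assumption: one pass over the full 500-problem benchmark per model.
N_MATH500 = 500
binac_lo, binac_hi = binomial_interval(0.466, N_MATH500)
grpo_lo, grpo_hi = binomial_interval(0.414, N_MATH500)

# True when the lower end of the Bi-NAC interval falls inside the GRPO interval.
intervals_overlap = binac_lo <= grpo_hi
print((round(binac_lo, 3), round(binac_hi, 3)),
      (round(grpo_lo, 3), round(grpo_hi, 3)),
      intervals_overlap)
```

Under these assumptions the two intervals overlap. That alone does not show the gap is insignificant, because both models are scored on the same problems and a paired test would be tighter, but it illustrates why the referee asks for reported uncertainty.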

minor comments (1)

[Abstract] Abstract: the benchmarks (MATH-500, MBPP, GPQA) and baselines (GRPO) are named but without reference to their standard definitions or exact evaluation protocols.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment point by point below, clarifying the manuscript's content and committing to revisions that strengthen the presentation of the bilevel motivation and empirical claims.

Referee: [Abstract] Abstract, paragraph 3: the assertion that 'the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem' is presented without a derivation, comparison to simultaneous gradient methods or shared-loss alternatives, or argument why non-Stackelberg joint objectives would fail to capture the same dependence; this is load-bearing for attributing the reported efficiency gains (e.g., 2B vs 3B on MATH-500) specifically to Bi-NAC rather than extra parameters or alternation.

Authors: Section 3 of the full manuscript derives the Stackelberg formulation by showing that the critic's objective is to select feedback that maximizes the actor's expected return after the actor optimizes its policy in response; this creates an explicit leader-follower dependence. We agree that the abstract would benefit from a brief reference to this structure and a comparison to simultaneous methods. We will revise the abstract and add a discussion in the introduction explaining why the hierarchical objective better captures the feedback-policy coupling than joint losses. revision: partial

Referee: [Abstract] Abstract: no description of the bilevel solver, no error bars on the reported accuracies, and no ablation of the Stackelberg assumption or fixed-critic baselines are supplied, preventing verification that the central efficiency claims arise from the proposed program.

Authors: The bilevel solver (alternating leader-follower updates) is specified in Section 4. Fixed-critic baselines are ablated in Section 5.2 with direct comparisons showing the benefit of learnable feedback. We will add a one-sentence description of the solver to the abstract. To further verify the Stackelberg structure, we will include an ablation against a simultaneous-training variant. Error bars from multiple seeds will be added where feasible or the single-seed protocol will be noted as a limitation. revision: yes
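
As a toy illustration of what alternating leader-follower updates can look like, the sketch below uses two scalars and quadratic objectives. It is not the paper's solver: the objectives, step sizes, and the functions `follower_steps`, `leader_loss`, and `train` are all hypothetical. The follower tracks the leader's variable, and the leader either differentiates through the follower's unrolled updates (Stackelberg-style) or ignores that response (a naive variant).

```python
def follower_steps(theta, phi, lr=0.25, k=5):
    """Run k gradient steps on the follower loss g = (theta - phi)^2.

    Also tracks d(theta)/d(phi) through the unrolled steps, treating the
    starting theta as a constant (a truncation, labeled as an assumption).
    """
    dtheta_dphi = 0.0
    for _ in range(k):
        theta = theta - lr * 2.0 * (theta - phi)
        dtheta_dphi = (1.0 - 2.0 * lr) * dtheta_dphi + 2.0 * lr
    return theta, dtheta_dphi


def leader_loss(theta, phi):
    """Toy leader loss: wants theta near 3 while keeping phi small."""
    return (theta - 3.0) ** 2 + 0.1 * phi ** 2


def train(stackelberg, outer_iters=200, outer_lr=0.1):
    """Alternate follower updates with one leader step per outer iteration."""
    theta, phi = 0.0, 1.0
    for _ in range(outer_iters):
        theta, dtheta_dphi = follower_steps(theta, phi)
        grad = 0.2 * phi  # direct partial derivative of the leader loss
        if stackelberg:
            # Response term: how the leader loss moves via the follower.
            grad += dtheta_dphi * 2.0 * (theta - 3.0)
        phi -= outer_lr * grad
    theta, _ = follower_steps(theta, phi)
    return theta, phi


theta_s, phi_s = train(stackelberg=True)
theta_n, phi_n = train(stackelberg=False)
print("stackelberg:", round(phi_s, 3), round(leader_loss(theta_s, phi_s), 3))
print("naive:      ", round(phi_n, 3), round(leader_loss(theta_n, phi_n), 3))
```

In this toy problem the Stackelberg-style leader should settle close to the analytic optimum 3/1.1, while the naive leader drifts toward zero and ends with a much larger loss. Whether a comparable gap appears for Bi-NAC is exactly what the promised simultaneous-training ablation would have to show.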

Circularity Check

0 steps flagged

No circularity: bilevel formalization is a modeling choice with independent empirical content

The paper motivates treating feedback as learnable by noting its usefulness depends on the actor's ability to exploit it, then chooses to formalize the coupling as a Stackelberg bilevel program and derives Bi-NAC from that choice. This is a conceptual modeling decision rather than a mathematical derivation that reduces to its own inputs by construction. No equations appear in the abstract or supplied text that equate a 'prediction' to a fitted parameter or rename an input. The reported gains (e.g., 2B Bi-NAC at 46.6% vs 3B GRPO at 41.4% on MATH-500) are presented as experimental outcomes against external baselines, not forced statistical artifacts. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided material. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the bilevel formulation itself is treated as the modeling choice whose validity is unexamined here.

First response ($y_1$) (Incorrect): Here also the solver logs both sides, simplifies to $8\log 2 = x\log 4$, uses $\log 4 = 2\log 2$, and divides to get $x = 8$.

Task-Specific Guidance ($z_l$): Rewrite $4$ as $2^2$ so that both sides share the same base. Apply the power-of-a-power rule to obtain $2^8 = 2^{2x}$. Equate exponents ($8 = 2x$) and solve for $x$.

Improved response ($y_2$) (Correct): By rewriting $2^8 = (2^2)^x$ to get $2^8 = 2^{2x}$, equating exponents gives $8 = 2x$, hence $x = 4$.

MATH Example 3

Prompt: Let $f(n) = \begin{cases} \lfloor n \rfloor & \text{if } n \ge 4, \\ \lceil n \rceil & \text{if } n < 4. \end{cases}$ Find $f\left(\frac{\pi}{3}\right) + f(\sqrt{45}) + f(8^{2/3})$.

Bi-NAC

// Step 1: Evaluate the function f(n) for each given input. For $\frac{\pi}{3}$, since $\frac{\pi}{3} \ge 4$, $f\left(\frac{\pi}{3}\right) = \lfloor \frac{\pi}{3} \rfloor = 1$. For $\sqrt{45}$, since $\sqrt{45} < 4$, $f(\sqrt{45}) = \lceil \sqrt{45} \rceil = 6$. For $8^{2/3}$, since $8^{2/3} \ge 4$, $f(8^{2/3}) = \lfloor 8^{2/3} \rfloor = 2$.
// Step 2: Add the values of f(n) for each input. $f\left(\frac{\pi}{3}\right) + f(\sqrt{45}) + f(8^{2/3}) = 1 + 6 + 2 = 9$. The final answer is: $9$.

Task-Specific Guidance ($z_l$):
// Step 1: Evaluate the function f(n) for each given input. Understand that the function definition depends on whether the input is greater than or less than 4. Use ceiling if the input is $< 4$, and floor if it's $\ge 4$.
// Step 2: Classify the inputs. Check whether each of the values, π/...

Improved response ($y_2$) (Correct):
// Step 1: Evaluate $f\left(\frac{\pi}{3}\right)$. Since $\frac{\pi}{3} \approx 1.047$, we have $f\left(\frac{\pi}{3}\right) = \lfloor \frac{\pi}{3} \rfloor = 1$.
// Step 2: Evaluate $f(\sqrt{45})$. Since $\sqrt{45} \approx 6.708$, we have $f(\sqrt{45}) = \lceil \sqrt{45} \rceil = 7$.
// Step 3: Evaluate $f(8^{2/3})$. Since $8^{2/3} = 4$, we have $f(8^{2/3}) = \lfloor 8^{2/3} \rfloor = 4$.
// Step 4: Calculate the final result. We have $f\left(\frac{\pi}{3}\right) + f(\sqrt{45}) + f(8^{2/3}) = 1 + 7 + 4 = 12$. The final answer is: $12$.

MATH Example 4

Prompt: Let $z = 1 + i$ and $w = \frac{3z + 1}{5z + 7}$. Find $|w|$.

Bi-NAC

First response ($y_1$) (Incorrect):
// Step 1: Calculate the value of $z + w$. We need to calculate the value of $z + w$. We know that $z = 1 + i$ and $w = \frac{3z + 1}{5z + 7}$. Therefore, we can substitute values of $z$ and $w$ into the equation and simplify.
// Step 2: Substitute values of $z$ and $w$ into $z + w = (1 + i) + \frac{3(1 + i) + 1}{5(1 + i) + 7} = (1 + i) + \frac{4 + 3i}{12 + 5i}$
// Step 3: Simplify the exp...

Improved response ($y_2$) (Correct):
// Step 1: Define the variables. Let $z = 1 + i$ and $w = \frac{3z + 1}{5z + 7}$.
// Step 2: Simplify the expression for w. To simplify the expression for w, we need to substitute the value of z into the expression for w.
// Step 3: Substitute the value of z into the expression for w: $w = \frac{3(1 + i) + 1}{5(1 + i) + 7}$
// Step 4: Simplify the numerator ...
The final answer is: $\frac{65}{169}$.

HumanEval Example 1

Prompt:
from typing import List

def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    """ Filter an input list of strings only for ones that start with a given prefix.
    >>> filter_by_prefix([], 'a')
    []
    >>> filter_by_prefix(['abc', 'bcd', 'cde', 'array'], 'a'...

First response ($y_1$) (Incorrect):
from typing import List

def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    """
    Filter an input list of strings only for ones that start with a given prefix.

    Args:
        strings (List[str]): A list of strings to be filtered.
        prefix (str): The prefix to filter by.

    Returns:
        List[str]: A list of strings that start ...

Improved response ($y_2$) (Correct):
from typing import List

def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    """
    Filter an input list of strings only for ones that start with a given prefix.

    Args:
        strings (List[str]): A list of strings to be filtered.
        prefix (str): The prefix to filter by.

    Returns:
        List[str]: A list of strings that start...
    ...
    """
    # Check if the input is a list
    if not isinstance(strings, list):
        raise TypeError(...
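Because both responses above are truncated in this copy, the following is only a minimal sketch of a function that would satisfy the prompt's docstring; it is not the model's actual output. The use of `str.startswith`, the list comprehension, and the error message text are assumptions; only the `isinstance` check is taken from the visible fragment of the improved response.

```python
from typing import List


def filter_by_prefix(strings: List[str], prefix: str) -> List[str]:
    """Return only the strings that start with the given prefix."""
    # Input check mirroring the fragment visible in the improved response.
    # The message text is an assumption; the original is cut off.
    if not isinstance(strings, list):
        raise TypeError("strings must be a list")
    # Assumed filtering logic: keep items whose leading characters match.
    return [s for s in strings if s.startswith(prefix)]


if __name__ == "__main__":
    print(filter_by_prefix([], "a"))
    print(filter_by_prefix(["abc", "bcd", "cde", "array"], "a"))
```

The sketch preserves input order and returns a new list, which is what the two doctest calls in the prompt exercise.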

First response ($y_1$) (Incorrect): "Take smaller lifetime $\tau = 10^{-9}$ s, $\Delta E \approx \hbar/\tau \sim 10^{-9}$ eV."

Task-Specific Guidance ($z_l$): Use the energy-time uncertainty relation $\Delta E\,\Delta t \sim \hbar$ for each state. To distinguish, the energy separation must exceed the larger of the uncertainty widths. Compute $\Delta E$ using $\tau = 10^{-8}$ and $\tau = 10^{-9}$, then identify a feasible resolvable difference.

Improved response ($y_2$) (Correct): $\Delta E \sim \hbar/\tau$. For $\tau = 10^{-9}$ s, $\Delta E \approx 6.6 \times 10^{-7}$ eV; for $\tau = 10^{-8}$ s, $\Delta E \approx 6.6 \times 10^{-8}$ eV. To clearly resolve, energy difference must be larger than both → of order $10^{-4}$ eV (realistic discrete choice). The final answer is: $10^{-4}$ eV.

GPQA Example 2

Prompt: A light beam is propagating through a glass with index of refraction $n$. The glass is movin...

First response ($y_1$) (Incorrect): "Take lab velocity $= v + (1/n)$. So answer $(1/n) + v$."

Task-Specific Guidance ($z_l$): Apply relativistic velocity addition for velocity $u = 1/n$ inside moving medium with speed $v$. The formula is $(u + v)/(1 + uv)$. Simplify to get closed form with $c = 1$.

Improved response ($y_2$) (Correct): Inside glass: $u = 1/n$. Using relativistic addition: $u' = \frac{u + v}{1 + uv} = \frac{1/n + v}{1 + v/n} = \frac{1 + nv}{n + v}$. The final answer is: $\frac{1 + nv}{n + v}$.

GPQA Example 3

Prompt: Find the kinetic energies of the product particles in $\pi^+ \to \mu^+ + \nu$, where $\pi^+$ is stationary. Rest masses: $m_{\pi^+} = 139.6$ MeV, $m_{\mu^+} = 105.7$ MeV.

Bi-NAC

First response ($y_1$) (Incorrect): "Split energy equally, $\sim 70$ MeV each."

Task-Specific Guidance ($z_l$): Apply 2-body decay kinematics for stationary parent:
Step 1: Write $E_\mu = \frac{m_\pi^2 + m_\mu^2}{2 m_\pi}$, and $E_\nu = \frac{m_\pi^2 - m_\mu^2}{2 m_\pi}$.
Step 2: Subtract the muon's rest mass to get its KE.
Step 3: For the massless neutrino, KE equals total energy.
Step 4: Plug in values and compute.

Improved response ($y_2$) (Correct):
// Step 1: Compute muon total energy. $E_\mu = \frac{m_\pi^2 + m_\mu^2}{2 m_\pi} = \frac{(139.6)^2 + (105.7)^2}{2 \times 139.6} \approx 109.8$ MeV.
// Step 2: Subtract rest mass to get muon kinetic energy. $T_\mu = E_\mu - m_\mu = 109.8 - 105.7 \approx 4.1$ MeV.
// Step 3: Compute neutrino energy (massless, so all KE). $E_\nu = \frac{m_\pi^2 - m_\mu^2}{2 m_\pi} = \frac{(139.6)^2 - (105.7)^2}{2 \times 139.6} \approx 29.8$ MeV.
// Step 4: St...
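The arithmetic in this example can be checked with a short script. This is a sketch that assumes units with $c = 1$ and a massless neutrino, as the guidance states; the function name `two_body_decay_energies` and the constant names are hypothetical and do not come from the paper.

```python
def two_body_decay_energies(m_parent: float, m_massive: float) -> tuple:
    """Total energies of the decay products of a parent at rest.

    Assumes one product has mass m_massive and the other is massless,
    in units where c = 1 (masses and energies in MeV).
    """
    e_massive = (m_parent**2 + m_massive**2) / (2 * m_parent)
    e_massless = (m_parent**2 - m_massive**2) / (2 * m_parent)
    return e_massive, e_massless


# Rest masses from the prompt, in MeV.
M_PION = 139.6
M_MUON = 105.7

e_mu, e_nu = two_body_decay_energies(M_PION, M_MUON)
t_mu = e_mu - M_MUON  # muon kinetic energy
t_nu = e_nu           # massless neutrino: kinetic energy equals total energy

if __name__ == "__main__":
    print(f"E_mu = {e_mu:.1f} MeV, T_mu = {t_mu:.1f} MeV, T_nu = {t_nu:.1f} MeV")
```

The two total energies sum to the pion rest mass, which is a quick sanity check on energy conservation. It also shows why the first response's equal split of about 70 MeV each cannot be right.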