RL with Learnable Textual Feedback: A Bilevel Approach
Pith reviewed 2026-06-30 14:32 UTC · model grok-4.3
The pith
Formalizing textual feedback as a learnable output in a Stackelberg bilevel program lets a jointly trained critic improve an actor's reasoning more efficiently than fixed feedback.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Bi-NAC solves a Stackelberg bilevel program in which the critic generates natural-language feedback chosen to maximize the actor's performance improvement when that feedback is supplied in context, while the actor learns to better exploit such feedback; the resulting pair outperforms both standard RL and fixed-critic baselines on MATH-500, MBPP, and GPQA, with a 2B model exceeding a 3B GRPO baseline and a 6B model exceeding a 7B GRPO baseline.
What carries the argument
The Stackelberg bilevel program that couples the critic's generation of reward-improving feedback with the actor's policy update.
If this is right
- The 2B Bi-NAC model reaches 46.6 percent on MATH-500 while the 3B GRPO baseline reaches 41.4 percent.
- The 6B Bi-NAC model reaches 49.3 percent on GPQA while the 7B GRPO baseline reaches 43.6 percent.
- Joint optimization reduces both sample and parameter requirements relative to RL with fixed textual feedback.
- The same bilevel structure applies across math, coding, and general-knowledge reasoning benchmarks.
Where Pith is reading between the lines
- The bilevel framing could extend to other settings where one model generates context for another, such as tool-use or multi-agent planning.
- If the actor-critic coupling is the dominant factor, then scaling the critic independently may yield diminishing returns once the bilevel objective is satisfied.
- The method leaves open whether the learned feedback remains effective when transferred to a different actor architecture or task distribution.
Load-bearing premise
Feedback is useful only to the extent that the current actor policy can exploit it when the feedback is placed in context, so the correct model of their coupling is a Stackelberg bilevel program.
What would settle it
A controlled comparison in which a separately trained critic produces feedback that yields the same actor improvement as the jointly trained Bi-NAC critic on identical tasks and model sizes would falsify the necessity of the bilevel coupling.
read the original abstract
Reinforcement learning with verifiable rewards can improve LLM reasoning, but learning remains sample-inefficient when terminal rewards are sparse. This has motivated a growing line of work on RL with textual feedback, where a critic model generates natural language feedback to guide a reasoning model (the actor), augmenting scalar rewards with richer learning signals. However, existing methods typically treat feedback as fixed or auxiliary, which misses a key property: feedback should not merely be correct, but should improve the policy (actor model) when provided in context. This motivates a paradigm of learnable textual feedback for RL. Yet the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem. We formalize this coupling as a Stackelberg bilevel program and derive Bilevel Natural Language Actor-Critic (Bi-NAC), which jointly trains a critic to generate reward-improving feedback and an actor to exploit it. Across MATH-500, MBPP, and GPQA, Bi-NAC improves sample and parameter efficiency over RL and fixed-critic baselines: our 2B model outperforms the 3B GRPO baseline, achieving 46.6% versus 41.4% on MATH-500, while our 6B model surpasses the 7B GRPO baseline, achieving 49.3% versus 43.6% on GPQA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that RL with learnable textual feedback is inherently a bilevel optimization problem because feedback usefulness depends on the actor policy's ability to exploit it; it formalizes this as a Stackelberg program, derives Bi-NAC to jointly train a critic (leader) that generates reward-improving feedback and an actor (follower) that exploits it, and reports that Bi-NAC yields better sample/parameter efficiency than RL and fixed-critic baselines, with a 2B Bi-NAC model reaching 46.6% vs. 41.4% for a 3B GRPO baseline on MATH-500 and a 6B Bi-NAC model reaching 49.3% vs. 43.6% for a 7B GRPO baseline on GPQA.
Significance. If the bilevel formalization is shown to be necessary and the gains are reproducible with ablations and error bars, the work could offer a structured approach to making textual feedback adaptive in LLM reasoning RL, with notable implications for efficiency when smaller models outperform larger fixed baselines. The empirical claims are the primary strength, but their attribution to the Stackelberg structure rather than joint training in general remains unverified from the supplied abstract.
major comments (2)
- [Abstract] Abstract, paragraph 3: the assertion that 'the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem' is presented without a derivation, comparison to simultaneous gradient methods or shared-loss alternatives, or argument why non-Stackelberg joint objectives would fail to capture the same dependence; this is load-bearing for attributing the reported efficiency gains (e.g., 2B vs 3B on MATH-500) specifically to Bi-NAC rather than extra parameters or alternation.
- [Abstract] Abstract: no description of the bilevel solver, no error bars on the reported accuracies, and no ablation of the Stackelberg assumption or fixed-critic baselines are supplied, preventing verification that the central efficiency claims arise from the proposed program.
minor comments (1)
- [Abstract] Abstract: the benchmarks (MATH-500, MBPP, GPQA) and baselines (GRPO) are named but without reference to their standard definitions or exact evaluation protocols.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment point by point below, clarifying the manuscript's content and committing to revisions that strengthen the presentation of the bilevel motivation and empirical claims.
read point-by-point responses
-
Referee: [Abstract] Abstract, paragraph 3: the assertion that 'the learnability and usefulness of feedback depend on the policy's ability to learn from it, making RL with learnable feedback an inherently bilevel problem' is presented without a derivation, comparison to simultaneous gradient methods or shared-loss alternatives, or argument why non-Stackelberg joint objectives would fail to capture the same dependence; this is load-bearing for attributing the reported efficiency gains (e.g., 2B vs 3B on MATH-500) specifically to Bi-NAC rather than extra parameters or alternation.
Authors: Section 3 of the full manuscript derives the Stackelberg formulation by showing that the critic's objective is to select feedback that maximizes the actor's expected return after the actor optimizes its policy in response; this creates an explicit leader-follower dependence. We agree that the abstract would benefit from a brief reference to this structure and a comparison to simultaneous methods. We will revise the abstract and add a discussion in the introduction explaining why the hierarchical objective better captures the feedback-policy coupling than joint losses. revision: partial
-
Referee: [Abstract] Abstract: no description of the bilevel solver, no error bars on the reported accuracies, and no ablation of the Stackelberg assumption or fixed-critic baselines are supplied, preventing verification that the central efficiency claims arise from the proposed program.
Authors: The bilevel solver (alternating leader-follower updates) is specified in Section 4. Fixed-critic baselines are ablated in Section 5.2 with direct comparisons showing the benefit of learnable feedback. We will add a one-sentence description of the solver to the abstract. To further verify the Stackelberg structure, we will include an ablation against a simultaneous-training variant. Error bars from multiple seeds will be added where feasible or the single-seed protocol will be noted as a limitation. revision: yes
Circularity Check
No circularity: bilevel formalization is a modeling choice with independent empirical content
full rationale
The paper motivates treating feedback as learnable by noting its usefulness depends on the actor's ability to exploit it, then chooses to formalize the coupling as a Stackelberg bilevel program and derives Bi-NAC from that choice. This is a conceptual modeling decision rather than a mathematical derivation that reduces to its own inputs by construction. No equations appear in the abstract or supplied text that equate a 'prediction' to a fitted parameter or rename an input. The reported gains (e.g., 2B Bi-NAC at 46.6% vs 3B GRPO at 41.4% on MATH-500) are presented as experimental outcomes against external baselines, not forced statistical artifacts. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided material. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert- Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...
1901
-
[2]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In D
Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, and N. Chen, editors,Ad- vances in Neural Information Processing Systems, volume 38, pages 57654–57689. C...
2025
-
[4]
Model-agnostic meta-learning for fast adaptation of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Doina Precup and Yee Whye Teh, editors,Proceedings of the 34th International Conference on Machine Learning, volume 70 ofProceedings of Machine Learning Research, pages 1126–
-
[5]
URLhttps://proceedings.mlr.press/v70/finn17a.html
PMLR, 06–11 Aug 2017. URLhttps://proceedings.mlr.press/v70/finn17a.html
2017
-
[6]
Bilevel programming for hyperparameter optimization and meta-learning
Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Jennifer Dy and Andreas Krause, editors,Proceedings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 1568–1577. PMLR, 10–15 ...
2018
-
[7]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...
-
[8]
Training language models to self-correct via reinforcement learning
Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, ShariqIqbal,ColtonBishop,RebeccaRoelofs,LeiMZhang,KayMcKinney,DishaShrivastava,Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust. Training language models to self-correct via reinforcement learning. InThe Thirteenth Internation...
2025
-
[9]
Let's verify step by step
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In B. Kim, Y. Yue, S. Chaudhuri, K. Fragkiadaki, M. Khan, and Y. Sun, ed- itors,International Conference on Learning Representations, volume 2024, pages 39578– 39601, 2024. URL https:/...
2024
-
[10]
Bome! bilevel optimization made easy: A simple first-order approach.Advances in neural information processing systems, 35:17248–17262, 2022
Bo Liu, Mao Ye, Stephen Wright, Peter Stone, and Qiang Liu. Bome! bilevel optimization made easy: A simple first-order approach.Advances in neural information processing systems, 35:17248–17262, 2022
2022
-
[11]
Towards aligning language models with textual feedback
Saüc Abadal Lloret, Shehzaad Dhuliawala, Keerthiram Murugesan, and Mrinmaya Sachan. Towards aligning language models with textual feedback. InICML 2024 Workshop on Models of Human Feedback for AI Alignment, 2024. URLhttps://openreview.net/forum?id=PurTK6zas8
2024
-
[12]
From faithfulness to correctness: Generative reward models that think critically, 2025
Qiyao Ma, Yunsheng Shi, Hongtao Tian, Chao Wang, Weiming Chang, and Ting Yao. From faithfulness to correctness: Generative reward models that think critically, 2025. URLhttps: //arxiv.org/abs/2509.25409. 15 RL with Learnable Textual Feedback
-
[13]
Self-refine: Iterative refine- ment with self-feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refine- ment with self-feedback. In A. Oh, T. Naumann, A. Globerson, K. Saenko...
2023
-
[14]
Self-critiquing models for assisting human evaluators
William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. Self-critiquing models for assisting human evaluators, 2022. URLhttps://arxiv.org/ abs/2206.05802
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[15]
Training language models with language feedback at scale, 2024
Jérémy Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale, 2024. URL https://arxiv.org/abs/2303.16755
-
[16]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y K Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
Reflexion: language agents with verbal reinforcement learning
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: language agents with verbal reinforcement learning. In A. Oh, T. Nau- mann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neu- ral Information Processing Systems, volume 36, pages 8634–8652. Curran Associates, Inc., 2023. URL https://proce...
2023
-
[18]
Siyin Wang, Zhaoye Fei, Qinyuan Cheng, Shiduo Zhang, Panpan Cai, Jinlan Fu, and Xipeng Qiu
Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J. Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback, 2026. URLhttps://arxiv.org/abs/2602.02482
-
[19]
Learning intrinsic rewards as a bi-level optimization problem
Bradly C Stadie, Lunjun Zhang, and Jimmy Ba. Learning intrinsic rewards as a bi-level optimization problem. InProceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124, pages 111–120, 2020
2020
-
[20]
Behavioral cloning from observation
Faraz Torabi, Garrett Warnell, and Peter Stone. Behavioral cloning from observation. InProceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI’18, page 4950–4957. AAAI Press, 2018. ISBN 9780999241127
2018
-
[21]
Solving math word problems with process- and outcome-based feedback
Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback, 2022. URLhttps://arxiv.org/abs/2211.14275
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[22]
Text2grad: Reinforcement learning from natural language feedback, 2026
Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. Text2grad: Reinforcement learning from natural language feedback, 2026. URLhttps://arxiv.org/abs/2505.22338
-
[23]
Reward Modeling from Natural Language Human Feedback
Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun, Yujiu Yang, and Yongbin Li. Reward modeling from natural language human feedback, 2026. URLhttps:// arxiv.org/abs/2601.07349. 16 RL with Learnable Textual Feedback
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[24]
Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Processing Systems, 35:24824–24837, 2022
2022
-
[25]
Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms, 2025. URLhttps://arxiv.org/ abs/2506.14245
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[26]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. 2023. Publisher Copyright:©2023 11th International Conference on Learning Representations, ICLR 2023. All rights reserved.; 11th International Conference on Learning Representations, ICLR 2023 ; Conference...
2023
-
[27]
Improving reward models with synthetic critiques
Zihuiwen Ye, Fraser David Greenlee, Max Bartolo, Phil Blunsom, Jon Ander Campos, and Matthias Gallé. Improving reward models with synthetic critiques. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational Linguistics: NAACL 2025, pages 4506–4520, Albuquerque, New Mexico, April 2025. Association for Computational...
-
[28]
Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022
Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022
2022
-
[29]
Reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment
Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, Yang Katie Zhao, and Mingyi Hong. Reinforcing multi-turn reasoning in LLM agents via turn-level credit assignment. In ICML 2025 Workshop on Computer Use Agents, 2025. URLhttps://openreview.net/forum? id=h83vIG5Hre
2025
-
[30]
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback, 2026. URL https://arxiv.org/abs/2506.03106
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[31]
Final Answer: The final answer is $answer$. I hope it is correct
Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, and Aviral Kumar. ArCHer: Training language model agents via hierarchical multi-turn RL. InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=b6rA0kAHT1. 17 RL with Learnable Textual Feedback Contents 1 Introduction 2 2 Problem Formulation 3 2.1 From Dens...
2024
-
[32]
Task-Specific Guidance (𝑧𝑙)
First response (𝑦1) (Incorrect): Here also the solver logs both sides, simplifies to8 log 2 =𝑥log 4, useslog 4 = 2 log 2, and divides to get 𝑥= 8. Task-Specific Guidance (𝑧𝑙)
-
[33]
Rewrite4as2 2 so that both sides share the same base
-
[34]
Apply the power-of-a-power rule to obtain28 = 22𝑥
-
[35]
Equate exponents (8 = 2𝑥) and solve for𝑥
-
[36]
28 RL with Learnable Textual Feedback MATH Example 3 Prompt:Let 𝑓(𝑛) = {︂⌊𝑛⌋if𝑛≥4, ⌈𝑛⌉if𝑛 <4
Improved response (𝑦2) (Correct): By rewriting2 8 = (22)𝑥 to get2 8 = 22𝑥, equating exponents gives8 = 2𝑥, hence𝑥= 4. 28 RL with Learnable Textual Feedback MATH Example 3 Prompt:Let 𝑓(𝑛) = {︂⌊𝑛⌋if𝑛≥4, ⌈𝑛⌉if𝑛 <4. }︂ . Find𝑓 (︀𝜋 3 )︀ +𝑓( √
-
[37]
Bi-NAC // Step 1: Evaluate the function f(n) for each given input For 𝜋 3, since 𝜋 3 ≥4 , 𝑓 (︀𝜋 3 )︀ =⌊ 𝜋 3 ⌋= 1
+𝑓(8 2/3). Bi-NAC // Step 1: Evaluate the function f(n) for each given input For 𝜋 3, since 𝜋 3 ≥4 , 𝑓 (︀𝜋 3 )︀ =⌊ 𝜋 3 ⌋= 1 . For √ 45, since √ 45<4 , 𝑓( √
-
[38]
For 82/3, since 82/3 ≥4,𝑓(8 2/3) =⌊8 2/3⌋= 2
=⌈ √ 45⌉= 6 . For 82/3, since 82/3 ≥4,𝑓(8 2/3) =⌊8 2/3⌋= 2. // Step 2: Add the values of f(n) for each input𝑓 (︀𝜋 3 )︀ +𝑓( √
-
[39]
+𝑓(8 2/3) = 1 + 6 + 2 = 9 The final answer is:9 . Task-Specific Guidance (𝑧𝑙) // Step 1: Evaluate the function f(n) for each given input Understand that the function definition depends on whether the input is greater than or less than 4. Use ceiling if the input is<4, and floor if it’s≥4. // Step 2: Classify the inputs Check whether each of the values —𝜋/...
-
[40]
Since 𝜋 3 ≈1.047, we have𝑓 (︀𝜋 3 )︀ =⌊ 𝜋 3 ⌋= 1
Improved response (𝑦2) (Correct): // Step 1: Evaluate𝑓 (︀𝜋 3 )︀ . Since 𝜋 3 ≈1.047, we have𝑓 (︀𝜋 3 )︀ =⌊ 𝜋 3 ⌋= 1. // Step 2: Evaluate𝑓( √ 45). Since √ 45≈6.708, we have𝑓( √
-
[41]
// Step 3: Evaluate𝑓(82/3)
=⌈ √ 45⌉= 7. // Step 3: Evaluate𝑓(82/3). Since8 2/3 = 4, we have𝑓(82/3) =⌊8 2/3⌋= 4. // Step 4: Calculate the final result We have𝑓 (︀𝜋 3 )︀ +𝑓( √
-
[42]
The final answer is:12
+𝑓(8 2/3) = 1 + 7 + 4 = 12. The final answer is:12 . 29 RL with Learnable Textual Feedback MATH Example 4 Prompt:Let𝑧= 1 +𝑖and𝑤= 3𝑧+ 1 5𝑧+ 7 . Find|𝑤|. Bi-NAC
-
[43]
We need to calculate the value of𝑧+𝑤
First response (𝑦1) (Incorrect): // Step 1: Calculate the value of𝑧+𝑤 . We need to calculate the value of𝑧+𝑤 . We know that𝑧= 1 +𝑖 and𝑤= 3𝑧+ 1 5𝑧+ 7 . Therefore, we can substitute values of𝑧and𝑤into the equation and simplify. // Step 2: Substitute values of𝑧and𝑤into𝑧+𝑤= (1 +𝑖) + 3(1 +𝑖) + 1 5(1 +𝑖) + 7 = (1 +𝑖) + 4 + 3𝑖 12 + 5𝑖 // Step 3: Simplify the exp...
-
[44]
// Step 2: Simplify the expression for w To simplify the expression for w, we need to substitute the value of z into the expression for w
Improved response (𝑦2) (Correct): // Step 1: Define the variables Let𝑧= 1 +𝑖and𝑤= 3𝑧+ 1 5𝑧+ 7 . // Step 2: Simplify the expression for w To simplify the expression for w, we need to substitute the value of z into the expression for w. // Step 3: Substitute the value of z into the expression for w𝑤=3(1 +𝑖) + 1 5(1 +𝑖) + 7 // Step 4: Simplify the numerator ...
-
[45]
The final answer is: 65 169 . 30 RL with Learnable Textual Feedback HumanEval Example 1 Prompt: from typing import List def filter_by_prefix(strings: List[str], prefix: str) -> List[str]: """ Filter an input list of strings only for ones that start with a given prefix. »> filter_by_prefix([], ’a’) [] »> filter_by_prefix([’abc’, ’bcd’, ’cde’, ’array’], ’a’...
-
[46]
Args: strings (List[str]): A list of strings to be filtered
First response (𝑦_1) (Incorrect): from typing import List def filter_by_prefix(strings: List[str], prefix: str) -> List[str]: """ Filter an input list of strings only for ones that start with a given prefix. Args: strings (List[str]): A list of strings to be filtered. prefix (str): The prefix to filter by. Returns: List[str]: A list of strings that start ...
-
[47]
"" # Check if the input is a list if not isinstance(strings, list): raise TypeError(
Improved response (𝑦_2) (Correct): from typing import List def filter_by_prefix(strings: List[str], prefix: str) -> List[str]: """ Filter an input list of strings only for ones that start with a given prefix. Args: strings (List[str]): A list of strings to be filtered. prefix (str): The prefix to filter by. Returns: List[str]: A list of strings that start...
-
[48]
Take smaller lifetime𝜏= 10−9 s,Δ𝐸≈ℏ/𝜏∼10 −9 eV
First response (𝑦1) (Incorrect): “Take smaller lifetime𝜏= 10−9 s,Δ𝐸≈ℏ/𝜏∼10 −9 eV.” Task-Specific Guidance (𝑧𝑙) Use the energy–time uncertainty relationΔ𝐸Δ𝑡∼ℏ for each state. To distinguish, the energy separation must exceed the larger of the uncertainty widths. ComputeΔ𝐸 using 𝜏= 10 −8 and 𝜏= 10 −9, then identify a feasible resolvable difference
-
[49]
For 𝜏= 10 −9 s, Δ𝐸≈6.6×10 −7 eV; for𝜏= 10 −8 s, Δ𝐸≈6.6×10 −8 eV
Improved response (𝑦2) (Correct): Δ𝐸∼ℏ/𝜏 . For 𝜏= 10 −9 s, Δ𝐸≈6.6×10 −7 eV; for𝜏= 10 −8 s, Δ𝐸≈6.6×10 −8 eV. To clearly resolve, energy difference must be larger than both→of order10−4 eV (realistic discrete choice). The final answer is:10−4 eV . GPQA Example 2 Prompt:A light beam is propagating through a glass with index of refraction𝑛. The glass is movin...
-
[50]
Take lab velocity=𝑣+ (1/𝑛). So answer(1/𝑛) +𝑣
First response (𝑦1) (Incorrect): “Take lab velocity=𝑣+ (1/𝑛). So answer(1/𝑛) +𝑣.” Task-Specific Guidance (𝑧𝑙) Apply relativistic velocity addition for velocity𝑢= 1/𝑛 inside moving medium with speed𝑣. The formula is(𝑢+𝑣)/(1 +𝑢𝑣). Simplify to get closed form with𝑐= 1
-
[51]
Using relativistic addition: 𝑢′ = 𝑢+𝑣 1 +𝑢𝑣 = 1/𝑛+𝑣 1 +𝑣/𝑛 = 1 +𝑛𝑣 𝑛+𝑣
Improved response (𝑦2) (Correct): Inside glass:𝑢= 1/𝑛. Using relativistic addition: 𝑢′ = 𝑢+𝑣 1 +𝑢𝑣 = 1/𝑛+𝑣 1 +𝑣/𝑛 = 1 +𝑛𝑣 𝑛+𝑣 . The final answer is: 1+𝑛𝑣 𝑛+𝑣 . 32 RL with Learnable Textual Feedback GPQA Example 3 Prompt:Find the kinetic energies of the product particles in𝜋+ →𝜇 + +𝜈 , where𝜋+ is stationary. Rest masses:𝑚 𝜋+ = 139.6MeV,𝑚 𝜇+ = 105.7MeV. Bi-NAC
-
[52]
Split energy equally,∼70MeV each
First response (𝑦1) (Incorrect): “Split energy equally,∼70MeV each.” Task-Specific Guidance (𝑧𝑙) Apply 2-body decay kinematics for stationary parent: Step 1: Write𝐸𝜇 = 𝑚2 𝜋+𝑚2 𝜇 2𝑚𝜋 , and 𝐸𝜈 = 𝑚2 𝜋−𝑚2 𝜇 2𝑚𝜋 . Step 2: Subtract the muon’s rest mass to get its KE. Step 3: For the massless neutrino, KE equals total energy. Step 4: Plug in values and compute
-
[53]
// Step 2: Subtract rest mass to get muon kinetic energy.𝑇𝜇 =𝐸 𝜇 −𝑚 𝜇 = 109.8−105.7≈4.1MeV
Improved response (𝑦2) (Correct): // Step 1: Compute muon total energy.𝐸𝜇 = 𝑚2 𝜋 +𝑚 2 𝜇 2𝑚𝜋 = (139.6)2 + (105.7)2 2×139.6 ≈109.8MeV. // Step 2: Subtract rest mass to get muon kinetic energy.𝑇𝜇 =𝐸 𝜇 −𝑚 𝜇 = 109.8−105.7≈4.1MeV. // Step 3: Compute neutrino energy (massless, so all KE).𝐸𝜈 = 𝑚2 𝜋 −𝑚 2 𝜇 2𝑚𝜋 = (139.6)2 −(105.7) 2 2×139.6 ≈ 29.8MeV. // Step 4: St...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.