pith. machine review for the scientific record.

arxiv: 2604.09741 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords black-box LLMs · guide models · steering · mathematical reasoning · code generation · acceptance sampling · reinforcement learning · executability

The pith

Training a guide model to produce executable strategies lets cheaper black-box LLMs match or beat larger ones on math and code while lowering costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies Guide-Core Policies in which a smaller guide model generates structured strategies that a black-box core LLM then executes. It claims that overall performance is controlled by guide-averaged executability, the probability that the core can faithfully carry out the guide's output, and that prior training approaches fail to optimize this under real cost constraints. ExecTune addresses the gap with a sequence of teacher-guided acceptance sampling to ensure valid outputs, supervised fine-tuning, and structure-aware reinforcement learning that jointly targets syntactic correctness, execution success, and efficiency. The result is reported gains of up to 9.2 percent accuracy and 22.4 percent cost reduction, including cases where a smaller model surpasses a larger one. A reader would care because the method amortizes expensive reasoning steps into reusable guides, directly addressing the recurring API costs that dominate LLM deployment.

Core claim

In Guide-Core Policies a guide model produces a structured strategy that is executed by a black-box core model; end-to-end utility under a cost-sensitive objective is governed by guide-averaged executability, the probability that the core can faithfully realize the generated strategy. Existing instantiations often produce brittle strategies because they do not optimize executability under deployment constraints. ExecTune corrects this by combining teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly maximize syntactic validity, execution success, and cost efficiency, producing the stated accuracy and cost improvements across mathematical-reasoning and code-generation benchmarks.

What carries the argument

Guide-averaged executability: the probability that a strategy generated by the guide model can be faithfully executed by the core model, which directly determines the cost-sensitive utility of the overall policy.
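The dependence on executability can be made concrete with a toy Monte-Carlo estimate. Everything below — the guide, the core, the λ cost penalty — is an illustrative stand-in, not the paper's actual objective or models:

```python
import random

def estimate_executability(guide_sample, core_executes, task, n_strategies=200):
    """Monte-Carlo estimate of guide-averaged executability for one task:
    the fraction of guide-sampled strategies the core executes faithfully."""
    successes = sum(
        core_executes(task, guide_sample(task)) for _ in range(n_strategies)
    )
    return successes / n_strategies

def cost_sensitive_utility(accuracy, cost, lam=0.01):
    """Toy cost-sensitive objective: reward accuracy, penalize inference cost."""
    return accuracy - lam * cost

# Hypothetical stand-ins: a "strategy" is just a quality score in [0, 1],
# and the core's chance of faithful execution grows with that quality.
random.seed(0)
guide = lambda task: random.random()
core = lambda task, strategy: random.random() < strategy
p_exec = estimate_executability(guide, core, task="gsm8k-item")
```

Under this toy model, raising the average quality of guide outputs directly raises `p_exec`, and through it the cost-sensitive utility — which is the sense in which executability "governs" the policy.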

If this is right

  • GCoP with ExecTune improves accuracy by up to 9.2 percent over prior baselines on mathematical reasoning and code-generation benchmarks.
  • Inference cost drops by up to 22.4 percent while accuracy holds or rises.
  • A smaller core model such as Claude Haiku 3.5 can outperform a larger Sonnet 3.5 on both math and code tasks.
  • The same setup reaches within 1.7 percent absolute accuracy of Sonnet 4 at 38 percent lower cost.
  • Only the guide needs retraining when requirements change; the core model remains untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guide-training loop could be applied to other agentic patterns such as tool-use planning or multi-step reasoning where execution reliability is the bottleneck.
  • Because the core stays frozen, organizations can maintain a single expensive core while rapidly iterating on lightweight guides for different domains or cost targets.
  • If executability optimization scales, future systems might shift compute budgets away from ever-larger core models and toward reusable strategy generators.
  • Dynamic selection among several trained guides at inference time could further tune the accuracy-cost frontier without additional core calls.
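One way the dynamic-selection idea in the last bullet might look in code, with purely hypothetical accuracy/cost numbers and a simple linear utility that the paper does not specify:

```python
def select_guide(guides, lam=0.01):
    """Pick the guide with the best estimated accuracy-minus-cost utility.
    `guides` maps a guide name to (estimated_accuracy, estimated_cost);
    all numbers here are invented for illustration."""
    return max(guides, key=lambda g: guides[g][0] - lam * guides[g][1])

guides = {
    "math-guide":  (0.82, 3.0),   # more accurate, pricier strategies
    "cheap-guide": (0.78, 1.0),   # slightly worse, much cheaper
}
best = select_guide(guides, lam=0.01)
```

Sweeping `lam` traces out an accuracy–cost frontier: a small penalty favors the accurate guide, a large one the cheap guide, all without any extra core calls.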

Load-bearing premise

That the combination of acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning can reliably raise the probability that the core model will execute the guide's strategies without any access to the core model's internal parameters or gradients.
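A minimal sketch of what a teacher-guided acceptance-sampling loop could look like; `propose` and `accepts` are hypothetical stand-ins for the guide's sampler and the teacher's executability check, not the paper's implementation:

```python
import random

def acceptance_sample(propose, accepts, max_tries=10):
    """Keep proposing strategies until the teacher accepts one (i.e., it is
    judged executable by the core), or give up after max_tries."""
    for _ in range(max_tries):
        strategy = propose()
        if accepts(strategy):
            return strategy  # would become an SFT training example
    return None  # discarded: no executable strategy found for this item

# Toy stand-ins: a "strategy" is just its quality score in [0, 1], and the
# teacher accepts only sufficiently strong ones.
random.seed(1)
propose = lambda: random.random()
accepts = lambda s: s > 0.7
kept = [s for s in (acceptance_sample(propose, accepts)
                    for _ in range(50)) if s is not None]
```

The point of the premise is that this filtering, plus SFT on the accepted strategies and RL on downstream reward, needs only black-box accept/reject signals — never the core's gradients.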

What would settle it

An experiment that measures guide-averaged executability before and after ExecTune training on held-out math or code tasks and finds no statistically significant increase, or finds that the accuracy and cost gains disappear when the same guides are paired with different core models.
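That falsification test could be sketched, under heavy simplification, as a before/after comparison of mean executability on held-out tasks; all callables below are toy stand-ins for what would really be API calls:

```python
import random

def executability_delta(tasks, guide_before, guide_after, core, trials=50):
    """Mean executability of the trained guide minus the untrained guide,
    measured on held-out tasks against a fixed core model."""
    def mean_exec(guide):
        hits = sum(core(t, guide(t)) for t in tasks for _ in range(trials))
        return hits / (len(tasks) * trials)
    return mean_exec(guide_after) - mean_exec(guide_before)

# Toy stand-ins: strategies carry a success probability the core samples from.
random.seed(2)
tasks = list(range(20))
before = lambda t: 0.4   # pre-training guide emits weaker strategies
after = lambda t: 0.7    # post-ExecTune guide emits stronger ones
core = lambda t, s: random.random() < s
delta = executability_delta(tasks, before, after, core)
```

A delta near zero after ExecTune training, or a delta that vanishes when `core` is swapped for a different model, would be the disconfirming outcome the paragraph above describes.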

Figures

Figures reproduced from arXiv: 2604.09741 by Aditya Golatkar, Alessandro Achille, Anoop Deoras, Anwesan Pal, Ben Vo, Jun Huan, Narayanan Sadagopan, Stefano Soatto, Vijay Lingam.

Figure 1. Reward–cost trade-off. Test performance versus total inference cost on GSM8K (left; core=Haiku-3.5) and KodCode (right; core=Haiku-3.0). A small ExecTune guide (Qwen3-1.7B) yields the best accuracy–cost trade-off among GCoP variants, outperforming prompting/ICL and GCoP(Base/SFT/Advisor) while remaining far cheaper than frontier baselines. GCoP(ExecTune) matches or exceeds Sonnet-3.5 and approaches Sonnet-…
Figure 2. SFT dataset curation pipeline. A strong model (LLM-1; e.g., Claude Sonnet 4.5) extracts a …
Figure 3. Effect of iterative strategy refinement during acceptance sampling: additional refinement iterations increase the probability that the target core solves the problem when conditioned on the proposed strategy.
Original abstract

For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a framework called Guide-Core Policies (GCoP) for steering black-box large language models using a guide model that generates structured strategies executed by the core model. It formalizes this under a cost-sensitive utility objective and identifies guide-averaged executability as the key determinant of end-to-end performance. The authors propose ExecTune, a training recipe combining teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to optimize for syntactic validity, execution success, and cost efficiency. Experiments on mathematical reasoning and code-generation tasks show that this approach yields accuracy improvements of up to 9.2% and inference cost reductions of up to 22.4% over prior state-of-the-art, enabling smaller models to match or exceed larger ones at lower cost while supporting modular updates to the guide.

Significance. If the results are robust and the contributions of each stage are clearly delineated, this paper could be significant for the development of efficient, cost-effective agentic LLM systems. The formal analysis of GCoP provides a useful abstraction that subsumes various existing approaches and highlights why optimizing executability matters. The practical demonstration of cost savings and performance gains without access to core model internals or gradients is valuable for real-world API-based deployments. The modular adaptation aspect is a notable strength.

major comments (1)
  1. [§4 Experiments] The central claim that ExecTune's three-stage recipe (acceptance sampling + SFT + structure-aware RL) reliably optimizes guide-averaged executability under black-box constraints lacks direct supporting evidence in the form of an ablation study. The manuscript does not show whether the RL stage produces a statistically detectable lift in executability or performance metrics over the acceptance-sampling and SFT phases alone (e.g., via a table comparing variants with variance estimates or significance tests). This is load-bearing because the reported 9.2% accuracy and 22.4% cost gains are attributed to end-to-end optimization, yet RL relies on noisy Monte-Carlo estimates without core gradients; if gains are driven primarily by earlier stages, the attribution to the full GCoP analysis is undermined. (See §4 Experiments and any associated ablation subsection or table comparing training stages.)
minor comments (1)
  1. [Abstract] The abstract reports maximum gains of 'up to 9.2%' accuracy and 'up to 22.4%' cost reduction without identifying the benchmark, baseline method, or model pair that achieves each figure; naming them would help readers assess the scope of the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the GCoP framework and ExecTune recipe for efficient agentic systems. We address the major comment on the need for ablation studies below and commit to revisions that strengthen the empirical support for the three-stage training process.

Point-by-point responses
  1. Referee: [§4 Experiments] The central claim that ExecTune's three-stage recipe (acceptance sampling + SFT + structure-aware RL) reliably optimizes guide-averaged executability under black-box constraints lacks direct supporting evidence in the form of an ablation study. The manuscript does not show whether the RL stage produces a statistically detectable lift in executability or performance metrics over the acceptance-sampling and SFT phases alone (e.g., via a table comparing variants with variance estimates or significance tests). This is load-bearing because the reported 9.2% accuracy and 22.4% cost gains are attributed to end-to-end optimization, yet RL relies on noisy Monte-Carlo estimates without core gradients; if gains are driven primarily by earlier stages, the attribution to the full GCoP analysis is undermined. (See §4 Experiments and any associated ablation subsection or Table)

    Authors: We agree that the absence of a dedicated ablation study isolating the incremental contribution of the structure-aware RL stage represents a gap in the current manuscript. While the full ExecTune pipeline (acceptance sampling + SFT + RL) is evaluated end-to-end against baselines, direct comparisons of intermediate training stages with variance estimates and significance testing are not provided. This limits the strength of attribution to the complete recipe under the GCoP cost-sensitive utility analysis. In the revised manuscript, we will add a new ablation subsection and table in §4 that reports accuracy, cost, and guide-averaged executability for three variants: (i) teacher-guided acceptance sampling alone, (ii) acceptance sampling followed by supervised fine-tuning, and (iii) the full pipeline including structure-aware RL. Results will include means and standard deviations across multiple random seeds, along with paired statistical significance tests (e.g., t-tests) to assess whether the RL stage yields a detectable improvement. This addition will directly address the concern about noisy Monte-Carlo estimates and clarify the role of each stage in optimizing executability without core-model gradients. revision: yes
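The promised analysis amounts to a paired test across seeds; a minimal standard-library sketch with invented per-seed numbers (not results from the paper):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic over per-seed scores of two training variants."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Invented per-seed accuracies, purely to illustrate the proposed test.
sft_only = [0.71, 0.69, 0.72, 0.70, 0.71]  # acceptance sampling + SFT
full_rl  = [0.74, 0.73, 0.75, 0.72, 0.74]  # + structure-aware RL stage
t = paired_t(full_rl, sft_only)  # a large positive t would support an RL lift
```

Pairing by seed controls for run-to-run variance, which matters precisely because the RL stage's reward estimates are noisy Monte-Carlo quantities.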

Circularity Check

1 steps flagged

The claim that GCoP performance is governed by executability reduces to a definitional consequence of the introduced cost-sensitive utility objective

specific steps
  1. self definitional [Abstract]
    "We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core."

    The utility objective is the formalization of GCoP; executability is introduced as its central probabilistic term. The 'show that performance is governed by' statement is therefore a direct restatement of the objective's construction rather than a derived result from independent premises or external data.

full rationale

The paper's central analytical claim—that end-to-end performance is governed by guide-averaged executability—follows directly from formalizing GCoP under a utility objective whose terms explicitly incorporate executability (as the probability of faithful execution) and cost. This is a self-definitional step rather than an independent derivation. However, the subsequent ExecTune recipe (acceptance sampling + SFT + structure-aware RL), the black-box empirical benchmarks, and the reported accuracy/cost deltas are measured on external tasks and do not reduce to the same definitional move, keeping overall circularity moderate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.0 · 5625 in / 1210 out tokens · 38513 ms · 2026-05-10T16:50:45.220910+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 24 canonical work pages · 10 internal anchors

  1. [1]

    AI Agents as Universal Task Solvers: It's All About Time

    Alessandro Achille and Stefano Soatto. AI agents as universal task solvers. Entropy (also arXiv:2510.12066), 2026

  2. [2]

    PromptWizard: Task-Aware Prompt Optimization Framework

    Eshaan Agarwal, Joykirat Singh, Vivek Dani, Raghav Magazine, Tanuja Ganu, and Akshay Nambi. Promptwizard: Task-aware prompt optimization framework. arXiv preprint arXiv:2405.18369, 2024

  3. [3]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025

  4. [4]

    How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

    Parth Asawa, Alan Zhu, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box llms with advisor models. arXiv preprint arXiv:2510.02453, 2025

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  6. [6]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  7. [7]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

  8. [8]

    Black-box prompt optimization: Aligning large language models without model training

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. Black-box prompt optimization: Aligning large language models without model training. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3201--3219, 2024

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Effectiveness of chain-of-thought in distilling reasoning capability from large language models

    Cong Thanh Do, Rama Sanand Doddipatla, and Kate Knill. Effectiveness of chain-of-thought in distilling reasoning capability from large language models. In Proceedings of the 18th International Natural Language Generation Conference, pp. 833--845, 2025

  11. [11]

    Murphy: Reflective multi-turn reinforcement learning for self-correcting code generation in large language

    Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar Tehrani, Jun Huan, sujay sanghavi, Anoop Deoras, and Stefano Soatto. Murphy: Reflective multi-turn reinforcement learning for self-correcting code generation in large language. In First Workshop on Foundations of Reasoning in Language Models, 2025. URL https://openreview.net/forum?id=x0Ir7cWEiA

  12. [12]

    DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

    Alisa Liu et al. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6691--6706, Online, August 2021. Association for Computational Linguistics

  13. [13]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  14. [14]

    Evaluating Large Language Models Trained on Code

    Mark Chen et al. Evaluating large language models trained on code, 2021 b . URL https://arxiv.org/abs/2107.03374

  15. [15]

    Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  16. [16]

    MiniLLM: Knowledge Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023

  17. [17]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003--8017, 2023

  18. [18]

    Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

    Sohely Jahan and Ruimin Sun. Black-box behavioral distillation breaks safety alignment in medical llms. arXiv preprint arXiv:2512.09403, 2025

  19. [19]

    e1: Learning Adaptive Control of Reasoning Effort

    Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, and Stefano Soatto. e1: Learning adaptive control of reasoning effort. NeurIPS Workshop on Efficient Reasoning (also arXiv:2510.27042), 2025

  20. [20]

    Agreement-Based Cascading for Efficient Inference

    Steven Kolawole, Don Dennis, Ameet Talwalkar, and Virginia Smith. Agreement-based cascading for efficient inference. arXiv preprint arXiv:2407.02348, 2024

  21. [21]

    Matryoshka pilot: Learning to drive black-box llms with llms

    ChangHao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, and Bo Dai. Matryoshka pilot: Learning to drive black-box llms with llms. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  22. [22]

    Direct preference knowledge distillation for large language models

    Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, and Furu Wei. Direct preference knowledge distillation for large language models. arXiv preprint arXiv:2406.19774, 2024

  23. [23]

    Guiding large language models via directional stimulus prompting

    Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 62630--62656, 2023

  24. [24]

    Enhancing language model agents using diversity of thoughts

    Vijay Lingam, Behrooz Omidvar Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, and Anoop Deoras. Enhancing language model agents using diversity of thoughts. In The 13th International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ZsP3YbYeE9

  25. [25]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534--46594, 2023

  26. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539--68551, 2023

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  29. [29]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222--4235, Online, November 2020. Association for Computational Linguistics

  30. [30]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl, 2020

  31. [31]

    KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6980--7008, Vienna, Austria, July 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-acl.365. URL https://openreview.net/forum?id=Pnk7vMbznK

  32. [32]

    Fudge: Controlled text generation with future discriminators

    Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3511--3535, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.276

  33. [33]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  34. [34]

    Re-forc: Adaptive reward prediction for efficient chain-of-thought reasoning

    Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, and Stefano Soatto. Re-forc: Adaptive reward prediction for efficient chain-of-thought reasoning. NeurIPS Workshop on Efficient Reasoning (also arXiv:2511.02130), 2025

  35. [35]

    Automatic Chain of Thought Prompting in Large Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  36. [36]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022
