pith. machine review for the scientific record.

arxiv: 2604.09741 · v1 · submitted 2026-04-09 · 💻 cs.LG · cs.AI

Recognition: unknown

ExecTune: Effective Steering of Black-Box LLMs with Guide Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords black-box LLMs · guide models · steering · mathematical reasoning · code generation · acceptance sampling · reinforcement learning · executability

The pith

Training a guide model to produce executable strategies lets cheaper black-box LLMs match or beat larger ones on math and code while lowering costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies Guide-Core Policies in which a smaller guide model generates structured strategies that a black-box core LLM then executes. It claims that overall performance is controlled by guide-averaged executability, the probability that the core can faithfully carry out the guide's output, and that prior training approaches fail to optimize this under real cost constraints. ExecTune addresses the gap with a sequence of teacher-guided acceptance sampling to ensure valid outputs, supervised fine-tuning, and structure-aware reinforcement learning that jointly targets syntactic correctness, execution success, and efficiency. The result is reported gains of up to 9.2 percent accuracy and 22.4 percent cost reduction, including cases where a smaller model surpasses a larger one. A reader would care because the method amortizes expensive reasoning steps into reusable guides, directly addressing the recurring API costs that dominate LLM deployment.

Core claim

In Guide-Core Policies a guide model produces a structured strategy that is executed by a black-box core model; end-to-end utility under a cost-sensitive objective is governed by guide-averaged executability, the probability that the core can faithfully realize the generated strategy. Existing instantiations often produce brittle strategies because they do not optimize executability under deployment constraints. ExecTune corrects this by combining teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly maximize syntactic validity, execution success, and cost efficiency, producing the stated accuracy and cost improvements across mathematical-reasoning and code-generation benchmarks.

What carries the argument

Guide-averaged executability: the probability that a strategy generated by the guide model can be faithfully executed by the core model, which directly determines the cost-sensitive utility of the overall policy.
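The dependence on executability can be made concrete with a toy Monte-Carlo estimate. Everything below — the guide, the core, the λ cost penalty — is an illustrative stand-in, not the paper's actual objective or models:

```python
import random

def estimate_executability(guide_sample, core_executes, task, n_strategies=200):
    """Monte-Carlo estimate of guide-averaged executability for one task:
    the fraction of guide-sampled strategies the core executes faithfully."""
    successes = sum(
        core_executes(task, guide_sample(task)) for _ in range(n_strategies)
    )
    return successes / n_strategies

def cost_sensitive_utility(accuracy, cost, lam=0.01):
    """Toy cost-sensitive objective: reward accuracy, penalize inference cost."""
    return accuracy - lam * cost

# Hypothetical stand-ins: a "strategy" is just a quality score in [0, 1],
# and the core's chance of faithful execution grows with that quality.
random.seed(0)
guide = lambda task: random.random()
core = lambda task, strategy: random.random() < strategy
p_exec = estimate_executability(guide, core, task="gsm8k-item")
```

Under this toy model, raising the average quality of guide outputs directly raises `p_exec`, and through it the cost-sensitive utility — which is the sense in which executability "governs" the policy.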

If this is right

  • GCoP with ExecTune improves accuracy by up to 9.2 percent over prior baselines on mathematical reasoning and code-generation benchmarks.
  • Inference cost drops by up to 22.4 percent while accuracy holds or rises.
  • A smaller core model such as Claude Haiku 3.5 can outperform a larger Sonnet 3.5 on both math and code tasks.
  • The same setup reaches within 1.7 percent absolute accuracy of Sonnet 4 at 38 percent lower cost.
  • Only the guide needs retraining when requirements change; the core model remains untouched.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guide-training loop could be applied to other agentic patterns such as tool-use planning or multi-step reasoning where execution reliability is the bottleneck.
  • Because the core stays frozen, organizations can maintain a single expensive core while rapidly iterating on lightweight guides for different domains or cost targets.
  • If executability optimization scales, future systems might shift compute budgets away from ever-larger core models and toward reusable strategy generators.
  • Dynamic selection among several trained guides at inference time could further tune the accuracy-cost frontier without additional core calls.
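One way the dynamic-selection idea in the last bullet might look in code, with purely hypothetical accuracy/cost numbers and a simple linear utility that the paper does not specify:

```python
def select_guide(guides, lam=0.01):
    """Pick the guide with the best estimated accuracy-minus-cost utility.
    `guides` maps a guide name to (estimated_accuracy, estimated_cost);
    all numbers here are invented for illustration."""
    return max(guides, key=lambda g: guides[g][0] - lam * guides[g][1])

guides = {
    "math-guide":  (0.82, 3.0),   # more accurate, pricier strategies
    "cheap-guide": (0.78, 1.0),   # slightly worse, much cheaper
}
best = select_guide(guides, lam=0.01)
```

Sweeping `lam` traces out an accuracy–cost frontier: a small penalty favors the accurate guide, a large one the cheap guide, all without any extra core calls.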

Load-bearing premise

That the combination of acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning can reliably raise the probability that the core model will execute the guide's strategies without any access to the core model's internal parameters or gradients.
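A minimal sketch of what a teacher-guided acceptance-sampling loop could look like; `propose` and `accepts` are hypothetical stand-ins for the guide's sampler and the teacher's executability check, not the paper's implementation:

```python
import random

def acceptance_sample(propose, accepts, max_tries=10):
    """Keep proposing strategies until the teacher accepts one (i.e., it is
    judged executable by the core), or give up after max_tries."""
    for _ in range(max_tries):
        strategy = propose()
        if accepts(strategy):
            return strategy  # would become an SFT training example
    return None  # discarded: no executable strategy found for this item

# Toy stand-ins: a "strategy" is just its quality score in [0, 1], and the
# teacher accepts only sufficiently strong ones.
random.seed(1)
propose = lambda: random.random()
accepts = lambda s: s > 0.7
kept = [s for s in (acceptance_sample(propose, accepts)
                    for _ in range(50)) if s is not None]
```

The point of the premise is that this filtering, plus SFT on the accepted strategies and RL on downstream reward, needs only black-box accept/reject signals — never the core's gradients.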

What would settle it

An experiment that measures guide-averaged executability before and after ExecTune training on held-out math or code tasks and finds no statistically significant increase, or finds that the accuracy and cost gains disappear when the same guides are paired with different core models.
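That falsification test could be sketched, under heavy simplification, as a before/after comparison of mean executability on held-out tasks; all callables below are toy stand-ins for what would really be API calls:

```python
import random

def executability_delta(tasks, guide_before, guide_after, core, trials=50):
    """Mean executability of the trained guide minus the untrained guide,
    measured on held-out tasks against a fixed core model."""
    def mean_exec(guide):
        hits = sum(core(t, guide(t)) for t in tasks for _ in range(trials))
        return hits / (len(tasks) * trials)
    return mean_exec(guide_after) - mean_exec(guide_before)

# Toy stand-ins: strategies carry a success probability the core samples from.
random.seed(2)
tasks = list(range(20))
before = lambda t: 0.4   # pre-training guide emits weaker strategies
after = lambda t: 0.7    # post-ExecTune guide emits stronger ones
core = lambda t, s: random.random() < s
delta = executability_delta(tasks, before, after, core)
```

A delta near zero after ExecTune training, or a delta that vanishes when `core` is swapped for a different model, would be the disconfirming outcome the paragraph above describes.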

Figures

Figures reproduced from arXiv: 2604.09741 by Aditya Golatkar, Alessandro Achille, Anoop Deoras, Anwesan Pal, Ben Vo, Jun Huan, Narayanan Sadagopan, Stefano Soatto, Vijay Lingam.

Figure 1. Reward–cost trade-off. Test performance versus total inference cost on GSM8K (left; core=Haiku-3.5) and KodCode (right; core=Haiku-3.0). A small ExecTune guide (Qwen3-1.7B) yields the best accuracy–cost trade-off among GCoP variants, outperforming prompting/ICL and GCoP(Base/SFT/Advisor) while remaining far cheaper than frontier baselines. GCoP(ExecTune) matches or exceeds Sonnet-3.5 and approaches Sonnet-…
Figure 2. SFT dataset curation pipeline. A strong model (LLM-1; e.g., Claude Sonnet 4.5) extracts a …
Figure 3. Effect of iterative strategy refinement during acceptance sampling: additional refinement iterations increase the probability that the target core solves the problem when conditioned on the proposed strategy.
Original abstract

For large language models deployed through black-box APIs, recurring inference costs often exceed one-time training costs. This motivates composed agentic systems that amortize expensive reasoning into reusable intermediate representations. We study a broad class of such systems, termed Guide-Core Policies (GCoP), in which a guide model generates a structured strategy that is executed by a black-box core model. This abstraction subsumes base, supervised, and advisor-style approaches, which differ primarily in how the guide is trained. We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core. Our analysis shows that existing GCoP instantiations often fail to optimize executability under deployment constraints, resulting in brittle strategies and inefficient computation. Motivated by these insights, we propose ExecTune, a principled training recipe that combines teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to directly optimize syntactic validity, execution success, and cost efficiency. Across mathematical reasoning and code-generation benchmarks, GCoP with ExecTune improves accuracy by up to 9.2% over prior state-of-the-art baselines while reducing inference cost by up to 22.4%. It enables Claude Haiku 3.5 to outperform Sonnet 3.5 on both math and code tasks, and to come within 1.7% absolute accuracy of Sonnet 4 at 38% lower cost. Beyond efficiency, GCoP also supports modular adaptation by updating the guide without retraining the core.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces a framework called Guide-Core Policies (GCoP) for steering black-box large language models using a guide model that generates structured strategies executed by the core model. It formalizes this under a cost-sensitive utility objective and identifies guide-averaged executability as the key determinant of end-to-end performance. The authors propose ExecTune, a training recipe combining teacher-guided acceptance sampling, supervised fine-tuning, and structure-aware reinforcement learning to optimize for syntactic validity, execution success, and cost efficiency. Experiments on mathematical reasoning and code-generation tasks show that this approach yields accuracy improvements of up to 9.2% and inference cost reductions of up to 22.4% over prior state-of-the-art, enabling smaller models to match or exceed larger ones at lower cost while supporting modular updates to the guide.

Significance. If the results are robust and the contributions of each stage are clearly delineated, this paper could be significant for the development of efficient, cost-effective agentic LLM systems. The formal analysis of GCoP provides a useful abstraction that subsumes various existing approaches and highlights why optimizing executability matters. The practical demonstration of cost savings and performance gains without access to core model internals or gradients is valuable for real-world API-based deployments. The modular adaptation aspect is a notable strength.

major comments (1)
  1. [§4 Experiments] The central claim that ExecTune's three-stage recipe (acceptance sampling + SFT + structure-aware RL) reliably optimizes guide-averaged executability under black-box constraints lacks direct supporting evidence in the form of an ablation study. The manuscript does not show whether the RL stage produces a statistically detectable lift in executability or performance metrics over the acceptance-sampling and SFT phases alone (e.g., via a table comparing variants with variance estimates or significance tests). This is load-bearing because the reported 9.2% accuracy and 22.4% cost gains are attributed to end-to-end optimization, yet RL relies on noisy Monte-Carlo estimates without core gradients; if gains are driven primarily by earlier stages, the attribution to the full GCoP analysis is undermined. (See §4 Experiments and any associated ablation subsection or table comparing training stages.)
minor comments (1)
  1. [Abstract] The abstract reports maximum gains of 'up to 9.2%' accuracy and 'up to 22.4%' cost reduction without identifying the benchmark, baseline method, or model pair that achieves each figure; naming them would help readers assess the scope of the claims.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of the GCoP framework and ExecTune recipe for efficient agentic systems. We address the major comment on the need for ablation studies below and commit to revisions that strengthen the empirical support for the three-stage training process.

Point-by-point responses
  1. Referee: [§4 Experiments] The central claim that ExecTune's three-stage recipe (acceptance sampling + SFT + structure-aware RL) reliably optimizes guide-averaged executability under black-box constraints lacks direct supporting evidence in the form of an ablation study. The manuscript does not show whether the RL stage produces a statistically detectable lift in executability or performance metrics over the acceptance-sampling and SFT phases alone (e.g., via a table comparing variants with variance estimates or significance tests). This is load-bearing because the reported 9.2% accuracy and 22.4% cost gains are attributed to end-to-end optimization, yet RL relies on noisy Monte-Carlo estimates without core gradients; if gains are driven primarily by earlier stages, the attribution to the full GCoP analysis is undermined. (See §4 Experiments and any associated ablation subsection or Table)

    Authors: We agree that the absence of a dedicated ablation study isolating the incremental contribution of the structure-aware RL stage represents a gap in the current manuscript. While the full ExecTune pipeline (acceptance sampling + SFT + RL) is evaluated end-to-end against baselines, direct comparisons of intermediate training stages with variance estimates and significance testing are not provided. This limits the strength of attribution to the complete recipe under the GCoP cost-sensitive utility analysis. In the revised manuscript, we will add a new ablation subsection and table in §4 that reports accuracy, cost, and guide-averaged executability for three variants: (i) teacher-guided acceptance sampling alone, (ii) acceptance sampling followed by supervised fine-tuning, and (iii) the full pipeline including structure-aware RL. Results will include means and standard deviations across multiple random seeds, along with paired statistical significance tests (e.g., t-tests) to assess whether the RL stage yields a detectable improvement. This addition will directly address the concern about noisy Monte-Carlo estimates and clarify the role of each stage in optimizing executability without core-model gradients. revision: yes
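The promised analysis amounts to a paired test across seeds; a minimal standard-library sketch with invented per-seed numbers (not results from the paper):

```python
import math
import statistics

def paired_t(xs, ys):
    """Paired t statistic over per-seed scores of two training variants."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Invented per-seed accuracies, purely to illustrate the proposed test.
sft_only = [0.71, 0.69, 0.72, 0.70, 0.71]  # acceptance sampling + SFT
full_rl  = [0.74, 0.73, 0.75, 0.72, 0.74]  # + structure-aware RL stage
t = paired_t(full_rl, sft_only)  # a large positive t would support an RL lift
```

Pairing by seed controls for run-to-run variance, which matters precisely because the RL stage's reward estimates are noisy Monte-Carlo quantities.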

Circularity Check

1 steps flagged

The claim that GCoP performance is governed by executability reduces to a definitional consequence of the introduced cost-sensitive utility objective

specific steps
  1. self definitional [Abstract]
    "We formalize GCoP under a cost-sensitive utility objective and show that end-to-end performance is governed by guide-averaged executability: the probability that a strategy generated by the guide can be faithfully executed by the core."

    The utility objective is the formalization of GCoP; executability is introduced as its central probabilistic term. The 'show that performance is governed by' statement is therefore a direct restatement of the objective's construction rather than a derived result from independent premises or external data.

full rationale

The paper's central analytical claim—that end-to-end performance is governed by guide-averaged executability—follows directly from formalizing GCoP under a utility objective whose terms explicitly incorporate executability (as the probability of faithful execution) and cost. This is a self-definitional step rather than an independent derivation. However, the subsequent ExecTune recipe (acceptance sampling + SFT + structure-aware RL), the black-box empirical benchmarks, and the reported accuracy/cost deltas are measured on external tasks and do not reduce to the same definitional move, keeping overall circularity moderate.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements remain unknown.

pith-pipeline@v0.9.0 · 5625 in / 1210 out tokens · 38513 ms · 2026-05-10T16:50:45.220910+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 24 canonical work pages · 10 internal anchors

  1. [1]

    AI Agents as Universal Task Solvers: It's All About Time

    Alessandro Achille and Stefano Soatto. AI agents as universal task solvers. Entropy (also arXiv:2510.12066), 2026

  2. [2]

    PromptWizard: Task-Aware Prompt Optimization Framework

    Eshaan Agarwal, Joykirat Singh, Vivek Dani, Raghav Magazine, Tanuja Ganu, and Akshay Nambi. Promptwizard: Task-aware prompt optimization framework. arXiv preprint arXiv:2405.18369, 2024

  3. [3]

    GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, et al. Gepa: Reflective prompt evolution can outperform reinforcement learning. arXiv preprint arXiv:2507.19457, 2025

  4. [4]

    How to Train Your Advisor: Steering Black-Box LLMs with Advisor Models

    Parth Asawa, Alan Zhu, Matei Zaharia, Alexandros G Dimakis, and Joseph E Gonzalez. How to train your advisor: Steering black-box llms with advisor models. arXiv preprint arXiv:2510.02453, 2025

  5. [5]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  6. [6]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023

  7. [7]

    Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022

  8. [8]

    Black-box prompt optimization: Aligning large language models without model training

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. Black-box prompt optimization: Aligning large language models without model training. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3201--3219, 2024

  9. [9]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  10. [10]

    Effectiveness of chain-of-thought in distilling reasoning capability from large language models

    Cong Thanh Do, Rama Sanand Doddipatla, and Kate Knill. Effectiveness of chain-of-thought in distilling reasoning capability from large language models. In Proceedings of the 18th International Natural Language Generation Conference, pp. 833--845, 2025

  11. [11]

    Murphy: Reflective multi-turn reinforcement learning for self-correcting code generation in large language

    Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar Tehrani, Jun Huan, sujay sanghavi, Anoop Deoras, and Stefano Soatto. Murphy: Reflective multi-turn reinforcement learning for self-correcting code generation in large language. In First Workshop on Foundations of Reasoning in Language Models, 2025. URL https://openreview.net/forum?id=x0Ir7cWEiA

  12. [12]

    DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts

    Alisa Liu et al. DExperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 6691--6706, Online, August 2021. Association for Computational Linguistics

  13. [13]

    Qwen3 Technical Report

    An Yang et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388

  14. [14]

    Evaluating Large Language Models Trained on Code

    Mark Chen et al. Evaluating large language models trained on code, 2021 b . URL https://arxiv.org/abs/2107.03374

  15. [15]

    Promptbreeder: Self-Referential Self-Improvement via Prompt Evolution

    Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023

  16. [16]

    MiniLLM: Knowledge Distillation of Large Language Models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023

  17. [17]

    Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes

    Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 8003--8017, 2023

  18. [18]

    Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs

    Sohely Jahan and Ruimin Sun. Black-box behavioral distillation breaks safety alignment in medical llms. arXiv preprint arXiv:2512.09403, 2025

  19. [19]

    e1: Learning Adaptive Control of Reasoning Effort

    Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, and Stefano Soatto. e1: Learning adaptive control of reasoning effort. NeurIPS Workshop on Efficient Reasoning (also arXiv:2510.27042), 2025

  20. [20]

    Agreement-Based Cascading for Efficient Inference

    Steven Kolawole, Don Dennis, Ameet Talwalkar, and Virginia Smith. Agreement-based cascading for efficient inference. arXiv preprint arXiv:2407.02348, 2024

  21. [21]

    Matryoshka pilot: Learning to drive black-box llms with llms

    ChangHao Li, Yuchen Zhuang, Rushi Qiang, Haotian Sun, Hanjun Dai, Chao Zhang, and Bo Dai. Matryoshka pilot: Learning to drive black-box llms with llms. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  22. [22]

    Direct preference knowledge distillation for large language models

    Yixing Li, Yuxian Gu, Li Dong, Dequan Wang, Yu Cheng, and Furu Wei. Direct preference knowledge distillation for large language models. arXiv preprint arXiv:2406.19774, 2024

  23. [23]

    Guiding large language models via directional stimulus prompting

    Zekun Li, Baolin Peng, Pengcheng He, Michel Galley, Jianfeng Gao, and Xifeng Yan. Guiding large language models via directional stimulus prompting. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pp. 62630--62656, 2023

  24. [24]

    Enhancing language model agents using diversity of thoughts

    Vijay Lingam, Behrooz Omidvar Tehrani, Sujay Sanghavi, Gaurav Gupta, Sayan Ghosh, Linbo Liu, Jun Huan, and Anoop Deoras. Enhancing language model agents using diversity of thoughts. In The 13th International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=ZsP3YbYeE9

  25. [25]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534--46594, 2023

  26. [26]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730--27744, 2022

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. Advances in Neural Information Processing Systems, 36:68539--68551, 2023

  28. [28]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  29. [29]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4222--4235, Online, November 2020. Association for Computational Linguistics

  30. [30]

    TRL: Transformer Reinforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer Reinforcement Learning. https://github.com/huggingface/trl, 2020

  31. [31]

    KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

    Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, and Radha Poovendran. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6980--7008, Vienna, Austria, July 2025. Association for Computational Linguistics. doi:10.18653/v1/2025.findings-acl.365. URL https://openreview.net/forum?id=Pnk7vMbznK

  32. [32]

    Fudge: Controlled text generation with future discriminators

    Kevin Yang and Dan Klein. FUDGE: Controlled text generation with future discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3511--3535, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.276

  33. [33]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  34. [34]

    Re-forc: Adaptive reward prediction for efficient chain-of-thought reasoning

    Renos Zabounidis, Aditya Golatkar, Michael Kleinman, Alessandro Achille, Wei Xia, and Stefano Soatto. Re-forc: Adaptive reward prediction for efficient chain-of-thought reasoning. NeurIPS Workshop on Efficient Reasoning (also arXiv:2511.02130), 2025

  35. [35]

    Automatic Chain of Thought Prompting in Large Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493, 2022

  36. [36]

    Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

    Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022
