arxiv: 2309.03409 · v3 · submitted 2023-09-07 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links

Large Language Models as Optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen


Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords large language models · prompt optimization · optimization · OPRO · GSM8K · Big-Bench Hard · iterative prompting


The pith

Large language models can optimize solutions by iteratively generating new candidates from a prompt that lists all prior attempts together with their scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Optimization by PROmpting (OPRO), a method that turns an LLM into an optimizer simply by writing the task in natural language and feeding the model a growing list of earlier solutions paired with their numeric performance values. At each step the LLM proposes fresh candidate solutions, an external scorer evaluates them, and the new entries are appended so the next prompt contains the entire history. The approach is demonstrated first on linear regression and the traveling salesman problem, then on the central task of automatically discovering task instructions that maximize accuracy. Across several LLMs, the best prompts found by OPRO outperform human-designed prompts by up to 8% on GSM8K and by up to 50% on Big-Bench Hard tasks.

Core claim

OPRO lets an LLM optimize an objective by describing the task in natural language and iteratively prompting the model with the history of earlier solutions and their numeric scores; each new generation is evaluated and added to the growing prompt, allowing the language model to propose successively better candidates without any gradient information.

What carries the argument

The OPRO loop in which the LLM receives a prompt containing the optimization objective plus the list of prior solutions paired with their scores, then outputs new candidate solutions that are scored and appended.
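The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `build_meta_prompt`, `opro_loop`, the prompt wording, and the toy `toy_propose` / `toy_score` functions are all hypothetical names introduced here, and the LLM is replaced by a plain callable so the sketch runs without any model.

```python
from typing import Callable, List, Tuple

# A history entry pairs a candidate solution (as text) with its numeric score.
History = List[Tuple[str, float]]


def build_meta_prompt(task: str, history: History) -> str:
    """Render the task description plus every scored solution seen so far."""
    lines = [task, "", "Previous solutions and their scores:"]
    for solution, score in history:
        lines.append(f"solution: {solution} | score: {score:.3f}")
    lines.append("")
    lines.append("Propose a new solution with a higher score.")
    return "\n".join(lines)


def opro_loop(
    task: str,
    propose: Callable[[str], List[str]],  # stand-in for the optimizer LLM (assumption)
    score: Callable[[str], float],        # external black-box scorer
    init: List[str],
    steps: int,
) -> History:
    """Generate candidates from the prompt, score them, append, repeat."""
    history: History = [(s, score(s)) for s in init]
    for _ in range(steps):
        prompt = build_meta_prompt(task, history)
        for candidate in propose(prompt):
            history.append((candidate, score(candidate)))
    return history


def best(history: History) -> Tuple[str, float]:
    """Return the highest-scoring (solution, score) pair."""
    return max(history, key=lambda pair: pair[1])


# Toy stand-ins so the sketch is runnable: solutions are integers written as
# text, the objective peaks at 7, and the "LLM" simply proposes the number of
# solutions it sees in the prompt.
def toy_score(solution: str) -> float:
    return float(-abs(int(solution) - 7))


def toy_propose(prompt: str) -> List[str]:
    return [str(prompt.count("solution:"))]
```

In a real run `propose` would wrap a call to an LLM and parse candidate solutions out of its completion; the scorer stays a black box, which is why no gradient information is needed.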

If this is right

Prompt engineering can be replaced by an automated search that requires only an evaluation function and no hand-crafted heuristics.
The same loop applies directly to other non-differentiable problems such as combinatorial optimization and hyperparameter search.
Gains appear across multiple LLM families, suggesting the method is not tied to one particular model architecture.
No task-specific gradient computation or differentiable surrogate is needed beyond the black-box scorer.
Performance scales with the quality of the underlying LLM used for proposal generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the loop generalizes, production systems could replace static prompt libraries with on-demand optimization runs that adapt to new data or metrics.
The history-of-scored-solutions format resembles population-based search methods, so OPRO could be hybridized with evolutionary or bandit algorithms that maintain explicit populations.
One could test whether the same prompting strategy improves other discrete search problems such as neural architecture search when evaluation is inexpensive.
The method raises the open question of how much the LLM is performing genuine reasoning versus surface-level pattern matching on the supplied score history.

Load-bearing premise

That an LLM, when shown a growing list of prior solutions and their numeric scores inside a prompt, will reliably generate new solutions that improve on the best previous score rather than plateau or regress.

What would settle it

Run the OPRO loop on linear regression for fifty steps and record whether the best validation loss continues to decrease, plateaus, or begins to rise after roughly twenty iterations.
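One way the three outcomes could be told apart from a recorded loss trajectory is sketched below. The function name `classify_trend`, the window of ten steps, and the tolerance are assumptions made for illustration; the paper does not prescribe this test.

```python
from statistics import mean
from typing import Sequence


def classify_trend(losses: Sequence[float], window: int = 10, tol: float = 1e-3) -> str:
    """Compare the mean loss of the last `window` steps with the `window` steps before them.

    Returns "decreasing", "rising", or "plateau". `window` and `tol` are
    illustrative defaults, not values taken from the paper.
    """
    if window < 1 or len(losses) < 2 * window:
        raise ValueError("need window >= 1 and at least 2 * window loss values")
    earlier = mean(losses[-2 * window:-window])
    recent = mean(losses[-window:])
    if recent < earlier - tol:
        return "decreasing"
    if recent > earlier + tol:
        return "rising"
    return "plateau"
```

Feeding it the fifty per-step validation losses from a run gives a single label for the tail of the curve; shifting the window earlier would probe the behavior around the twentieth iteration instead.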

Original abstract

Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

OPRO turns an LLM into a prompt optimizer by feeding it prior candidates plus scores, and the reported lifts on GSM8K and Big-Bench Hard are worth checking, but the work still needs a random-sampling control to show the loop itself matters.


The paper's main contribution is a simple loop they call OPRO: describe the optimization goal in natural language, have the LLM generate new candidate solutions (prompts in the main experiments), score them, and append the history of (candidate, score) pairs to the next prompt. They first verify the idea on linear regression and TSP, then scale it to instruction search. The headline numbers are that the best OPRO prompts beat human-designed ones by up to 8% on GSM8K and up to 50% on certain Big-Bench Hard tasks, with code released at the GitHub link in the abstract.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Optimization by PROmpting (OPRO), a technique that employs large language models to solve optimization problems by describing the task in natural language and iteratively generating new candidate solutions based on a prompt containing previous solutions and their evaluated scores. The method is first tested on linear regression and the traveling salesman problem, then applied to prompt optimization, where it is shown to produce instructions that improve task accuracy over human-designed prompts by up to 8% on GSM8K and 50% on Big-Bench Hard tasks across various LLMs.

Significance. If the empirical results are robust, the work is significant for demonstrating a practical, gradient-free optimization framework that leverages the in-context learning abilities of LLMs. This could impact prompt engineering practices and extend to other optimization scenarios where gradients are unavailable. The public release of code supports reproducibility and further exploration.

major comments (2)

[Prompt optimization experiments] The prompt optimization experiments lack a control baseline that draws the same number of candidate instructions randomly (or via any non-iterative sampling procedure) and retains the best performer after an equivalent evaluation budget. Without this comparison, the headline gains (up to 8% on GSM8K, up to 50% on Big-Bench Hard) cannot be attributed to the iterative history mechanism rather than the sheer volume of evaluated candidates.
[Experimental results] No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracy improvements. This omission leaves open the possibility that the observed margins arise from stochasticity in LLM generation or evaluation rather than systematic optimization progress.
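The control requested in the first major comment could look roughly like the sketch below: draw candidates without any score feedback, evaluate exactly as many as the iterative run did, and keep the best. The name `best_of_n_baseline`, the idea of a pre-generated candidate `pool`, and the seed handling are assumptions for illustration only.

```python
import random
from typing import Callable, Sequence, Tuple


def best_of_n_baseline(
    pool: Sequence[str],            # candidates generated WITHOUT score feedback (assumption)
    score: Callable[[str], float],  # same scorer the iterative run uses
    budget: int,                    # number of evaluations the iterative run spent
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample up to `budget` candidates uniformly without replacement; return the best."""
    if budget < 1 or not pool:
        raise ValueError("budget must be >= 1 and pool must be non-empty")
    rng = random.Random(seed)
    candidates = rng.sample(list(pool), k=min(budget, len(pool)))
    scored = [(c, score(c)) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```

If the iterative loop's best score does not clearly exceed what this baseline returns under the same budget, the gain would be hard to attribute to the score history rather than to the number of candidates tried.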

minor comments (2)

[Abstract] The abstract states maximum gains without naming the specific LLM, task variant, or prompt length that achieves each figure; adding these details would improve immediate readability.
[Method] The method description should specify the truncation or summarization policy applied when the growing list of (solution, score) pairs approaches the model's context limit.
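To make the second minor comment concrete, one plausible truncation policy is to keep only the highest-scoring pairs and list them from worst to best. This is a sketch of one option, not a description of what the paper does; `truncate_history` and the default of 20 pairs are hypothetical.

```python
from typing import List, Tuple


def truncate_history(
    history: List[Tuple[str, float]], max_pairs: int = 20
) -> List[Tuple[str, float]]:
    """Keep the `max_pairs` highest-scoring pairs, ordered from worst to best.

    `max_pairs` is an illustrative default; in practice it would be chosen
    so the rendered prompt fits the model's context window.
    """
    if max_pairs <= 0:
        return []
    ranked = sorted(history, key=lambda pair: pair[1])  # ascending by score
    return ranked[-max_pairs:]
```

Alternatives such as keeping the most recent pairs, or summarizing older ones, would trade exploitation of the best candidates against diversity in the prompt.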

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the empirical claims.


Referee: [Prompt optimization experiments] The prompt optimization experiments lack a control baseline that draws the same number of candidate instructions randomly (or via any non-iterative sampling procedure) and retains the best performer after an equivalent evaluation budget. Without this comparison, the headline gains (up to 8% on GSM8K, up to 50% on Big-Bench Hard) cannot be attributed to the iterative history mechanism rather than the sheer volume of evaluated candidates.

Authors: We agree that a non-iterative random baseline with matched evaluation budget is necessary to isolate the contribution of the history-based prompting mechanism. In the revised manuscript we will add results from sampling an identical number of candidate instructions uniformly at random (without feeding prior scores back into the prompt) and retaining the single best performer. This control will be reported alongside the OPRO curves for the GSM8K and Big-Bench Hard tasks.

revision: yes

Referee: [Experimental results] No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracy improvements. This omission leaves open the possibility that the observed margins arise from stochasticity in LLM generation or evaluation rather than systematic optimization progress.

Authors: We acknowledge the omission. The original experiments used fixed random seeds for reproducibility, but we will rerun the key prompt-optimization experiments across multiple independent trials (minimum of three seeds per task), report mean accuracy and standard deviation, and include paired statistical significance tests (e.g., McNemar or bootstrap) between OPRO-optimized prompts and the human-designed baselines. These results and error bars will be added to the revised tables and figures.

revision: yes
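As an illustration of the bootstrap option mentioned in the response, the sketch below computes a percentile confidence interval for the accuracy difference between two prompts evaluated on the same examples. The function name, the number of resamples, and the per-example 0/1 correctness inputs are assumptions; nothing here is taken from the paper's code.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],  # 1 if prompt A answered example i correctly, else 0
    correct_b: Sequence[int],  # same examples, prompt B
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on paired examples."""
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping A/B pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper
```

An interval that excludes zero would suggest the accuracy gap is unlikely to be resampling noise over the test examples; it does not by itself account for run-to-run variation in the optimizer, which is what the multiple-seed reruns are meant to cover.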

Circularity Check

0 steps flagged

No significant circularity; results are direct empirical measurements


The paper presents OPRO as an iterative prompting procedure in which an LLM is given a growing list of prior (solution, value) pairs and asked to propose new candidates; the candidates are then evaluated on the target task and appended. The headline performance numbers (up to 8% on GSM8K, up to 50% on Big-Bench Hard) are obtained by running this loop to completion and measuring final accuracy on the standard held-out test splits of the benchmarks. No equation or derivation step equates the reported accuracy to any fitted parameter or to the initial prompt set by algebraic identity. The method relies on no load-bearing self-cited uniqueness theorem, no ansatz smuggled in via prior work, and no renaming of a known result as a new derivation. The evaluation is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach treats the LLM as a black-box generator whose iterative improvement behavior is observed empirically; no explicit free parameters, mathematical axioms, or new invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5494 in / 1055 out tokens · 38608 ms · 2026-05-14T23:59:25.433751+00:00 · methodology


Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 7.0

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
Evolutionary Ensemble of Agents
cs.NE 2026-05 unverdicted novelty 7.0

EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
cs.SE 2026-05 unverdicted novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
cs.CR 2026-04 unverdicted novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
Learning, Fast and Slow: Towards LLMs That Adapt Continually
cs.LG 2026-05 unverdicted novelty 6.0

Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
cs.AI 2026-05 unverdicted novelty 6.0

FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
cs.CL 2026-05 conditional novelty 6.0

AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
cs.AI 2026-04 unverdicted novelty 6.0

ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
cs.AI 2026-04 unverdicted novelty 6.0

Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
cs.AI 2026-04 unverdicted novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
Reflective Context Learning: Studying the Optimization Primitives of Context Space
cs.LG 2026-04 unverdicted novelty 6.0

Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
Self-Optimizing Multi-Agent Systems for Deep Research
cs.IR 2026-04 unverdicted novelty 6.0

Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
cs.AI 2026-05 unverdicted novelty 5.0

Iterative distillation of experience trains prompting policies that boost black-box LLM performance on reasoning and tool-use tasks from 55-74% to 90-91%.
Evolutionary Ensemble of Agents
cs.NE 2026-05 unverdicted novelty 5.0

EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
A Control Architecture for Training-Free Memory Use
cs.AI 2026-04 unverdicted novelty 5.0

A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
cs.AI 2026-04 unverdicted novelty 5.0

Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
cs.CY 2026-04 unverdicted novelty 4.0

MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
cs.AI 2025-07 accept novelty 4.0

The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
cs.CL 2026-05 unverdicted novelty 3.0

A DSPy-based per-stage prompt optimization pipeline with self-consistency achieves second place among full participants in the ArchEHR-QA 2026 EHR QA shared task.
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
cs.AI 2024-02 unverdicted novelty 3.0

A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
