Large Language Models as Optimizers
Pith reviewed 2026-05-14 23:59 UTC · model grok-4.3
The pith
Large language models can optimize solutions by iteratively generating new candidates from a prompt that lists all prior attempts together with their scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OPRO lets an LLM optimize an objective by describing the task in natural language and iteratively prompting the model with the history of earlier solutions and their numeric scores; each new generation is evaluated and added to the growing prompt, allowing the language model to propose successively better candidates without any gradient information.
What carries the argument
The OPRO loop in which the LLM receives a prompt containing the optimization objective plus the list of prior solutions paired with their scores, then outputs new candidate solutions that are scored and appended.
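The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `stub_propose` is a hypothetical stand-in for the LLM call, and `build_prompt`, `opro_loop`, and `toy_score` are illustrative names that do not come from the paper.

```python
import random
from typing import Callable, List, Tuple

History = List[Tuple[float, float]]  # (solution, score) pairs


def build_prompt(task: str, history: History) -> str:
    """Render the task description plus the scored history, lowest score first."""
    lines = [task, "Previous solutions and scores:"]
    for solution, value in sorted(history, key=lambda pair: pair[1]):
        lines.append(f"solution={solution:.3f} score={value:.3f}")
    lines.append("Propose a new solution with a higher score.")
    return "\n".join(lines)


def stub_propose(prompt: str, history: History, rng: random.Random) -> float:
    """Hypothetical stand-in for an LLM call.

    A real LLM would see only `prompt`; this stub reads `history` directly
    and perturbs the best solution so the sketch runs offline.
    """
    best_solution, _ = max(history, key=lambda pair: pair[1])
    return best_solution + rng.gauss(0.0, 0.5)


def opro_loop(score: Callable[[float], float], task: str,
              init: float, steps: int, seed: int = 0) -> History:
    """Propose, score, and append one candidate per step."""
    rng = random.Random(seed)
    history: History = [(init, score(init))]
    for _ in range(steps):
        prompt = build_prompt(task, history)
        candidate = stub_propose(prompt, history, rng)
        history.append((candidate, score(candidate)))
    return history


def toy_score(x: float) -> float:
    """Toy black-box objective (an assumption for illustration), maximized at x = 3."""
    return -(x - 3.0) ** 2


history = opro_loop(toy_score, "Maximize -(x - 3)^2.", init=0.0, steps=50)
best_solution, best_score = max(history, key=lambda pair: pair[1])
print(f"best solution {best_solution:.3f} with score {best_score:.3f}")
```

Only the scorer is task-specific here; swapping the stub for a real model call would leave the rest of the loop unchanged.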
If this is right
- Prompt engineering can be replaced by an automated search that requires only an evaluation function and no hand-crafted heuristics.
- The same loop applies directly to other non-differentiable problems such as combinatorial optimization and hyperparameter search.
- Gains appear across multiple LLM families, suggesting the method is not tied to one particular model architecture.
- No task-specific gradient computation or differentiable surrogate is needed beyond the black-box scorer.
- Performance scales with the quality of the underlying LLM used for proposal generation.
Where Pith is reading between the lines
- If the loop generalizes, production systems could replace static prompt libraries with on-demand optimization runs that adapt to new data or metrics.
- The history-of-scored-solutions format resembles population-based search methods, so OPRO could be hybridized with evolutionary or bandit algorithms that maintain explicit populations.
- One could test whether the same prompting strategy improves other discrete search problems such as neural architecture search when evaluation is inexpensive.
- The method raises the open question of how much the LLM is performing genuine reasoning versus surface-level pattern matching on the supplied score history.
Load-bearing premise
That an LLM, when shown a growing list of prior solutions and their numeric scores inside a prompt, will reliably generate new solutions that improve on the best previous score rather than plateau or regress.
What would settle it
Run the OPRO loop on linear regression for fifty steps and record whether the best validation loss continues to decrease, plateaus, or begins to rise after roughly twenty iterations.
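One way to record the outcome of such a run is to compare the loss at the cutoff step with the loss at the final step. The sketch below is illustrative only: `classify_trend`, the default cutoff of 20, and the tolerance are assumptions, and the two curves are synthetic placeholders rather than measured results.

```python
from typing import Sequence


def classify_trend(losses: Sequence[float], cutoff: int = 20,
                   tol: float = 1e-3) -> str:
    """Label the loss curve after `cutoff` as decreasing, plateau, or rising.

    `losses[i]` is the validation loss of the best candidate at step i.
    """
    if len(losses) <= cutoff:
        raise ValueError("need more recorded steps than the cutoff")
    delta = losses[-1] - losses[cutoff]
    if delta < -tol:
        return "decreasing"
    if delta > tol:
        return "rising"
    return "plateau"


# Synthetic curves (made up for illustration, fifty steps each).
synthetic_decreasing = [1.0 / (step + 1) for step in range(50)]
synthetic_plateau = [max(0.1, 1.0 - 0.05 * step) for step in range(50)]
print(classify_trend(synthetic_decreasing), classify_trend(synthetic_plateau))
```

A two-point comparison is deliberately crude; a noisy real curve would likely need smoothing or several seeds before the label means much.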
Original abstract
Optimization is ubiquitous. While derivative-based algorithms have been powerful tools for various problems, the absence of gradient imposes challenges on many real-world applications. In this work, we propose Optimization by PROmpting (OPRO), a simple and effective approach to leverage large language models (LLMs) as optimizers, where the optimization task is described in natural language. In each optimization step, the LLM generates new solutions from the prompt that contains previously generated solutions with their values, then the new solutions are evaluated and added to the prompt for the next optimization step. We first showcase OPRO on linear regression and traveling salesman problems, then move on to our main application in prompt optimization, where the goal is to find instructions that maximize the task accuracy. With a variety of LLMs, we demonstrate that the best prompts optimized by OPRO outperform human-designed prompts by up to 8% on GSM8K, and by up to 50% on Big-Bench Hard tasks. Code at https://github.com/google-deepmind/opro.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Optimization by PROmpting (OPRO), a technique that uses large language models to solve optimization problems: the task is described in natural language, and the model iteratively generates new candidate solutions from a prompt containing previous solutions and their evaluated scores. The method is first tested on linear regression and the traveling salesman problem, then applied to prompt optimization, where the optimized instructions improve task accuracy over human-designed prompts by up to 8% on GSM8K and by up to 50% on Big-Bench Hard tasks across various LLMs.
Significance. If the empirical results are robust, the work is significant for demonstrating a practical, gradient-free optimization framework that leverages the in-context learning abilities of LLMs. This could impact prompt engineering practices and extend to other optimization scenarios where gradients are unavailable. The public release of code supports reproducibility and further exploration.
major comments (2)
- [Prompt optimization experiments] The prompt optimization experiments lack a control baseline that draws the same number of candidate instructions randomly (or via any non-iterative sampling procedure) and retains the best performer after an equivalent evaluation budget. Without this comparison, the headline gains (up to 8% on GSM8K, up to 50% on Big-Bench Hard) cannot be attributed to the iterative history mechanism rather than the sheer volume of evaluated candidates.
- [Experimental results] No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracy improvements. This omission leaves open the possibility that the observed margins arise from stochasticity in LLM generation or evaluation rather than systematic optimization progress.
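The control requested in the first major comment amounts to best-of-N sampling under the same evaluation budget. A minimal sketch follows, with hypothetical names (`best_of_n`, `toy_sample`, `toy_evaluate`) and a toy numeric objective standing in for instruction scoring:

```python
import random
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def best_of_n(sample: Callable[[random.Random], T],
              evaluate: Callable[[T], float],
              budget: int, seed: int = 0) -> Tuple[T, float]:
    """Draw `budget` candidates independently and keep the highest scorer.

    No score is ever fed back into sampling, so this isolates the effect
    of evaluation volume from the effect of the iterative history.
    """
    if budget < 1:
        raise ValueError("budget must be at least 1")
    rng = random.Random(seed)
    best_candidate = sample(rng)
    best_score = evaluate(best_candidate)
    for _ in range(budget - 1):
        candidate = sample(rng)
        value = evaluate(candidate)
        if value > best_score:
            best_candidate, best_score = candidate, value
    return best_candidate, best_score


def toy_sample(rng: random.Random) -> float:
    """Toy candidate generator (assumption): uniform draw from [-10, 10]."""
    return rng.uniform(-10.0, 10.0)


def toy_evaluate(x: float) -> float:
    """Toy scorer (assumption), maximized at x = 3."""
    return -(x - 3.0) ** 2


candidate, value = best_of_n(toy_sample, toy_evaluate, budget=51)
print(f"best-of-51 candidate {candidate:.3f} with score {value:.3f}")
```

For a fair comparison the `budget` should equal the total number of scorer calls made by the iterative run, including any initial candidates.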
minor comments (2)
- [Abstract] The abstract states maximum gains without naming the specific LLM, task variant, or prompt length that achieves each figure; adding these details would improve immediate readability.
- [Method] The method description should specify the truncation or summarization policy applied when the growing list of (solution, score) pairs approaches the model's context limit.
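As an illustration of what such a policy could look like, the sketch below keeps only the k highest-scoring pairs and orders them from lowest to highest score. The function name `truncate_history` and the example strings are hypothetical, and this is one possible policy rather than a statement of what the manuscript does.

```python
import heapq
from typing import List, Tuple

Pair = Tuple[str, float]  # (solution text, score)


def truncate_history(history: List[Pair], k: int) -> List[Pair]:
    """Keep the k highest-scoring pairs, ordered from lowest to highest score."""
    if k < 1:
        raise ValueError("k must be at least 1")
    top = heapq.nlargest(k, history, key=lambda pair: pair[1])
    return sorted(top, key=lambda pair: pair[1])


# Placeholder entries, made up for illustration.
example_history = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
print(truncate_history(example_history, k=2))
```

Bounding the history by count keeps the prompt length predictable; a token-based bound would be an alternative when solutions vary widely in length.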
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to strengthen the empirical claims.
Point-by-point responses
-
Referee: [Prompt optimization experiments] The prompt optimization experiments lack a control baseline that draws the same number of candidate instructions randomly (or via any non-iterative sampling procedure) and retains the best performer after an equivalent evaluation budget. Without this comparison, the headline gains (up to 8% on GSM8K, up to 50% on Big-Bench Hard) cannot be attributed to the iterative history mechanism rather than the sheer volume of evaluated candidates.
Authors: We agree that a non-iterative random baseline with matched evaluation budget is necessary to isolate the contribution of the history-based prompting mechanism. In the revised manuscript we will add results from sampling an identical number of candidate instructions uniformly at random (without feeding prior scores back into the prompt) and retaining the single best performer. This control will be reported alongside the OPRO curves for the GSM8K and Big-Bench Hard tasks.
revision: yes
-
Referee: [Experimental results] No information is given on the number of independent runs, variance across runs, or statistical significance tests for the reported accuracy improvements. This omission leaves open the possibility that the observed margins arise from stochasticity in LLM generation or evaluation rather than systematic optimization progress.
Authors: We acknowledge the omission. The original experiments used fixed random seeds for reproducibility, but we will rerun the key prompt-optimization experiments across multiple independent trials (a minimum of three seeds per task), report the mean accuracy and standard deviation, and include paired statistical significance tests (e.g., McNemar's test or a paired bootstrap) between OPRO-optimized prompts and the human-designed baselines. These results and error bars will be added to the revised tables and figures.
revision: yes
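The reporting promised in this response can be sketched with the standard library alone. The helper names below are hypothetical, the accuracies and per-example outcomes are made-up placeholders, and the exact two-sided McNemar test shown is one common formulation rather than a procedure confirmed by the manuscript.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across independent seeds."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-example outcomes."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    # Discordant pairs: exactly one of the two systems is correct.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail up to the smaller discordant count, doubled.
    tail = sum(math.comb(n, i) for i in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Placeholder data, not results from the paper.
mean_acc, std_acc = summarize_runs([0.80, 0.82, 0.84])
optimized = [True] * 15
baseline = [False] * 5 + [True] * 10
p_value = mcnemar_exact(optimized, baseline)
print(f"mean={mean_acc:.3f} std={std_acc:.3f} p={p_value:.4f}")
```

McNemar's test uses only the examples on which the two prompts disagree, which is why per-example correctness must be logged rather than aggregate accuracy alone.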
Circularity Check
No significant circularity; results are direct empirical measurements
Full rationale
The paper presents OPRO as an iterative prompting procedure in which an LLM is given a growing list of prior (solution, value) pairs and asked to propose new candidates; the candidates are then evaluated on the target task and appended. The headline performance numbers (up to 8% on GSM8K, up to 50% on Big-Bench Hard) are obtained by running this loop to completion and measuring final accuracy on the standard held-out test splits of the benchmarks. No equation or derivation step equates the reported accuracy to any fitted parameter or to the initial prompt set by algebraic identity. The method relies on no load-bearing uniqueness theorem established only by self-citation, no ansatz smuggled in via prior work, and no renaming of a known result as a new derivation. The evaluation is therefore self-contained against external benchmarks and does not reduce to its own inputs by construction.
Forward citations
Cited by 23 Pith papers
-
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
-
Evolutionary Ensemble of Agents
EvE uses co-evolving populations of solvers and guidance states with Elo-based evaluation to autonomously discover a rescale-then-interpolate mechanism for better generalization in In-Context Operator Networks.
-
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
Learning, Fast and Slow: Towards LLMs That Adapt Continually
Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
-
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
-
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
AloLab, an iterative meta-agent prompt optimizer, raises structured output accuracy for 7-9B models from 0% to 84-87% on GSM8K while preserving near-native inference speed.
-
ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis
ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.
-
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
Reflective Context Learning: Studying the Optimization Primitives of Context Space
Reflective Context Learning unifies context optimization for agents by recasting prior methods as instances of a shared learning problem and extending them with classical primitives such as batching, failure replay, a...
-
Self-Optimizing Multi-Agent Systems for Deep Research
Multi-agent deep research systems self-optimize prompts through self-play to match or outperform expert-crafted versions.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
Iterative distillation of experience trains prompting policies that boost black-box LLM performance on reasoning and tool-use tasks from 55-74% to 90-91%.
-
Evolutionary Ensemble of Agents
EvE co-evolves code solvers and guidance states via synchronous races and Elo updates, discovering a rescale-then-interpolate mechanism that enables example-count generalization in ICON.
-
A Control Architecture for Training-Free Memory Use
A training-free control architecture with uncertainty-based routing, confidence-selective acceptance, and evidence-based memory governance improves arithmetic reasoning by +7 points on SVAMP and ASDiv benchmarks.
-
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
-
MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction
MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
A DSPy-based per-stage prompt optimization pipeline with self-consistency achieves second place among full participants in the ArchEHR-QA 2026 EHR QA shared task.
-
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...