Large Language Models Are Human-Level Prompt Engineers
Pith reviewed 2026-05-24 09:39 UTC · model grok-4.3
The pith
Large language models can generate task instructions that match or beat human-written ones on 19 of the 24 NLP tasks evaluated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large language models can serve as prompt engineers: by treating the instruction as a program to be synthesized, one LLM proposes a pool of candidate instructions and a second LLM scores them by zero-shot accuracy, yielding instructions that outperform the prior LLM baseline and perform comparably to or better than human-written instructions on 19 of 24 tasks.
What carries the argument
Automatic Prompt Engineer (APE), a search loop in which one LLM generates instruction candidates and a held-out LLM scores each candidate by its zero-shot performance on a validation set.
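The propose-then-score loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_model` is a hypothetical stand-in for the scoring LLM, the candidate pool is hard-coded rather than LLM-generated, and exact string match is assumed as the accuracy criterion.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]          # (input text, expected output)
Model = Callable[[str, str], str]  # (instruction, input) -> model output


def zero_shot_accuracy(instruction: str, examples: List[Example], run_model: Model) -> float:
    """Fraction of examples answered exactly right under the instruction."""
    if not examples:
        raise ValueError("need at least one validation example")
    hits = sum(run_model(instruction, x).strip() == y for x, y in examples)
    return hits / len(examples)


def select_instruction(candidates: List[str], validation: List[Example], run_model: Model):
    """Score every candidate and return (best_instruction, best_score)."""
    scored = [(zero_shot_accuracy(c, validation, run_model), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-in for the scoring LLM: it pluralizes only when told to.
def toy_model(instruction: str, x: str) -> str:
    return x + "s" if "plural" in instruction.lower() else x


# In a real run the candidates would come from a proposal LLM (assumption).
candidates = ["Repeat the word.", "Write the plural of the word."]
validation = [("cat", "cats"), ("dog", "dogs"), ("tree", "trees")]
best, score = select_instruction(candidates, validation, toy_model)
print(best, score)
```

Swapping `toy_model` for a real model call leaves the selection logic unchanged, which is why the choice of scorer matters so much to the result.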
If this is right
- Prepending the automatically selected instructions to standard few-shot prompts raises task accuracy.
- The same instructions can steer an LLM toward more truthful or more informative outputs.
- The procedure applies across a diverse collection of 24 NLP tasks without task-specific human tuning.
- Prompt quality can be treated as an optimizable quantity rather than a fixed human input.
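As a rough illustration of the first point, prepending a selected instruction to a few-shot prompt amounts to string assembly. The `Input:`/`Output:` layout and the function name below are assumptions made for this sketch, not the template used in the paper.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str, demos: List[Tuple[str, str]], query: str) -> str:
    """Place the instruction first, then the demonstrations, then the open query."""
    lines = [instruction, ""]
    for x, y in demos:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


instruction = "Write the plural of the word."  # illustrative instruction
demos = [("cat", "cats"), ("dog", "dogs")]
prompt = build_few_shot_prompt(instruction, demos, "tree")
print(prompt)
```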
Where Pith is reading between the lines
- The search approach could be iterated across multiple rounds of proposal and scoring to refine instructions further.
- The same candidate-generation and scoring loop might transfer to non-classification domains such as code synthesis or open-ended reasoning.
- Treating instructions as searchable objects opens the possibility of combining APE with other optimization techniques like gradient-based methods on continuous prompt embeddings.
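The multi-round idea in the first point could look roughly like the following sketch: keep the top-scoring instructions, generate variations of them, and repeat. Everything here is hypothetical. `toy_score` and `toy_vary` stand in for an LLM-based scorer and an LLM-based rewriter, and the keyword-counting score is only there to make the loop deterministic.

```python
from typing import Callable, List


def iterate_search(seed: List[str],
                   score: Callable[[str], float],
                   vary: Callable[[str], List[str]],
                   rounds: int = 3,
                   keep: int = 2) -> str:
    """Repeatedly keep the best `keep` candidates and expand them with variations."""
    pool = list(seed)
    for _ in range(rounds):
        survivors = sorted(pool, key=score, reverse=True)[:keep]
        pool = survivors + [v for s in survivors for v in vary(s)]
    return max(pool, key=score)


KEYWORDS = ["plural", "noun", "answer"]  # toy notion of a "good" instruction


def toy_score(instruction: str) -> float:
    """Hypothetical scorer: counts how many keywords the instruction contains."""
    return float(sum(k in instruction for k in KEYWORDS))


def toy_vary(instruction: str) -> List[str]:
    """Hypothetical rewriter: appends each keyword that is still missing."""
    return [f"{instruction} {k}" for k in KEYWORDS if k not in instruction]


refined = iterate_search(["Write the"], toy_score, toy_vary)
print(refined, toy_score(refined))
```

Each extra round multiplies the number of scoring calls, so in practice the `rounds` and `keep` settings would trade search quality against evaluation cost.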
Load-bearing premise
Zero-shot accuracy of a held-out LLM on a validation set is a reliable stand-in for how well the instruction will work with other models or on new data.
What would settle it
Running the generated instructions on a fresh set of tasks or with models that were never used for scoring and finding they fall below human-written instructions.
Original abstract
By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Automatic Prompt Engineer (APE), which generates candidate natural language instructions using an LLM and selects the best one by maximizing the zero-shot accuracy of a separate held-out LLM on a validation split. On 24 NLP tasks, the selected instructions are reported to outperform a prior LLM baseline by a large margin and to match or exceed human-written instructions on 19/24 tasks. The method is also shown to improve truthfulness/informativeness and few-shot performance when prepended to standard prompts.
Significance. If the proxy-based selection reliably identifies instructions that generalize across models and data regimes, the work would establish a practical, automated alternative to manual prompt engineering and demonstrate that LLMs can reach human-level performance on this meta-task. The approach draws on program synthesis ideas and supplies both quantitative results across many tasks and qualitative analyses; however, the absence of direct evidence that the zero-shot proxy ranking aligns with human judgments or other LLMs limits the strength of the generalization claim.
major comments (3)
- [Method and §4] The central selection procedure (described in the method and §4) maximizes zero-shot accuracy of a held-out LLM on a validation split; no experiment is reported that checks whether the induced ranking of instructions correlates with performance under other LLMs, human raters, or held-out test distributions. This proxy is load-bearing for the claim that the selected instructions are “human-level.”
- [Abstract and §4] The abstract and experimental sections state that APE outperforms “the prior LLM baseline by a large margin” and matches human instructions on 19/24 tasks, yet no statistical significance tests, confidence intervals, or details on data splits and whether selection was performed on the same data used for final reporting are provided.
- [§4] Table or figure reporting per-task results (presumably in §4) does not include the exact baseline instruction templates or the precise zero-shot evaluation protocol used for selection, making it impossible to verify that the reported gains are not artifacts of the particular evaluator LLM.
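The ranking check requested in the first major comment could be run as a rank correlation between the scores two different evaluators assign to the same candidate instructions. The sketch below computes Spearman's rho using only the standard library; the score lists are made-up illustrative numbers, not results from the paper.

```python
from typing import List


def average_ranks(values: List[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a: List[float], b: List[float]) -> float:
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5


# Illustrative accuracies of four candidate instructions under two evaluators.
scores_evaluator_a = [0.9, 0.5, 0.7, 0.2]
scores_evaluator_b = [0.8, 0.4, 0.6, 0.1]
rho = spearman(scores_evaluator_a, scores_evaluator_b)
print(rho)
```

A rho near 1 would suggest the proxy ranks instructions similarly under both evaluators; a low or negative value would support the referee's concern.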
minor comments (2)
- The webpage link is given but no repository or code release is mentioned; adding a pointer to reproducible artifacts would strengthen the paper.
- [Method] Notation for the score function and the two LLMs (generator vs. evaluator) should be introduced once and used consistently throughout the method section.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
- Referee: [Method and §4] The central selection procedure (described in the method and §4) maximizes zero-shot accuracy of a held-out LLM on a validation split; no experiment is reported that checks whether the induced ranking of instructions correlates with performance under other LLMs, human raters, or held-out test distributions. This proxy is load-bearing for the claim that the selected instructions are “human-level.”
  Authors: The selection uses a held-out LLM on a validation split precisely to identify instructions that perform well under zero-shot evaluation for that model family. The human-level claim is grounded in direct comparison: the APE-selected instructions match or exceed human-written ones on 19/24 tasks under identical evaluation. While explicit ranking-correlation experiments across additional LLMs or human raters were not performed, the consistent multi-task results provide supporting evidence for the proxy's utility. We will revise §4 and the method section to clarify this rationale and explicitly note the scope of the generalization claim.
  Revision: partial
- Referee: [Abstract and §4] The abstract and experimental sections state that APE outperforms “the prior LLM baseline by a large margin” and matches human instructions on 19/24 tasks, yet no statistical significance tests, confidence intervals, or details on data splits and whether selection was performed on the same data used for final reporting are provided.
  Authors: We agree that statistical significance tests, confidence intervals, and explicit data-split details would improve reporting. The selection was performed on a held-out validation split distinct from the test data used for final numbers. We will add these elements (including per-task significance tests where feasible) to the revised experimental section and abstract if space permits.
  Revision: yes
- Referee: [§4] Table or figure reporting per-task results (presumably in §4) does not include the exact baseline instruction templates or the precise zero-shot evaluation protocol used for selection, making it impossible to verify that the reported gains are not artifacts of the particular evaluator LLM.
  Authors: We will revise the paper to include the exact baseline templates and a precise description of the zero-shot evaluation protocol (including the evaluator LLM and split usage) either in the main text or a dedicated appendix, enabling full verification and reproduction.
  Revision: yes
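One way the promised confidence intervals could be reported is a paired percentile bootstrap over test examples. The sketch below is an assumption about how such an analysis might be set up, not the authors' protocol; the per-example correctness vectors are invented for illustration.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(correct_a: List[int],
                        correct_b: List[int],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy difference, CI lower, CI upper) for A minus B on paired examples."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equal-length, non-empty correctness lists")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples with replacement
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(correct_a) - sum(correct_b)) / n
    return point, lower, upper


# Invented 0/1 correctness on the same ten test examples for two instructions.
ape_correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
human_correct = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
point, lower, upper = paired_bootstrap_ci(ape_correct, human_correct)
print(point, lower, upper)
```

If the interval excludes zero, the difference between the two instructions is unlikely to be a resampling artifact; with only a handful of examples per task the interval will typically be wide.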
Circularity Check
No significant circularity detected
The paper's APE method generates instruction candidates with one LLM and selects via zero-shot accuracy of a separate held-out LLM on a validation split; this selection uses an external performance metric rather than any quantity derived from the generation process itself. Central claims rest on empirical results across 24 NLP tasks showing outperformance vs. LLM baselines and comparability to human instructions on 19/24 tasks. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, or ansatzes smuggled via citation appear in the derivation. The approach is self-contained against external benchmarks with no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: An LLM prompted appropriately can generate a diverse pool of task instructions that includes high-quality candidates.
- domain assumption: Zero-shot accuracy of a separate LLM on a validation set is a monotonic indicator of instruction quality for the target task.
Forward citations
Cited by 24 Pith papers
- TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization
  A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
- Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
  Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
- PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI
  PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift withi...
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...
- TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments
  TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.
- Unlocking Prompt Infilling Capability for Diffusion Language Models
  Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
- Agile Deliberation: Concept Deliberation for Subjective Visual Classification
  Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.
- Reflective Prompt Tuning through Language Model Function-Calling
  Reflective Prompt Tuning uses LLM function calling and diagnostic reports to iteratively optimize prompts, yielding up to 12.9 point gains on reasoning tasks while improving calibration.
- optimize_anything: A Universal API for Optimizing any Text Parameter
  A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.
- Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering
  NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.
- Learning, Fast and Slow: Towards LLMs That Adapt Continually
  Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.
- How Far Are Video Models from True Multimodal Reasoning?
  Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems
  Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.
- LLM-Guided Prompt Evolution for Password Guessing
  LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.
- Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees
  POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
- Less Back-and-Forth: A Comparative Study of Structured Prompting
  Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.
- Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis
  Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.
- Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM
  AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.
- Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate
  TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.
- Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models
  GPT-4o and Claude 3.5 Sonnet reach 73.7-74% accuracy on gastroenterology questions; VLMs gain nothing from images and lose accuracy with LLM-generated captions.
- A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
  A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...
- Natural Language Processing in the Legal Domain
  A survey of nearly 1000 NLP & Law papers from 2013-2024 documenting increases in publication volume, scope, methodological sophistication, and data/code availability.
- Bridging Language Models and Financial Analysis
  A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.