
arxiv: 2211.01910 · v2 · pith:KMZSOMYPnew · submitted 2022-11-03 · 💻 cs.LG · cs.AI · cs.CL

Large Language Models Are Human-Level Prompt Engineers

Pith reviewed 2026-05-24 09:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords automatic prompt engineering · large language models · instruction generation · zero-shot performance · NLP tasks · few-shot learning · prompt optimization

The pith

Large language models can generate task instructions that match or beat human-written ones on most NLP benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Automatic Prompt Engineer, a procedure that has one LLM propose many candidate instructions for a task and uses the zero-shot performance of a second LLM on a validation set to select the strongest one. Across 24 NLP tasks the selected instructions surpass earlier automatic baselines and reach or exceed the performance of human-written instructions in 19 cases. The same instructions can be prepended to few-shot examples to raise accuracy and can also be used to increase a model's truthfulness or informativeness. If the method works as described, prompt engineering shifts from a manual, trial-and-error activity to an automated search problem.

Core claim

Large language models can serve as prompt engineers: treating the instruction as a program to be synthesized, one LLM proposes a pool of candidates and another LLM scores them by zero-shot accuracy, yielding instructions that outperform the prior LLM baseline and match or exceed human-written instructions on 19 of 24 tasks.

What carries the argument

Automatic Prompt Engineer (APE), a search loop in which one LLM generates instruction candidates and a held-out LLM scores each candidate by its zero-shot performance on a validation set.
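
The propose-then-score loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_proposer` and `toy_evaluator` are hypothetical stand-ins for the two LLMs, the `Input:`/`Output:` prompt template is invented for the example, and exact-match accuracy is assumed as the score function.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (task input, expected output)


def zero_shot_accuracy(instruction: str,
                       examples: Sequence[Example],
                       evaluator: Callable[[str], str]) -> float:
    """Fraction of examples the evaluator answers correctly given only the instruction."""
    if not examples:
        raise ValueError("validation set must not be empty")
    correct = 0
    for x, y in examples:
        # Assumed template: instruction, blank line, then one unanswered query.
        prompt = f"{instruction}\n\nInput: {x}\nOutput:"
        if evaluator(prompt).strip() == y:
            correct += 1
    return correct / len(examples)


def select_instruction(candidates: Sequence[str],
                       validation: Sequence[Example],
                       evaluator: Callable[[str], str]) -> Tuple[str, float]:
    """Score every candidate on the validation set and return the best one."""
    scored = [(zero_shot_accuracy(c, validation, evaluator), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical stand-in for the proposing LLM: returns a fixed candidate pool.
def toy_proposer(n: int) -> List[str]:
    pool = ["Repeat the input.",
            "Write the input in upper case.",
            "Reverse the input."]
    return pool[:n]


# Hypothetical stand-in for the scoring LLM: "follows" the instruction with string rules.
def toy_evaluator(prompt: str) -> str:
    instruction = prompt.split("\n\n")[0]
    x = prompt.split("Input: ")[1].split("\nOutput:")[0]
    if "upper case" in instruction:
        return x.upper()
    if "Reverse" in instruction:
        return x[::-1]
    return x


validation = [("cat", "CAT"), ("dog", "DOG"), ("bird", "BIRD")]
best, score = select_instruction(toy_proposer(3), validation, toy_evaluator)
print(best, score)
```

In a real setting both stand-ins would be replaced by calls to actual models, and the score function could be any metric computed on the validation set.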

If this is right

  • Prepending the automatically selected instructions to standard few-shot prompts raises task accuracy.
  • The same instructions can steer an LLM toward more truthful or more informative outputs.
  • The procedure applies across a diverse collection of 24 NLP tasks without task-specific human tuning.
  • Prompt quality can be treated as an optimizable quantity rather than a fixed human input.
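
The first point above, prepending a selected instruction to a few-shot prompt, amounts to simple string assembly. The sketch below assumes a hypothetical `Input:`/`Output:` layout and a hypothetical helper name `build_prompt`; the paper's actual templates may differ.

```python
from typing import Optional, Sequence, Tuple


def build_prompt(query: str,
                 demos: Sequence[Tuple[str, str]],
                 instruction: Optional[str] = None) -> str:
    """Assemble an in-context prompt, optionally led by an instruction."""
    parts = []
    if instruction:
        # The selected instruction goes first, ahead of all demonstrations.
        parts.append(instruction)
    for x, y in demos:
        parts.append(f"Input: {x}\nOutput: {y}")
    # The final block is the unanswered query the model should complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


demos = [("cat", "CAT"), ("dog", "DOG")]
plain = build_prompt("bird", demos)
with_instruction = build_prompt("bird", demos, "Write the input in upper case.")
print(with_instruction)
```

With `instruction=None` the function yields a standard few-shot prompt, so the two variants can be compared on the same demonstrations.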

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The search approach could be iterated across multiple rounds of proposal and scoring to refine instructions further.
  • The same candidate-generation and scoring loop might transfer to non-classification domains such as code synthesis or open-ended reasoning.
  • Treating instructions as searchable objects opens the possibility of combining APE with other optimization techniques like gradient-based methods on continuous prompt embeddings.

Load-bearing premise

Zero-shot accuracy of a held-out LLM on a validation set is a reliable stand-in for how well the instruction will work with other models or on new data.

What would settle it

Running the generated instructions on a fresh set of tasks or with models that were never used for scoring and finding they fall below human-written instructions.
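
One way such a check could be quantified is a rank correlation between the scores the same instructions receive from the model used for selection and from a model that was never used for scoring. The sketch below implements Spearman's rank correlation in plain Python; the score lists are made-up illustrative numbers, not results from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical accuracies of four instructions under two different evaluator models.
scores_selection_model = [0.42, 0.71, 0.65, 0.90]
scores_unseen_model = [0.38, 0.60, 0.66, 0.81]
rho = spearman(scores_selection_model, scores_unseen_model)
print(round(rho, 3))
```

A value near 1 would indicate that the two models rank the instructions similarly; a low or negative value would suggest the selection score does not transfer.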

Original abstract

By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Automatic Prompt Engineer (APE), which generates candidate natural language instructions using an LLM and selects the best one by maximizing the zero-shot accuracy of a separate held-out LLM on a validation split. On 24 NLP tasks, the selected instructions are reported to outperform a prior LLM baseline by a large margin and to match or exceed human-written instructions on 19/24 tasks. The method is also shown to improve truthfulness/informativeness and few-shot performance when prepended to standard prompts.

Significance. If the proxy-based selection reliably identifies instructions that generalize across models and data regimes, the work would establish a practical, automated alternative to manual prompt engineering and demonstrate that LLMs can reach human-level performance on this meta-task. The approach draws on program synthesis ideas and supplies both quantitative results across many tasks and qualitative analyses; however, the absence of direct evidence that the zero-shot proxy ranking aligns with human judgments or other LLMs limits the strength of the generalization claim.

major comments (3)
  1. [Method and §4] The central selection procedure (described in the method and §4) maximizes zero-shot accuracy of a held-out LLM on a validation split; no experiment is reported that checks whether the induced ranking of instructions correlates with performance under other LLMs, human raters, or held-out test distributions. This proxy is load-bearing for the claim that the selected instructions are “human-level.”
  2. [Abstract and §4] The abstract and experimental sections state that APE outperforms “the prior LLM baseline by a large margin” and matches human instructions on 19/24 tasks, yet no statistical significance tests, confidence intervals, or details on data splits and whether selection was performed on the same data used for final reporting are provided.
  3. [§4] The table or figure reporting per-task results (presumably in §4) does not include the exact baseline instruction templates or the precise zero-shot evaluation protocol used for selection, making it impossible to verify that the reported gains are not artifacts of the particular evaluator LLM.
minor comments (2)
  1. The webpage link is given but no repository or code release is mentioned; adding a pointer to reproducible artifacts would strengthen the paper.
  2. [Method] Notation for the score function and the two LLMs (generator vs. evaluator) should be introduced once and used consistently throughout the method section.
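
The confidence intervals requested in major comment 2 could be obtained, for example, with a paired bootstrap over test examples. The sketch below is one possible approach, not a procedure taken from the paper: `paired_bootstrap_diff` is a hypothetical helper, and the 0/1 correctness vectors are synthetic.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(correct_a: Sequence[int],
                          correct_b: Sequence[int],
                          n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy(A) - accuracy(B) on the same examples."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic per-example correctness (1 = correct) for two instructions on 100 test items.
instruction_a = [1] * 90 + [0] * 10
instruction_b = [1] * 50 + [0] * 50
lower, upper = paired_bootstrap_diff(instruction_a, instruction_b)
print(lower, upper)
```

An interval that excludes zero suggests the accuracy gap is unlikely to be a resampling artifact; this addresses sampling noise only, not the choice of evaluator model.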

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method and §4] The central selection procedure (described in the method and §4) maximizes zero-shot accuracy of a held-out LLM on a validation split; no experiment is reported that checks whether the induced ranking of instructions correlates with performance under other LLMs, human raters, or held-out test distributions. This proxy is load-bearing for the claim that the selected instructions are “human-level.”

    Authors: The selection uses a held-out LLM on a validation split precisely to identify instructions that perform well under zero-shot evaluation for that model family. The human-level claim is grounded in direct comparison: the APE-selected instructions match or exceed human-written ones on 19/24 tasks under identical evaluation. While explicit ranking-correlation experiments across additional LLMs or human raters were not performed, the consistent multi-task results provide supporting evidence for the proxy's utility. We will revise §4 and the method section to clarify this rationale and explicitly note the scope of the generalization claim. revision: partial

  2. Referee: [Abstract and §4] The abstract and experimental sections state that APE outperforms “the prior LLM baseline by a large margin” and matches human instructions on 19/24 tasks, yet no statistical significance tests, confidence intervals, or details on data splits and whether selection was performed on the same data used for final reporting are provided.

    Authors: We agree that statistical significance tests, confidence intervals, and explicit data-split details would improve reporting. The selection was performed on a held-out validation split distinct from the test data used for final numbers. We will add these elements (including per-task significance tests where feasible) to the revised experimental section and abstract if space permits. revision: yes

  3. Referee: [§4] The table or figure reporting per-task results (presumably in §4) does not include the exact baseline instruction templates or the precise zero-shot evaluation protocol used for selection, making it impossible to verify that the reported gains are not artifacts of the particular evaluator LLM.

    Authors: We will revise the paper to include the exact baseline templates and a precise description of the zero-shot evaluation protocol (including the evaluator LLM and split usage) either in the main text or a dedicated appendix, enabling full verification and reproduction. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's APE method generates instruction candidates with one LLM and selects via zero-shot accuracy of a separate held-out LLM on a validation split; this selection uses an external performance metric rather than any quantity derived from the generation process itself. Central claims rest on empirical results across 24 NLP tasks showing outperformance vs. LLM baselines and comparability to human instructions on 19/24 tasks. No self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, or ansatzes smuggled via citation appear in the derivation. The approach is self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on domain assumptions about LLM generative and evaluative capabilities rather than fitted parameters or new mathematical axioms.

axioms (2)
  • domain assumption An LLM prompted appropriately can generate a diverse pool of task instructions that includes high-quality candidates.
    Core to the generation stage of APE.
  • domain assumption Zero-shot accuracy of a separate LLM on a validation set is a monotonic indicator of instruction quality for the target task.
    Used to rank and select the final instruction.

pith-pipeline@v0.9.0 · 5776 in / 1129 out tokens · 26145 ms · 2026-05-24T09:39:45.478697+00:00 · methodology


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.

  2. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

    cs.AI 2026-05 unverdicted novelty 7.0

    Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

  3. PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

    cs.AI 2026-05 unverdicted novelty 7.0

    PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift withi...

  4. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  5. TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

    cs.SE 2026-05 unverdicted novelty 7.0

    TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

  6. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  7. Agile Deliberation: Concept Deliberation for Subjective Visual Classification

    cs.AI 2025-12 conditional novelty 7.0

    Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.

  8. Reflective Prompt Tuning through Language Model Function-Calling

    cs.CL 2026-05 unverdicted novelty 6.0

    Reflective Prompt Tuning uses LLM function calling and diagnostic reports to iteratively optimize prompts, yielding up to 12.9 point gains on reasoning tasks while improving calibration.

  9. optimize_anything: A Universal API for Optimizing any Text Parameter

    cs.CL 2026-05 unverdicted novelty 6.0

    A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

  10. Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

    cs.CL 2026-05 conditional novelty 6.0

    NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.

  11. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  12. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  13. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

  14. LLM-Guided Prompt Evolution for Password Guessing

    cs.CR 2026-04 unverdicted novelty 6.0

    LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.

  15. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  16. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  17. Less Back-and-Forth: A Comparative Study of Structured Prompting

    cs.CL 2026-05 unverdicted novelty 5.0

    Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.

  18. Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

    cs.AI 2026-04 unverdicted novelty 5.0

    Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.

  19. Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

    cs.CL 2026-04 unverdicted novelty 5.0

    AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.

  20. Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

    cs.AI 2026-05 unverdicted novelty 4.0

    TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.

  21. Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

    cs.CL 2024-08 unverdicted novelty 4.0

    GPT-4o and Claude 3.5 Sonnet reach 73.7-74% accuracy on gastroenterology questions; VLMs gain nothing from images and lose accuracy with LLM-generated captions.

  22. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    cs.AI 2024-02 unverdicted novelty 3.0

    A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

  23. Natural Language Processing in the Legal Domain

    cs.CL 2023-02 unverdicted novelty 3.0

    A survey of nearly 1000 NLP & Law papers from 2013-2024 documenting increases in publication volume, scope, methodological sophistication, and data/code availability.

  24. Bridging Language Models and Financial Analysis

    q-fin.ST 2025-03 unverdicted novelty 2.0

    A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.
