pith. machine review for the scientific record.

arxiv: 2310.11324 · v2 · submitted 2023-10-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 01:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords prompt sensitivity · LLM evaluation · few-shot prompting · prompt formatting · model comparison

The pith

Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that popular large language models are highly sensitive to subtle, meaning-preserving changes in how few-shot prompts are formatted. This sensitivity produces large performance gaps, such as 76 accuracy points on LLaMA-2-13B, and persists even when models grow larger, receive more examples, or undergo instruction tuning. Different models also favor different formats, so comparing them with one arbitrarily chosen format lacks validity. To make evaluation more reliable, the authors introduce FormatSpread, an algorithm that samples many plausible formats and reports the resulting performance interval without needing model weights.

Core claim

Large language models exhibit extreme sensitivity to prompt formatting in few-shot settings, where minor changes that preserve meaning can alter performance by as much as 76 accuracy points on models like LLaMA-2-13B. This effect holds across increases in model size, number of shots, and even after instruction tuning. Analysis reveals weak correlation in format preferences across models, challenging the practice of evaluating and comparing models with a single fixed prompt format. The work introduces FormatSpread to efficiently sample and report performance ranges over plausible formats.

What carries the argument

FormatSpread, an algorithm that samples a set of plausible prompt formats for a task and computes the expected performance interval without requiring access to model weights.
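
A minimal sketch helps make the carrier concrete. The code below is illustrative rather than the authors' implementation: render_prompt, the perturbation lists, and evaluate_accuracy are invented stand-ins, and it enumerates a tiny format grammar exhaustively, whereas the paper's algorithm samples the much larger space efficiently (its citations to Thompson sampling suggest a bandit-style search).

# Minimal sketch of a FormatSpread-style interval computation. Everything
# here is illustrative: `evaluate_accuracy` stands in for any black-box
# routine that scores one format on a task sample, and the real algorithm
# samples the (much larger) format space instead of enumerating it.
from itertools import product

# Atomic, meaning-preserving formatting choices: separators, spacing, casing.
SEPARATORS = [": ", " - ", "\n"]
SPACES = ["", " "]
CASINGS = [str.lower, str.title, str.upper]


def render_prompt(field, value, sep, space, casing):
    """Render one field/value pair under one concrete format choice."""
    return f"{casing(field)}{sep}{space}{value}"


def format_spread(evaluate_accuracy):
    """Return the (min, max) accuracy interval over the sampled formats."""
    scores = []
    for sep, space, casing in product(SEPARATORS, SPACES, CASINGS):
        fmt = lambda f, v: render_prompt(f, v, sep, space, casing)
        scores.append(evaluate_accuracy(fmt))
    return min(scores), max(scores)


if __name__ == "__main__":
    import random

    # Stub evaluator; a real one would build few-shot prompts with `fmt`,
    # query the model, and score its predictions against gold labels.
    lo, hi = format_spread(lambda fmt: random.uniform(0.2, 0.9))
    print(f"accuracy interval [{lo:.2f}, {hi:.2f}], spread {hi - lo:.2f}")

The returned interval, not either endpoint alone, is the quantity FormatSpread reports.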

If this is right

  • Evaluations of LLMs should report performance ranges across multiple plausible prompt formats rather than single-point estimates.
  • Direct comparisons between models using one fixed prompt format are unreliable because format performance correlates only weakly across models.
  • Sensitivity to formatting persists even as models increase in size or receive instruction tuning.
  • Researchers must consider a wider space of prompt designs when assessing model capabilities on few-shot tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If format sensitivity reaches this level, many published benchmark scores may partly reflect the chosen formatting rather than intrinsic model ability.
  • Applying FormatSpread to additional tasks and models could reveal whether the issue is concentrated in particular domains or architectures.
  • One extension would be to check whether closed-source models accessed only via API show comparable sensitivity when FormatSpread is run through their interfaces.

Load-bearing premise

The tested formatting variations and the formats sampled by FormatSpread adequately represent the full space of meaning-preserving prompt designs that real users might employ.
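
How demanding that premise is depends on the size of the space being represented. As a back-of-envelope illustration (the slot names and option counts below are invented for this note, not taken from the paper), a handful of independent formatting slots already multiplies into hundreds of plausible formats:

# Back-of-envelope count of a plausible-format space. The slots and option
# counts are invented for illustration; they are not the paper's grammar.
from math import prod

slots = {
    "field-name casing": 3,   # lower / Title / UPPER
    "field separator": 4,     # ":", ": ", " - ", newline
    "item separator": 3,      # "\n", "\n\n", " || "
    "label verbalizer": 4,    # "0/1", "A/B", "yes/no", "true/false"
    "enumeration style": 3,   # none, "1.", "(a)"
}

n_formats = prod(slots.values())
print(f"{n_formats} distinct formats from {len(slots)} independent slots")
# -> 432 distinct formats from 5 independent slots

Sampling from a space this size is tractable; certifying that the sample is representative of what real users type is the part the premise takes on faith.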

What would settle it

Testing a much broader set of prompt formats on LLaMA-2-13B and finding performance variation substantially below 76 accuracy points would weaken the sensitivity claim.

read the original abstract

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points on LLaMA-2-13B. Sensitivity persists across model sizes, few-shot example counts, and instruction tuning. The authors argue that single-format reporting is insufficient and propose FormatSpread, an algorithm that samples plausible prompt formats via atomic perturbations (separators, verbalizers, ordering) to report performance intervals without model weight access. They further show weak cross-model format correlations, questioning fixed-format model comparisons, and analyze the influence of specific perturbations on internal representations.

Significance. If the empirical results hold after addressing validation gaps, this work would meaningfully advance LLM evaluation methodology by demonstrating that prompt formatting can dominate reported accuracies and by providing a practical, model-agnostic tool (FormatSpread) for quantifying variability. It supplies concrete numbers across multiple models and regimes, supporting calls for range reporting over single-point estimates and highlighting risks in current benchmarking practices.

major comments (2)
  1. [§3, FormatSpread description] The central claim that sampled formats are both plausible and meaning-preserving rests on the unvalidated assumption that combinations of atomic changes (e.g., bullet styles, newlines, label verbalizers) produce a representative distribution of user-like prompts. No human semantic-equivalence ratings, usage-log comparison, or external anchor is reported, which directly affects whether the 76-point spread on LLaMA-2-13B should be interpreted as genuine sensitivity or as an artifact of implausible tokenization/attention shifts.
  2. [Experimental setup and results] The headline sensitivity numbers and persistence claims lack reported details on data splits, number of evaluation runs per format, or statistical tests for the performance intervals. Without these, it is difficult to gauge whether the observed ranges are robust or sensitive to sampling variance in the FormatSpread procedure.
minor comments (2)
  1. [Abstract] The maximum 76-point difference is stated without naming the specific task or dataset; adding this detail would improve immediate interpretability.
  2. [Analysis of atomic perturbations] The discussion of which formatting elements drive the largest changes could be strengthened by reporting effect sizes or ablation tables rather than qualitative observations alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below. Where the manuscript is missing details or justification, we will revise accordingly.

read point-by-point responses
  1. Referee: [§3, FormatSpread description] The central claim that sampled formats are both plausible and meaning-preserving rests on the unvalidated assumption that combinations of atomic changes (e.g., bullet styles, newlines, label verbalizers) produce a representative distribution of user-like prompts. No human semantic-equivalence ratings, usage-log comparison, or external anchor is reported, which directly affects whether the 76-point spread on LLaMA-2-13B should be interpreted as genuine sensitivity or as an artifact of implausible tokenization/attention shifts.

    Authors: The atomic perturbations are restricted to formatting elements that do not alter the underlying task semantics or example content; each variant remains a valid few-shot prompt for the same classification task. While we did not conduct new human equivalence ratings, the perturbation types are drawn from variations routinely used in the prompting literature and in public model cards. The observed spreads therefore demonstrate sensitivity to these standard formatting choices rather than to arbitrary or nonsensical strings. We will add an explicit discussion of the design rationale for the perturbation set and a note on its relation to documented prompting practices. revision: partial

  2. Referee: [Experimental setup and results] The headline sensitivity numbers and persistence claims lack reported details on data splits, number of evaluation runs per format, or statistical tests for the performance intervals. Without these, it is difficult to gauge whether the observed ranges are robust or sensitive to sampling variance in the FormatSpread procedure.

    Authors: We will expand the experimental sections to specify the exact train/test splits, the number of independent evaluation runs performed per format, and the statistical measures (e.g., standard deviation or bootstrap intervals) used to characterize the reported performance ranges. These additions will allow readers to assess sampling variance directly. revision: yes
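
A percentile bootstrap of the kind promised here is straightforward to sketch; the data below are simulated and the function is ours, not the paper's:

# Sketch of a percentile-bootstrap interval for per-format accuracy.
# The correctness vectors below are simulated; only the shape of the
# analysis (not any number) comes from the rebuttal's commitment.
import numpy as np


def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a 0/1 correctness vector."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    boot_means = correct[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)


rng = np.random.default_rng(1)
fmt_a = rng.random(500) < 0.81  # format A: roughly 81% accurate
fmt_b = rng.random(500) < 0.55  # format B: roughly 55% accurate
for name, vec in [("A", fmt_a), ("B", fmt_b)]:
    acc, (lo, hi) = bootstrap_accuracy_ci(vec)
    print(f"format {name}: acc={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")

Non-overlapping intervals for two formats would indicate that the observed gap exceeds per-format sampling noise, which is what the referee asks the revision to establish.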

Circularity Check

0 steps flagged

No circularity: results from direct empirical runs on held-out tasks

full rationale

The paper performs direct model evaluations on sampled prompt formats for standard benchmarks, measuring accuracy differences without any derivation chain, equations, or parameters that reduce to author-defined inputs by construction. FormatSpread is a sampling procedure whose outputs are observed performance intervals, not predictions fitted to the same data. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the core claims; the conclusions are grounded directly in external model runs rather than in the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard few-shot evaluation assumptions and the representativeness of the chosen formatting perturbations; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption: Standard few-shot in-context learning evaluation practices produce meaningful performance estimates
    The paper measures accuracy on tasks under few-shot prompting without questioning the validity of that paradigm itself.

pith-pipeline@v0.9.0 · 5578 in / 1199 out tokens · 60740 ms · 2026-05-17T01:55:08.702657+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  2. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

CUDABeaver shows that LLM-based CUDA debuggers often degrade code to pass tests at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  3. CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

    cs.CR 2026-05 unverdicted novelty 7.0

    LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

  4. The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.

  5. Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    cs.CL 2026-05 unverdicted novelty 7.0

    A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

  6. SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.

  7. Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

    cs.LG 2026-05 conditional novelty 6.0

    Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.

  8. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  9. Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs exhibit prompt-variant output-mode collapse, preserving requested bare-label formats in only about 22% of semantically equivalent prompt variants across tested models and tasks.

  10. Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.

  11. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  12. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  13. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  14. Collective AI can amplify tiny perturbations into divergent decisions

    cs.AI 2026-03 conditional novelty 6.0

    Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.

  15. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  16. Benchmarking Local Language Models for Social Robots using Edge Devices

    cs.RO 2026-05 unverdicted novelty 5.0

    Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.

  17. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  18. The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

    cs.CL 2026-04 accept novelty 5.0

    PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 17 Pith papers · 5 internal anchors

  1. [1]

    Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model)

    Armen Aghajanyan. Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model). All evaluation of LLM's are broken. Evaluating a task requires marginalizing across all prompts that describe the task, not point estimate of one. June 2023. URL https://twitter.com/ArmenAgha/status...

  2. [2]

    Falcon-40B : an open large language model with state-of-the-art performance

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. Falcon-40B : an open large language model with state-of-the-art performance. 2023

  3. [3]

    An empirical evaluation of thompson sampling

    Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011

  4. [5]

    Better hypothesis testing for statistical machine translation: Controlling for optimizer instability

    Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 176-181, 2011

  5. [7]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=dXiGWqBoxaD

  6. [8]

    Openprompt: An open-source framework for prompt-learning

    Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun. Openprompt: An open-source framework for prompt-learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 105-113, 2022

  7. [9]

    Measuring and improving consistency in pretrained language models

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012-1031, 2021

  8. [12]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  9. [14]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023

  10. [16]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423-438, 2020

  11. [18]

    Asymptotically efficient adaptive allocation rules

    Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4-22, 1985

  12. [19]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045-3059, 2021

  13. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74-81, 2004

  14. [22]

    What makes chain-of-thought prompting effective? a counterfactual study

    Aman Madaan, Katherine Hermann, and Amir Yazdanbakhsh. What makes chain-of-thought prompting effective? a counterfactual study. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1448-1535, 2023

  15. [23]

    Stereoset: Measuring stereotypical bias in pretrained language models

    Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356-5371, 2021

  16. [26]

    Learning how to ask: Querying lms with mixtures of soft prompts

    Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021

  17. [27]

    Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp

    Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. Transactions of the Association for Computational Linguistics, 9:1408-1424, 2021

  18. [28]

    Chatgpt: Optimizing language models for dialogue

    John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. Chatgpt: Optimizing language models for dialogue. OpenAI blog, 2022

  19. [35]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019

  20. [36]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=92gvk82DE-

  21. [39]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  22. [40]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  23. [41]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  24. [43]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, et al. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

  25. [44]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

  26. [46]

    Reframing Instructional Prompts to GPTk's Language

    Swaroop Mishra et al. Reframing Instructional Prompts to GPTk's Language. arXiv preprint arXiv:2109.07830, 2021

  27. [48]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. doi:10.18653/v1/2020.emnlp-main.346

  28. [49]

    RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/v1/2022.emnlp-main.222

  29. [50]

    GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

    Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023. doi:10.18653/v1/2023.eacl-main.277

  30. [51]

    Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too?

    Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too? arXiv preprint arXiv:2212.10539, 2022

  31. [53]

    Automatic Prompt Optimization with "Gradient Descent" and Beam Search

    Reid Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  32. [56]

    Tempera: Test-time Prompt Editing via Reinforcement Learning

    Tempera: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023

  33. [58]

    Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots

    Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv preprint, 2023

  34. [59]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  35. [60]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023

  36. [64]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    Jules White et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382, 2023

  37. [65]

    Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022. doi:10.18653/v1/2022.acl-long.556

  38. [66]

    Demystifying Prompts in Language Models via Perplexity Estimation

    Hila Gonen et al. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022

  39. [67]

    Making Pre-trained Language Models Better Few-shot Learners

    Tianyu Gao, Adam Fisch, and Danqi Chen. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021. doi:10.18653/v1/2021.acl-long.295

  40. [68]

    Instruction Induction: From Few Examples to Natural Language Task Descriptions

    Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. Instruction Induction: From Few Examples to Natural Language Task Descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.108

  41. [69]

    XGBoost: A Scalable Tree Boosting System

    Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. doi:10.1145/2939672.2939785

  42. [74]

    Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts

    Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

  43. [75]

    Surface Form Competition: Why the Highest Probability Answer Isn't Always Right

    Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface Form Competition: Why the Highest Probability Answer Isn't Always Right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

  44. [77]

    Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

    Riashat Islam et al. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017