pith. machine review for the scientific record.

arxiv: 2310.11324 · v2 · submitted 2023-10-17 · 💻 cs.CL · cs.AI · cs.LG

Recognition: no theorem link

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 01:55 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords prompt sensitivity · LLM evaluation · few-shot prompting · prompt formatting · model comparison

The pith

Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that popular large language models are highly sensitive to subtle, meaning-preserving changes in how few-shot prompts are formatted. This sensitivity produces large performance gaps, such as 76 accuracy points on LLaMA-2-13B, and persists even when models grow larger, receive more examples, or undergo instruction tuning. Different models also favor different formats, so comparing them with one arbitrarily chosen format lacks validity. To make evaluation more reliable, the authors introduce FormatSpread, an algorithm that samples many plausible formats and reports the resulting performance interval without needing model weights.

Core claim

Large language models exhibit extreme sensitivity to prompt formatting in few-shot settings, where minor changes that preserve meaning can alter performance by as much as 76 accuracy points on models like LLaMA-2-13B. This effect holds across increases in model size, number of shots, and even after instruction tuning. Analysis reveals weak correlation in format preferences across models, challenging the practice of evaluating and comparing models with a single fixed prompt format. The work introduces FormatSpread to efficiently sample and report performance ranges over plausible formats.

What carries the argument

FormatSpread, an algorithm that samples a set of plausible prompt formats for a task and computes the expected performance interval without requiring access to model weights.
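
A minimal sketch helps make the carrier concrete. The code below is illustrative rather than the authors' implementation: render_prompt, the perturbation lists, and evaluate_accuracy are invented stand-ins, and it enumerates a tiny format grammar exhaustively, whereas the paper's algorithm samples the much larger space efficiently (its citations to Thompson sampling suggest a bandit-style search).

# Minimal sketch of a FormatSpread-style interval computation. Everything
# here is illustrative: `evaluate_accuracy` stands in for any black-box
# routine that scores one format on a task sample, and the real algorithm
# samples the (much larger) format space instead of enumerating it.
from itertools import product

# Atomic, meaning-preserving formatting choices: separators, spacing, casing.
SEPARATORS = [": ", " - ", "\n"]
SPACES = ["", " "]
CASINGS = [str.lower, str.title, str.upper]


def render_prompt(field, value, sep, space, casing):
    """Render one field/value pair under one concrete format choice."""
    return f"{casing(field)}{sep}{space}{value}"


def format_spread(evaluate_accuracy):
    """Return the (min, max) accuracy interval over the sampled formats."""
    scores = []
    for sep, space, casing in product(SEPARATORS, SPACES, CASINGS):
        fmt = lambda f, v: render_prompt(f, v, sep, space, casing)
        scores.append(evaluate_accuracy(fmt))
    return min(scores), max(scores)


if __name__ == "__main__":
    import random

    # Stub evaluator; a real one would build few-shot prompts with `fmt`,
    # query the model, and score its predictions against gold labels.
    lo, hi = format_spread(lambda fmt: random.uniform(0.2, 0.9))
    print(f"accuracy interval [{lo:.2f}, {hi:.2f}], spread {hi - lo:.2f}")

The returned interval, not either endpoint alone, is the quantity FormatSpread reports.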

If this is right

  • Evaluations of LLMs should report performance ranges across multiple plausible prompt formats rather than single-point estimates.
  • Direct comparisons between models using one fixed prompt format are unreliable because format performance correlates only weakly across models.
  • Sensitivity to formatting persists even as models increase in size or receive instruction tuning.
  • Researchers must consider a wider space of prompt designs when assessing model capabilities on few-shot tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If format sensitivity reaches this level, many published benchmark scores may partly reflect the chosen formatting rather than intrinsic model ability.
  • Applying FormatSpread to additional tasks and models could reveal whether the issue is concentrated in particular domains or architectures.
  • One extension would be to check whether closed-source models accessed only via API show comparable sensitivity when FormatSpread is run through their interfaces.

Load-bearing premise

The tested formatting variations and the formats sampled by FormatSpread adequately represent the full space of meaning-preserving prompt designs that real users might employ.
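
How demanding that premise is depends on the size of the space being represented. As a back-of-envelope illustration (the slot names and option counts below are invented for this note, not taken from the paper), a handful of independent formatting slots already multiplies into hundreds of plausible formats:

# Back-of-envelope count of a plausible-format space. The slots and option
# counts are invented for illustration; they are not the paper's grammar.
from math import prod

slots = {
    "field-name casing": 3,   # lower / Title / UPPER
    "field separator": 4,     # ":", ": ", " - ", newline
    "item separator": 3,      # "\n", "\n\n", " || "
    "label verbalizer": 4,    # "0/1", "A/B", "yes/no", "true/false"
    "enumeration style": 3,   # none, "1.", "(a)"
}

n_formats = prod(slots.values())
print(f"{n_formats} distinct formats from {len(slots)} independent slots")
# -> 432 distinct formats from 5 independent slots

Sampling from a space this size is tractable; certifying that the sample is representative of what real users type is the part the premise takes on faith.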

What would settle it

Testing a much broader set of prompt formats on LLaMA-2-13B and finding performance variation substantially below 76 accuracy points would weaken the sensitivity claim.

read the original abstract

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points on LLaMA-2-13B. Sensitivity persists across model sizes, few-shot example counts, and instruction tuning. The authors argue that single-format reporting is insufficient and propose FormatSpread, an algorithm that samples plausible prompt formats via atomic perturbations (separators, verbalizers, ordering) to report performance intervals without model weight access. They further show weak cross-model format correlations, questioning fixed-format model comparisons, and analyze the influence of specific perturbations on internal representations.

Significance. If the empirical results hold after addressing validation gaps, this work would meaningfully advance LLM evaluation methodology by demonstrating that prompt formatting can dominate reported accuracies and by providing a practical, model-agnostic tool (FormatSpread) for quantifying variability. It supplies concrete numbers across multiple models and regimes, supporting calls for range reporting over single-point estimates and highlighting risks in current benchmarking practices.

major comments (2)
  1. [§3, FormatSpread description] The central claim that sampled formats are both plausible and meaning-preserving rests on the unvalidated assumption that combinations of atomic changes (e.g., bullet styles, newlines, label verbalizers) produce a representative distribution of user-like prompts. No human semantic-equivalence ratings, usage-log comparison, or external anchor is reported, which directly affects whether the 76-point spread on LLaMA-2-13B should be interpreted as genuine sensitivity or as an artifact of implausible tokenization/attention shifts.
  2. [Experimental setup and results] The headline sensitivity numbers and persistence claims lack reported details on data splits, number of evaluation runs per format, or statistical tests for the performance intervals. Without these, it is difficult to gauge whether the observed ranges are robust or sensitive to sampling variance in the FormatSpread procedure.
minor comments (2)
  1. [Abstract] The maximum 76-point difference is stated without naming the specific task or dataset; adding this detail would improve immediate interpretability.
  2. [Analysis of atomic perturbations] The discussion of which formatting elements drive the largest changes could be strengthened by reporting effect sizes or ablation tables rather than qualitative observations alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below. Where the manuscript is missing details or justification, we will revise accordingly.

read point-by-point responses
  1. Referee: [§3, FormatSpread description] The central claim that sampled formats are both plausible and meaning-preserving rests on the unvalidated assumption that combinations of atomic changes (e.g., bullet styles, newlines, label verbalizers) produce a representative distribution of user-like prompts. No human semantic-equivalence ratings, usage-log comparison, or external anchor is reported, which directly affects whether the 76-point spread on LLaMA-2-13B should be interpreted as genuine sensitivity or as an artifact of implausible tokenization/attention shifts.

    Authors: The atomic perturbations are restricted to formatting elements that do not alter the underlying task semantics or example content; each variant remains a valid few-shot prompt for the same classification task. While we did not conduct new human equivalence ratings, the perturbation types are drawn from variations routinely used in the prompting literature and in public model cards. The observed spreads therefore demonstrate sensitivity to these standard formatting choices rather than to arbitrary or nonsensical strings. We will add an explicit discussion of the design rationale for the perturbation set and a note on its relation to documented prompting practices. revision: partial

  2. Referee: [Experimental setup and results] The headline sensitivity numbers and persistence claims lack reported details on data splits, number of evaluation runs per format, or statistical tests for the performance intervals. Without these, it is difficult to gauge whether the observed ranges are robust or sensitive to sampling variance in the FormatSpread procedure.

    Authors: We will expand the experimental sections to specify the exact train/test splits, the number of independent evaluation runs performed per format, and the statistical measures (e.g., standard deviation or bootstrap intervals) used to characterize the reported performance ranges. These additions will allow readers to assess sampling variance directly. revision: yes
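
A percentile bootstrap of the kind promised here is straightforward to sketch; the data below are simulated and the function is ours, not the paper's:

# Sketch of a percentile-bootstrap interval for per-format accuracy.
# The correctness vectors below are simulated; only the shape of the
# analysis (not any number) comes from the rebuttal's commitment.
import numpy as np


def bootstrap_accuracy_ci(correct, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a 0/1 correctness vector."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    boot_means = correct[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)


rng = np.random.default_rng(1)
fmt_a = rng.random(500) < 0.81  # format A: roughly 81% accurate
fmt_b = rng.random(500) < 0.55  # format B: roughly 55% accurate
for name, vec in [("A", fmt_a), ("B", fmt_b)]:
    acc, (lo, hi) = bootstrap_accuracy_ci(vec)
    print(f"format {name}: acc={acc:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")

Non-overlapping intervals for two formats would indicate that the observed gap exceeds per-format sampling noise, which is what the referee asks the revision to establish.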

Circularity Check

0 steps flagged

No circularity: results from direct empirical runs on held-out tasks

full rationale

The paper performs direct model evaluations on sampled prompt formats for standard benchmarks, measuring accuracy differences without any derivation chain, equations, or parameters that reduce to author-defined inputs by construction. FormatSpread is a sampling procedure whose outputs are observed performance intervals, not predictions fitted to the same data. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the core claims; the conclusions are grounded directly in external model runs rather than in the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard few-shot evaluation assumptions and the representativeness of the chosen formatting perturbations; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • domain assumption: Standard few-shot in-context learning evaluation practices produce meaningful performance estimates
    The paper measures accuracy on tasks under few-shot prompting without questioning the validity of that paradigm itself.

pith-pipeline@v0.9.0 · 5578 in / 1199 out tokens · 60740 ms · 2026-05-17T01:55:08.702657+00:00 · methodology

discussion (0)


Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

    cs.LG 2026-04 accept novelty 8.0

    Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

  2. CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

    cs.LG 2026-05 unverdicted novelty 7.0

CUDABeaver shows that LLM-based CUDA debuggers often degrade code to pass tests at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

  3. CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

    cs.CR 2026-05 unverdicted novelty 7.0

    LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

  4. The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

    cs.CL 2026-05 unverdicted novelty 7.0

    LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.

  5. Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

    cs.CL 2026-05 unverdicted novelty 7.0

    A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

  6. SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.

  7. Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

    cs.LG 2026-05 conditional novelty 6.0

    Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.

  8. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

    cs.AI 2026-05 unverdicted novelty 6.0

    LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

  9. Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs exhibit prompt-variant output-mode collapse, preserving requested bare-label formats in only about 22% of semantically equivalent prompt variants across tested models and tasks.

  10. Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.

  11. What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

  12. Compared to What? Baselines and Metrics for Counterfactual Prompting

    cs.CL 2026-05 conditional novelty 6.0

    Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...

  13. Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

    cs.AI 2026-04 unverdicted novelty 6.0

    POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

  14. Collective AI can amplify tiny perturbations into divergent decisions

    cs.AI 2026-03 conditional novelty 6.0

    Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.

  15. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  16. Benchmarking Local Language Models for Social Robots using Edge Devices

    cs.RO 2026-05 unverdicted novelty 5.0

    Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.

  17. The Cartesian Cut in Agentic AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

  18. The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

    cs.CL 2026-04 accept novelty 5.0

    PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 17 Pith papers · 5 internal anchors

  1. [1]

    Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model)

    Armen Aghajanyan. Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model). All evaluation of LLM's are broken. Evaluating a task requires marginalizing across all prompts that describe the task, not point estimate of one. June 2023. URL https://twitter.com/ArmenAgha/status...

  2. [2]

    Falcon-40B : an open large language model with state-of-the-art performance

    Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. Falcon-40B : an open large language model with state-of-the-art performance. 2023

  3. [3]

    An empirical evaluation of thompson sampling

    Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. Advances in neural information processing systems, 24, 2011

  4. [5]

    Better hypothesis testing for statistical machine translation: Controlling for optimizer instability

    Jonathan H Clark, Chris Dyer, Alon Lavie, and Noah A Smith. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 176-181, 2011

  5. [7]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=dXiGWqBoxaD

  6. [8]

    Openprompt: An open-source framework for prompt-learning

    Ning Ding, Shengding Hu, Weilin Zhao, Yulin Chen, Zhiyuan Liu, Haitao Zheng, and Maosong Sun. Openprompt: An open-source framework for prompt-learning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 105-113, 2022

  7. [9]

    Measuring and improving consistency in pretrained language models

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012-1031, 2021

  8. [12]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018

  9. [14]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. Editing models with task arithmetic. In International Conference on Learning Representations, 2023

  10. [16]

    How can we know what language models know?

    Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423-438, 2020

  11. [18]

    Asymptotically efficient adaptive allocation rules

    Tze Leung Lai, Herbert Robbins, et al. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4-22, 1985

  12. [19]

    The power of scale for parameter-efficient prompt tuning

    Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045-3059, 2021

  13. [20]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pp. 74-81, 2004

  14. [22]

    What makes chain-of-thought prompting effective? a counterfactual study

    Aman Madaan, Katherine Hermann, and Amir Yazdanbakhsh. What makes chain-of-thought prompting effective? a counterfactual study. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 1448-1535, 2023

  15. [23]

    Stereoset: Measuring stereotypical bias in pretrained language models

    Moin Nadeem, Anna Bethke, and Siva Reddy. Stereoset: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356-5371, 2021

  16. [26]

    Learning how to ask: Querying lms with mixtures of soft prompts

    Guanghui Qin and Jason Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2021

  17. [27]

    Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp

    Timo Schick, Sahana Udupa, and Hinrich Schütze. Self-diagnosis and self-debiasing: A proposal for reducing corpus-based bias in nlp. Transactions of the Association for Computational Linguistics, 9:1408-1424, 2021

  18. [28]

    Chatgpt: Optimizing language models for dialogue

    John Schulman, Barret Zoph, Christina Kim, Jacob Hilton, Jacob Menick, Jiayi Weng, Juan Felipe Ceron Uribe, Liam Fedus, Luke Metz, Michael Pokorny, et al. Chatgpt: Optimizing language models for dialogue. OpenAI blog, 2022

  19. [35]

    Bertscore: Evaluating text generation with bert

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019

  20. [36]

    Large language models are human-level prompt engineers

    Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=92gvk82DE-

  21. [39]

    Scaling Learning Algorithms Towards AI

    Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines. MIT Press, 2007

  22. [40]

    A Fast Learning Algorithm for Deep Belief Nets

    Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006

  23. [41]

    Deep Learning

    Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016

  24. [43]

    Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

    Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, et al. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022

  25. [44]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023

  26. [46]

    Reframing Instructional Prompts to GPTk's Language

    Swaroop Mishra et al. Reframing Instructional Prompts to GPTk's Language. arXiv preprint arXiv:2109.07830, 2021

  27. [48]

    AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts

    Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. doi:10.18653/v1/2020.emnlp-main.346

  28. [49]

    RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning

    Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yihan Wang, Han Guo, Tianmin Shu, Meng Song, Eric Xing, and Zhiting Hu. RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. doi:10.18653/v1/2022.emnlp-main.222

  29. [50]

    GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

    Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023. doi:10.18653/v1/2023.eacl-main.277

  30. [51]

    Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too?

    Toward Human Readable Prompt Tuning: Kubrick's The Shining is a good movie, and a good prompt too? arXiv preprint arXiv:2212.10539, 2022

  31. [53]

    Automatic Prompt Optimization with "Gradient Descent" and Beam Search

    Reid Pryzant et al. Automatic prompt optimization with "gradient descent" and beam search. arXiv preprint arXiv:2305.03495, 2023

  32. [56]

    Tempera: Test-time Prompt Editing via Reinforcement Learning

    Tempera: Test-time prompt editing via reinforcement learning. In The Eleventh International Conference on Learning Representations, 2023

  33. [58]

    Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots

    Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv preprint, 2023

  34. [59]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023

  35. [60]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023

  36. [64]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    Jules White et al. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv preprint arXiv:2302.11382, 2023

  37. [65]

    Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022. doi:10.18653/v1/2022.acl-long.556

  38. [66]

    Demystifying Prompts in Language Models via Perplexity Estimation

    Hila Gonen et al. Demystifying prompts in language models via perplexity estimation. arXiv preprint arXiv:2212.04037, 2022

  39. [67]

    Making Pre-trained Language Models Better Few-shot Learners

    Tianyu Gao, Adam Fisch, and Danqi Chen. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021. doi:10.18653/v1/2021.acl-long.295

  40. [68]

    Instruction Induction: From Few Examples to Natural Language Task Descriptions

    Or Honovich, Uri Shaham, Samuel R. Bowman, and Omer Levy. Instruction Induction: From Few Examples to Natural Language Task Descriptions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023. doi:10.18653/v1/2023.acl-long.108

  41. [69]

    XGBoost: A Scalable Tree Boosting System

    Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016. doi:10.1145/2939672.2939785

  42. [74]

    Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts

    Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, and Yejin Choi. Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

  43. [75]

    Surface Form Competition: Why the Highest Probability Answer Isn't Always Right

    Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. Surface Form Competition: Why the Highest Probability Answer Isn't Always Right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021

  44. [77]

    Reproducibility of Benchmarked Deep Reinforcement Learning Tasks for Continuous Control

    Riashat Islam et al. Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. arXiv preprint arXiv:1708.04133, 2017