{"paper":{"title":"Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alane Suhr, Melanie Sclar, Yejin Choi, Yulia Tsvetkov","submitted_at":"2023-10-17T15:03:30Z","abstract_excerpt":"As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with per"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that the set of tested formatting variations and the sampled formats in FormatSpread adequately represent the space of plausible, meaning-preserving prompt designs that users might actually employ","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"948f1d9368558848e13ae92ff5dcd1a6b9d2341d8f63f979dcac6b0c6fdd0938"},"source":{"id":"2310.11324","kind":"arxiv","version":2},"verdict":{"id":"1d4400c5-122c-4fb2-8678-3750c8d7854b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T01:55:08.702657Z","strongest_claim":"several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B","one_line_summary":"LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that the set of tested formatting variations and the sampled formats in FormatSpread adequately represent the space of plausible, meaning-preserving prompt designs that users might actually employ","pith_extraction_headline":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences."},"references":{"count":64,"sample":[{"doi":"","year":2023,"title":"Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model)","work_id":"a98ef876-d539-4161-86fc-12ebf45c62ee","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Falcon-40B : an open large language model with state-of-the-art performance","work_id":"f87a57c8-47ba-4385-84be-df8a6ff868e2","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"An empirical evaluation of thompson sampling","work_id":"5267b456-570f-4cd0-ba11-51f07bd9cbae","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"Better hypothesis testing for statistical machine translation: Controlling for optimizer instability","work_id":"7574558d-bc86-480f-bb5e-f9809c19b113","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"GPT 3.int8(): 8-bit matrix multiplication for transformers at scale","work_id":"764578ef-94c9-44e1-88fc-c200bccc1f14","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"1736fafb21c26ad293174330c9589146c2ecb9efcdb93adf40d068b3b5275232","internal_anchors":5},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}