{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:2FLK2MWJLATJAVAZTWKFIVMEEF","short_pith_number":"pith:2FLK2MWJ","schema_version":"1.0","canonical_sha256":"d156ad32c958269054199d945455842140bc05d2fc21f7eb865c67bde5b35e2a","source":{"kind":"arxiv","id":"2310.11324","version":2},"attestation_state":"computed","paper":{"title":"Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alane Suhr, Melanie Sclar, Yejin Choi, Yulia Tsvetkov","submitted_at":"2023-10-17T15:03:30Z","abstract_excerpt":"As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with per"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2310.11324","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-10-17T15:03:30Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"36118606ae417a37c6d02143c21547b584d75bfdb5e7343bb753ba88f911ecd5","abstract_canon_sha256":"1deda2aad4398016853ac581c2d4b4c56736fa4608e0b35ff62f6f41987e0bb3"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:45.902487Z","signature_b64":"7kuMLJ/+HYeDCE5dX/ksfaCwg2Kr1hh86dl7MnA8gn3eGR1OPkZA3NBiYjnP1gdlv6DtaZKybkv++m5SNDZgCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d156ad32c958269054199d945455842140bc05d2fc21f7eb865c67bde5b35e2a","last_reissued_at":"2026-05-17T23:38:45.901734Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:45.901734Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Alane Suhr, Melanie Sclar, Yejin Choi, Yulia Tsvetkov","submitted_at":"2023-10-17T15:03:30Z","abstract_excerpt":"As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with per"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"that the set of tested formatting variations and the sampled formats in FormatSpread adequately represent the space of plausible, meaning-preserving prompt designs that users might actually employ","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"948f1d9368558848e13ae92ff5dcd1a6b9d2341d8f63f979dcac6b0c6fdd0938"},"source":{"id":"2310.11324","kind":"arxiv","version":2},"verdict":{"id":"1d4400c5-122c-4fb2-8678-3750c8d7854b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T01:55:08.702657Z","strongest_claim":"several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B","one_line_summary":"LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without model weights.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"that the set of tested formatting variations and the sampled formats in FormatSpread adequately represent the space of plausible, meaning-preserving prompt designs that users might actually employ","pith_extraction_headline":"Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences."},"references":{"count":64,"sample":[{"doi":"","year":2023,"title":"Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model)","work_id":"a98ef876-d539-4161-86fc-12ebf45c62ee","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Falcon-40B : an open large language model with state-of-the-art performance","work_id":"f87a57c8-47ba-4385-84be-df8a6ff868e2","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"An empirical evaluation of thompson sampling","work_id":"5267b456-570f-4cd0-ba11-51f07bd9cbae","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2011,"title":"Better hypothesis testing for statistical machine translation: Controlling for optimizer instability","work_id":"7574558d-bc86-480f-bb5e-f9809c19b113","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"GPT 3.int8(): 8-bit matrix multiplication for transformers at scale","work_id":"764578ef-94c9-44e1-88fc-c200bccc1f14","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":64,"snapshot_sha256":"1736fafb21c26ad293174330c9589146c2ecb9efcdb93adf40d068b3b5275232","internal_anchors":5},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.11324","created_at":"2026-05-17T23:38:45.901821+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.11324v2","created_at":"2026-05-17T23:38:45.901821+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.11324","created_at":"2026-05-17T23:38:45.901821+00:00"},{"alias_kind":"pith_short_12","alias_value":"2FLK2MWJLATJ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"2FLK2MWJLATJAVAZ","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"2FLK2MWJ","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2502.16761","citing_title":"Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.19590","citing_title":"Position: AI Evaluations Should be Grounded on a Theory of Capability","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04309","citing_title":"Activation Steering with a Feedback Controller","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2603.22161","citing_title":"Causal Evidence that Language Models use Confidence to Drive Behavior","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20994","citing_title":"Towards Context-Invariant Safety Alignment for Large Language Models","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18890","citing_title":"Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18194","citing_title":"Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2507.14913","citing_title":"PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2405.14782","citing_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","ref_index":104,"is_internal_anchor":true},{"citing_arxiv_id":"2603.09127","citing_title":"Collective AI can amplify tiny perturbations into divergent decisions","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02608","citing_title":"Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14197","citing_title":"The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03111","citing_title":"Benchmarking Local Language Models for Social Robots using Edge Devices","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04665","citing_title":"Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08455","citing_title":"CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06161","citing_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06656","citing_title":"Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML","ref_index":293,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04665","citing_title":"Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11328","citing_title":"Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07745","citing_title":"The Cartesian Cut in Agentic AI","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07830","citing_title":"CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07186","citing_title":"The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14672","citing_title":"SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01048","citing_title":"Compared to What? Baselines and Metrics for Counterfactual Prompting","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02038","citing_title":"What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models","ref_index":8,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF","json":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF.json","graph_json":"https://pith.science/api/pith-number/2FLK2MWJLATJAVAZTWKFIVMEEF/graph.json","events_json":"https://pith.science/api/pith-number/2FLK2MWJLATJAVAZTWKFIVMEEF/events.json","paper":"https://pith.science/paper/2FLK2MWJ"},"agent_actions":{"view_html":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF","download_json":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF.json","view_paper":"https://pith.science/paper/2FLK2MWJ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.11324&json=true","fetch_graph":"https://pith.science/api/pith-number/2FLK2MWJLATJAVAZTWKFIVMEEF/graph.json","fetch_events":"https://pith.science/api/pith-number/2FLK2MWJLATJAVAZTWKFIVMEEF/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF/action/timestamp_anchor","attest_storage":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF/action/storage_attestation","attest_author":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF/action/author_attestation","sign_citation":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF/action/citation_signature","submit_replication":"https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF/action/replication_record"}},"created_at":"2026-05-17T23:38:45.901821+00:00","updated_at":"2026-05-17T23:38:45.901821+00:00"}