Pith Number

pith:2FLK2MWJ

pith:2023:2FLK2MWJLATJAVAZTWKFIVMEEF
not attested · not anchored · not stored · refs resolved

Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Alane Suhr, Melanie Sclar, Yejin Choi, Yulia Tsvetkov

Several open-source LLMs vary in accuracy by up to 76 points on the same few-shot task due to minor prompt formatting differences.

arxiv:2310.11324 v2 · 2023-10-17 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{2FLK2MWJLATJAVAZTWKFIVMEEF}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
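The page does not specify the merge algorithm. As a generic illustration of why a deterministic merge lets independent mirrors converge on the same state, the sketch below deduplicates events by content hash, sorts them by a fixed key, and folds them in that order. The event fields `at` and `set` and the function names are hypothetical and are not taken from the Pith bundle format.

```python
import hashlib
import json


def event_digest(event: dict) -> str:
    """Content hash of one event (hypothetical event shape)."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge_events(events: list) -> dict:
    """Fold events into a state that does not depend on arrival order."""
    # Deduplicate: the same event received twice has the same digest.
    unique = {event_digest(e): e for e in events}
    # Sort by (timestamp, digest) so every mirror applies the same order.
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {}
    for _, event in ordered:
        state.update(event["set"])  # later events overwrite earlier ones
    return state
```

Because the order is derived only from the events themselves, two mirrors that hold the same set of events compute the same state regardless of the order in which they received them.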

Claims

C1 strongest claim

Several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B.

C2 weakest assumption

The tested formatting variations and the formats sampled by FormatSpread adequately represent the space of plausible, meaning-preserving prompt designs that users might actually employ.

C3 one-line summary

LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread evaluates a sample of plausible formats and reports the resulting performance interval without requiring access to model weights.
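To make the notion of a performance interval concrete, here is a minimal sketch that reports the lowest accuracy, the highest accuracy, and their difference (the spread) over a set of prompt formats. It does not reproduce FormatSpread's own sampling procedure. The function name, the format strings, and the accuracy values are illustrative assumptions, not results from the paper.

```python
def performance_spread(accuracy_by_format: dict) -> tuple:
    """Return (lowest, highest, spread) over per-format accuracies."""
    if not accuracy_by_format:
        raise ValueError("need at least one evaluated format")
    values = list(accuracy_by_format.values())
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest


# Illustrative numbers only: three meaning-preserving variants of one prompt.
example = {
    "Passage: {}\nAnswer: {}": 0.75,
    "passage: {}  answer: {}": 0.50,
    "PASSAGE {} ANSWER {}": 0.25,
}
lowest, highest, spread = performance_spread(example)
print(f"interval [{lowest:.2f}, {highest:.2f}], spread {spread:.2f}")
```

Reporting the interval rather than a single number makes it visible when a reported accuracy depends heavily on the chosen format.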

References

64 extracted · 64 resolved · 5 Pith anchors

[1] Tweet: Susan & I found MMLU performance jump 6-10 points in the 40s by formatting multiple choice as (A) not A in MMLU (for internal model) 2023
[2] Falcon-40B: an open large language model with state-of-the-art performance 2023
[3] An empirical evaluation of Thompson sampling 2011
[5] Better hypothesis testing for statistical machine translation: Controlling for optimizer instability 2011
[7] GPT3.int8(): 8-bit matrix multiplication for transformers at scale 2022

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:45.901734Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

d156ad32c958269054199d945455842140bc05d2fc21f7eb865c67bde5b35e2a

Aliases

arxiv: 2310.11324 · arxiv_version: 2310.11324v2 · doi: 10.48550/arxiv.2310.11324 · pith_short_12: 2FLK2MWJLATJ · pith_short_16: 2FLK2MWJLATJAVAZ · pith_short_8: 2FLK2MWJ
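The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the first 26 characters of the Base32 encoding of the canonical hash, and each `pith_short_N` alias is the first N characters of the Pith Number. This is an observation from the values shown here, not a documented rule, and the function names below are hypothetical.

```python
import base64

CANONICAL_HASH = "d156ad32c958269054199d945455842140bc05d2fc21f7eb865c67bde5b35e2a"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_alias(pith_number: str, n: int) -> str:
    """Short aliases on this page are plain prefixes of the full number."""
    return pith_number[:n]


number = pith_number_from_hash(CANONICAL_HASH)
print(number)                  # full identifier
print(short_alias(number, 8))  # compare with pith_short_8 above
```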
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/2FLK2MWJLATJAVAZTWKFIVMEEF \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: d156ad32c958269054199d945455842140bc05d2fc21f7eb865c67bde5b35e2a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1deda2aad4398016853ac581c2d4b4c56736fa4608e0b35ff62f6f41987e0bb3",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-10-17T15:03:30Z",
    "title_canon_sha256": "36118606ae417a37c6d02143c21547b584d75bfdb5e7343bb753ba88f911ecd5"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2310.11324",
    "kind": "arxiv",
    "version": 2
  }
}
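The verification command above can also be written as a small Python function for use inside a script. This sketch applies the same serialization as the one-liner (sorted keys, compact separators, non-ASCII characters left unescaped, UTF-8 bytes) and hashes the result with SHA-256. The function names are hypothetical, and fetching the record over HTTP is left out so the example stays self-contained.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record using the same canonical JSON form as the one-liner."""
    blob = json.dumps(
        record,
        sort_keys=True,         # key order must not affect the hash
        separators=(",", ":"),  # no whitespace between tokens
        ensure_ascii=False,     # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def matches(record: dict, expected_hex: str) -> bool:
    """Compare a recomputed hash with a published canonical hash."""
    return canonical_sha256(record) == expected_hex.lower()
```

To check this page, load the canonical record JSON into a dict (for example with `json.loads`) and pass it to `matches` together with the canonical hash listed above.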