Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr · 2023 · cs.CL · arXiv 2310.11324

Canonical reference. 100% of citing Pith papers cite this work as background.

41 Pith papers citing it

Background 100% of classified citations

abstract

As large language models (LLMs) are adopted as a fundamental component of language technologies, it is crucial to accurately characterize their performance. Because choices in prompt design can strongly influence model behavior, this design process is critical in effectively using any modern pre-trained generative language model. In this work, we focus on LLM sensitivity to a quintessential class of meaning-preserving design choices: prompt formatting. We find that several widely used open-source LLMs are extremely sensitive to subtle changes in prompt formatting in few-shot settings, with performance differences of up to 76 accuracy points when evaluated using LLaMA-2-13B. Sensitivity remains even when increasing model size, the number of few-shot examples, or performing instruction tuning. Our analysis suggests that work evaluating LLMs with prompting-based methods would benefit from reporting a range of performance across plausible prompt formats, instead of the currently-standard practice of reporting performance on a single format. We also show that format performance only weakly correlates between models, which puts into question the methodological validity of comparing models with an arbitrarily chosen, fixed prompt format. To facilitate systematic analysis we propose FormatSpread, an algorithm that rapidly evaluates a sampled set of plausible prompt formats for a given task, and reports the interval of expected performance without accessing model weights. Furthermore, we present a suite of analyses that characterize the nature of this sensitivity, including exploring the influence of particular atomic perturbations and the internal representation of particular formats.
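The abstract's recommendation to report a range of performance across plausible prompt formats can be illustrated with a small sketch. This is not the FormatSpread algorithm itself: the abstract does not say how formats are sampled or how the evaluation budget is spent, so the sketch simply scores every sampled format and reports the minimum, maximum, and their difference. The format space (`SEPARATORS`, `CASINGS`, `SPACINGS`), the `render` helper, and the `score_format` callback are all hypothetical names introduced for illustration.

```python
import itertools
import random
from typing import Callable, Sequence, Tuple

# Hypothetical format space; these choices are illustrative, not the paper's grammar.
SEPARATORS = [": ", " - ", ":\n", "\t"]
CASINGS = ["lower", "upper", "title"]  # names of str methods applied to field labels
SPACINGS = ["\n", " ", "\n\n"]         # text placed between the two fields

Format = Tuple[str, str, str]


def render(fmt: Format, question: str, answer: str = "") -> str:
    """Render one question/answer pair under a given (separator, casing, spacing)."""
    sep, casing, spacing = fmt
    q_label = getattr("question", casing)()
    a_label = getattr("answer", casing)()
    return f"{q_label}{sep}{question}{spacing}{a_label}{sep}{answer}"


def performance_spread(
    score_format: Callable[[Format], float],
    formats: Sequence[Format],
    n_samples: int,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Score a random sample of formats and return (min, max, max - min).

    `score_format` is an assumed user-supplied function that runs the task
    under one format and returns a score such as accuracy. It only needs
    model outputs, not model weights.
    """
    rng = random.Random(seed)
    sampled = rng.sample(list(formats), k=min(n_samples, len(formats)))
    scores = [score_format(fmt) for fmt in sampled]
    return min(scores), max(scores), max(scores) - min(scores)


# Enumerate the toy format space: 4 separators x 3 casings x 3 spacings.
ALL_FORMATS = list(itertools.product(SEPARATORS, CASINGS, SPACINGS))
```

In use, `score_format` would build few-shot prompts with `render`, query the model, and return accuracy. Reporting the `(min, max)` pair instead of one number is the practice the abstract argues for.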

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

Falsification, Not Exposure: An Internally Preregistered Placebo-Controlled Decomposition of Self-Repair Feedback in Frozen Small Code Models

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

Preregistered placebo-controlled decomposition shows external executable counterexamples drive self-repair gains in small code models more than re-exposure or self-critique.

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

AIP: A Graph Representation for Learning and Governing Agent Skills

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

AIP models skills as graphs of discrete steps connected by typed I/O edges under a validated schema, raising agent mean reward from 0.60 to 0.71 and pass rate from 53% to 67% on 27 SkillsBench tasks while enabling node-level fixes.

On the impact of retrieved content representations in RAG Pipelines

cs.IR · 2026-05-29 · unverdicted · novelty 7.0

A controlled comparison of document representations in RAG finds answer retention to be the dominant factor in generator accuracy across four LLMs.

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.

Activation Steering with a Feedback Controller

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering framework is proposed that improves robustness and outperforms baselines in experiments across model families.

Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.

Consistency Training Can Entrench Misalignment

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

cs.CL · 2026-05-28 · unverdicted · novelty 6.0

Domain specialization does not consistently improve clinical LLM robustness to meaning-preserving prompt variations, as shown by new sensitivity metrics on DiagnosisQA and MedQA.

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

Persona prefixes reduce brand recommendation Jaccard similarity by 0.12-0.20, with mid-market brands swapping up to 75% of recommendations while category leaders remain ~80% consistent across OpenAI and Anthropic models.

SafetyRepro: Configuration-Conditional Rank Instability on Alignment Benchmarks

cs.LG · 2026-05-25 · unverdicted · novelty 6.0

Configuration choices alone flip pairwise safety verdicts on every tested alignment benchmark, isolated via a finite-envelope proposition linking disagreement rate to strict ordering reversal.

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

cs.IR · 2026-05-22 · unverdicted · novelty 6.0

Paraphrase Jaccard similarity of 0.135-0.288 falls below the 0.50-0.61 same-prompt rerun baseline on OpenAI and Anthropic models, showing prompt wording dominates buyer intent in commercial recommendations.

Towards Context-Invariant Safety Alignment for Large Language Models

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

physics.soc-ph · 2026-05-17 · accept · novelty 6.0

Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

CUDABEAVER benchmark and pass@k(M,C,A) metric show LLM CUDA debugging success drops by up to 40 percentage points under strict performance requirements.

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

cs.LG · 2026-05-07 · conditional · novelty 6.0

Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

cs.CL · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.

What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

citing papers explorer

Self-Harness: Harnesses That Improve Themselves

cs.CL · 2026-06-08 · unverdicted · none · ref 22 · internal anchor

Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

cs.CL · 2026-05-08 · unverdicted · none · ref 14 · internal anchor

LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · none · ref 15 · internal anchor

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

cs.CL · 2026-04-16 · unverdicted · none · ref 53 · internal anchor

SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.

Persona Non Grata: LLM Persona-Driven Generations in MCQA are Unstable in Distinct Dimensions

cs.CL · 2026-07-01 · unverdicted · none · ref 43 · internal anchor

Persona-driven generations by LLMs in MCQA tasks exhibit instability that differs systematically by model family, size, domain, and prompt format.

Consistency Training Can Entrench Misalignment

cs.CL · 2026-06-02 · unverdicted · none · ref 49 · internal anchor

Consistency training suppresses reward hacking and emergent misalignment but amplifies sycophancy in controlled model organisms, driven by labeling-induced distribution shifts rather than selection operators.

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

cs.CL · 2026-05-28 · unverdicted · none · ref 24 · internal anchor

Domain specialization does not consistently improve clinical LLM robustness to meaning-preserving prompt variations, as shown by new sensitivity metrics on DiagnosisQA and MedQA.

Towards Context-Invariant Safety Alignment for Large Language Models

cs.CL · 2026-05-20 · unverdicted · none · ref 46 · internal anchor

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

cs.CL · 2026-05-06 · unverdicted · none · ref 11 · 2 links · internal anchor

LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.

What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

cs.CL · 2026-05-03 · unverdicted · none · ref 8 · internal anchor

Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation

cs.CL · 2025-07-20 · unverdicted · none · ref 25 · internal anchor

PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.

Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions

cs.CL · 2025-02-24 · unverdicted · none · ref 8 · internal anchor

Fine-tuning LLMs on the SubPOP dataset of 3,362 questions and 70K pairs reduces the gap between LLM predictions and human survey responses by up to 46% and generalizes to unseen surveys and subpopulations.

Towards Physical Intuitions for Alignment Dynamics: A Case Study With Randomness Crystallization

cs.CL · 2026-06-29 · unverdicted · none · ref 6 · internal anchor

Proposes a three-phase crystallization model (liquid, nucleation via SFT, settling via RL) for alignment dynamics using random number generation tasks as case study.

Chain-based Adaptive Reconfiguration Over Lattices for Hallucination Reduction

cs.CL · 2026-05-26 · unverdicted · none · ref 2 · internal anchor

CAROL unifies hallucination detection and mitigation by defining semantic uncertainty over a lattice of sequences and casting mitigation as a Markov chain process with claimed convergence guarantees.