Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, Alane Suhr · 2023 · arXiv 2310.11324

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

CUDABeaver shows LLM CUDA debuggers often degenerate code for test-passing at the cost of speed, with protocol-aware metrics shifting success rates by up to 40 percentage points.

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.

The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.

Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML

cs.LG · 2026-05-07 · conditional · novelty 6.0

Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.

Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.

Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs

cs.CL · 2026-05-06 · unverdicted · novelty 6.0 · 2 refs

LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.

What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models

cs.CL · 2026-05-03 · unverdicted · novelty 6.0

Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

Collective AI can amplify tiny perturbations into divergent decisions

cs.AI · 2026-03-10 · conditional · novelty 6.0

Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.

Lessons from the Trenches on Reproducible Evaluation of Language Models

cs.CL · 2024-05-23 · accept · novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

Benchmarking Local Language Models for Social Robots using Edge Devices

cs.RO · 2026-05-04 · unverdicted · novelty 5.0

Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.

The Cartesian Cut in Agentic AI

cs.AI · 2026-04-09 · unverdicted · novelty 5.0

LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.

The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

cs.CL · 2026-04-03 · accept · novelty 5.0

PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.

citing papers explorer

Showing 17 of 17 citing papers.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens · cs.LG · 2026-04-03 · accept · none · ref 23 · internal anchor
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging · cs.LG · 2026-05-08 · unverdicted · none · ref 25 · internal anchor
CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios · cs.CR · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval · cs.CL · 2026-05-08 · unverdicted · none · ref 14 · internal anchor
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity · cs.CL · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models · cs.CL · 2026-04-16 · unverdicted · none · ref 53 · internal anchor
Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML · cs.LG · 2026-05-07 · conditional · none · ref 293 · internal anchor
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges · cs.AI · 2026-05-07 · unverdicted · none · ref 38 · internal anchor
Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs · cs.CL · 2026-05-06 · unverdicted · none · ref 11 · 2 links · internal anchor
What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models · cs.CL · 2026-05-03 · unverdicted · none · ref 8 · internal anchor
Compared to What? Baselines and Metrics for Counterfactual Prompting · cs.CL · 2026-05-01 · conditional · none · ref 54 · internal anchor
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees · cs.AI · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
Collective AI can amplify tiny perturbations into divergent decisions · cs.AI · 2026-03-10 · conditional · none · ref 6 · internal anchor
Lessons from the Trenches on Reproducible Evaluation of Language Models · cs.CL · 2024-05-23 · accept · none · ref 104 · internal anchor
Benchmarking Local Language Models for Social Robots using Edge Devices · cs.RO · 2026-05-04 · unverdicted · none · ref 19 · internal anchor
The Cartesian Cut in Agentic AI · cs.AI · 2026-04-09 · unverdicted · none · ref 58 · internal anchor
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure · cs.CL · 2026-04-03 · accept · none · ref 20 · internal anchor
