Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Gianni Barlacchi; Sandro Pezzelle; Yunchong Huang

arxiv: 2602.11938 · v5 · submitted 2026-02-12 · 💻 cs.CL

Pith reviewed 2026-05-16 02:43 UTC · model grok-4.3

keywords: question answering, underspecified questions, benchmark evaluation, question rewriting, large language models, QA performance, question ambiguity

The pith

Rewriting underspecified questions into fully specified versions improves QA performance on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that many questions in QA benchmarks are underspecified, lacking enough context for a unique answer, and that this ambiguity accounts for many apparent model errors. Using an LLM classifier, the authors find that 16% to over 50% of questions in popular datasets are underspecified and that models score lower on them. They then rewrite those questions to be fully specified while keeping the original gold answers fixed, and show consistent gains in QA accuracy. This suggests that apparent model failures often reflect unclear questions more than model weakness, which matters for how we design and interpret benchmarks.

Core claim

We introduce an LLM-based classifier to identify underspecified questions across several QA datasets and apply it to find that 16% to over 50% of benchmark questions are underspecified, with LLMs performing significantly worse on them. Through a controlled rewriting experiment that converts these questions into fully specified variants while holding gold answers fixed, QA performance consistently improves, showing that many apparent QA failures stem from question underspecification rather than model limitations.

What carries the argument

An LLM-based classifier that detects underspecification combined with a rewriting process that produces fully specified questions while preserving the original gold answers.
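
The two components can be pictured as a small pipeline. The sketch below is a minimal illustration, not the authors' implementation: `QAItem`, `rewrite_flagged`, and the two callables `is_underspecified` and `rewrite` are hypothetical names standing in for the LLM classifier and the LLM rewriter, whose prompts are not given here. What it shows is that only the question text changes, while the gold answer is carried over untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class QAItem:
    question: str
    gold_answer: str


def rewrite_flagged(
    items: List[QAItem],
    is_underspecified: Callable[[str], bool],  # hypothetical stand-in for the LLM classifier
    rewrite: Callable[[str], str],             # hypothetical stand-in for the LLM rewriter
) -> List[QAItem]:
    """Rewrite only the flagged questions; gold answers are never modified."""
    result = []
    for item in items:
        if is_underspecified(item.question):
            # replace() copies the item and swaps the question text only.
            result.append(replace(item, question=rewrite(item.question)))
        else:
            result.append(item)
    return result
```

Because `QAItem` is frozen and only `question` is replaced, any accuracy change measured afterwards cannot come from an edited gold answer. It can still come from cues added to the question, which is why the rewrite itself needs separate validation.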

If this is right

QA benchmarks contain substantial numbers of underspecified questions that lower measured model performance.
Models reach higher accuracy once questions are rewritten to be fully specified.
Underspecification functions as a confound that affects how we evaluate QA systems.
Benchmark design should pay more attention to ensuring question clarity.
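
The first two consequences reduce to three numbers per dataset: the share of flagged questions and the accuracy on each subset. The sketch below assumes per-question boolean inputs (`flags` from a classifier, `correct` from a QA scorer); the function name and the dictionary keys are illustrative, not taken from the paper.

```python
from typing import Dict, List


def underspecification_report(flags: List[bool], correct: List[bool]) -> Dict[str, float]:
    """Share of flagged questions and QA accuracy on flagged vs. unflagged subsets."""
    if len(flags) != len(correct):
        raise ValueError("flags and correct must have the same length")
    if not flags:
        raise ValueError("need at least one question")

    flagged = [c for f, c in zip(flags, correct) if f]
    clear = [c for f, c in zip(flags, correct) if not f]

    def accuracy(xs: List[bool]) -> float:
        # NaN signals an empty subset rather than a misleading 0.0.
        return sum(xs) / len(xs) if xs else float("nan")

    return {
        "underspecified_rate": len(flagged) / len(flags),
        "accuracy_flagged": accuracy(flagged),
        "accuracy_clear": accuracy(clear),
    }
```

A gap between `accuracy_flagged` and `accuracy_clear` is only a correlation; the rewriting experiment is what the paper uses to argue for a causal effect.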

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Automatic underspecification checks could be added to future benchmark creation pipelines to reduce ambiguity.
The rewriting technique might be applied to training data to create more robust QA models.
Similar detection and clarification steps could improve reliability in real-world user queries to QA systems.

Load-bearing premise

The rewriting process produces fully specified questions that preserve the exact intended meaning and gold answer without introducing new information or changing the underlying query intent.

What would settle it

If the classifier flags many clearly well-specified questions or if the rewritten versions produce no accuracy gains or lower accuracy than the originals, the claim that underspecification drives the observed failures would not hold.
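
One way to make the second condition concrete is a paired comparison on the same questions before and after rewriting. The sketch below is an assumption about how such a check could be run, not a description of the paper's statistics: it counts questions that flip from wrong to right and from right to wrong, then applies an exact two-sided sign test (equivalent to an exact McNemar test) to those flips. The function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def paired_flip_counts(before: List[bool], after: List[bool]) -> Tuple[int, int]:
    """Return (fixed, broken): wrong-to-right and right-to-wrong flips."""
    if len(before) != len(after):
        raise ValueError("before and after must have the same length")
    fixed = sum(1 for b, a in zip(before, after) if not b and a)
    broken = sum(1 for b, a in zip(before, after) if b and not a)
    return fixed, broken


def exact_sign_test_p(fixed: int, broken: int) -> float:
    """Two-sided exact p-value under the null that flips go either way with p = 0.5."""
    n = fixed + broken
    if n == 0:
        return 1.0  # no flips at all: no evidence of any change
    k = min(fixed, broken)
    # Probability of a split at least as lopsided as observed, in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

If `broken` is close to or larger than `fixed`, or the p-value is large, the rewritten questions are not measurably easier and the underspecification explanation loses support.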

Original abstract

Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Underspecification affects 16% to over 50% of questions in common QA sets, and rewriting them lifts performance, but the rewrite controls look thin.

The main thing to know is that this paper measures how often questions in standard QA benchmarks are underspecified and shows that rewriting them into clearer versions improves model results while keeping the gold answers the same. The authors apply an LLM classifier to several datasets and report that 16% to over 50% of questions fall into this category, with models doing noticeably worse on them. The rewriting experiment then serves as an upper-bound test: fix the answer, make the question fully specified, and watch the scores rise.

That combination of prevalence numbers and a controlled before-after comparison is the new piece here, and it does a clean job of flagging a practical confound in how we evaluate QA systems. The numbers across multiple datasets are useful for anyone who builds or tests these models, and the idea that some apparent failures are really question problems rather than model limits is worth taking seriously.

The soft spot is the rewriting step itself. Holding the answer fixed is a good start, but without reported human validation or checks that the new questions add no extra cues and do not shift intent, the performance gains could partly come from incidental changes in wording or lexical overlap. The abstract also omits classifier accuracy and inter-annotator details, so the exact rates are harder to trust at face value.

This is the kind of work that belongs in a reading group for people focused on benchmark design and evaluation methodology. It deserves peer review because the core observation is grounded in real data and points to a fixable issue, even if the rewrite validation needs tightening before the claims land fully.

Referee Report

2 major / 1 minor

Summary. The paper argues that many failures of LLMs on standard QA benchmarks stem from underspecified questions whose interpretation cannot be uniquely determined without extra context. It introduces an LLM-based classifier that flags 16% to over 50% of questions across several datasets as underspecified, shows that LLMs perform worse on these, and reports a controlled rewriting experiment in which underspecified questions are turned into fully specified variants while gold answers are held fixed; QA performance improves consistently, suggesting the failures are due to question ambiguity rather than model limitations.

Significance. If the rewriting step is shown to preserve original intent and answerability conditions, the work identifies a measurable confound in current QA evaluation and supplies concrete rates of underspecification that could inform benchmark curation. The empirical demonstration that performance rises when questions are clarified provides a useful upper-bound analysis and motivates greater attention to question clarity in dataset design.

major comments (2)

[Abstract] Abstract and rewriting experiment: the central claim that performance gains isolate underspecification rests on the assumption that rewritten questions preserve exact intended meaning and gold-answer conditions without adding cues or shifting intent. No quantitative validation (human equivalence judgments, inter-annotator agreement, or lexical-overlap controls) is reported, leaving open the possibility that gains arise from incidental factors such as higher passage overlap or leakage of answer information.
[Abstract] Abstract: the reported underspecification rates (16% to >50%) and the claim of significantly worse LLM performance on them depend on the accuracy of the LLM classifier, yet no classifier accuracy, precision/recall, or inter-annotator agreement figures are supplied, making it impossible to assess how reliably the rates reflect true underspecification.
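
The lexical-overlap and leakage controls named in the first major comment can be approximated with very simple checks. The sketch below is illustrative only and assumes plain string inputs: `jaccard` measures word-level overlap between an original and a rewritten question, and `leaks_answer` flags rewrites that introduce the gold answer string when the original did not contain it. Both helper names are hypothetical, and substring matching is a crude proxy that would miss paraphrased leaks.

```python
import re
from typing import Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(original: str, rewritten: str) -> float:
    """Word-level Jaccard overlap between the two question versions."""
    a, b = tokens(original), tokens(rewritten)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leaks_answer(original: str, rewritten: str, gold_answer: str) -> bool:
    """True if the rewrite contains the gold answer and the original did not."""
    gold = gold_answer.lower().strip()
    if not gold:
        return False
    return gold in rewritten.lower() and gold not in original.lower()
```

Reporting the distribution of `jaccard` scores and the count of `leaks_answer` hits alongside the accuracy gains would let readers judge how much of the improvement could be incidental.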

minor comments (1)

[Title] The title question is an example of underspecification but is not explicitly linked to the datasets analyzed; a brief note connecting it to the empirical findings would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our evaluation methodology. We address each major comment below and commit to revisions that strengthen the paper's claims.

Referee: [Abstract] Abstract and rewriting experiment: the central claim that performance gains isolate underspecification rests on the assumption that rewritten questions preserve exact intended meaning and gold-answer conditions without adding cues or shifting intent. No quantitative validation (human equivalence judgments, inter-annotator agreement, or lexical-overlap controls) is reported, leaving open the possibility that gains arise from incidental factors such as higher passage overlap or leakage of answer information.

Authors: We agree that explicit validation of the rewriting step is necessary to isolate the effect of underspecification. The experiment holds gold answers fixed by construction and adds only minimal context to resolve ambiguity, but we acknowledge the need for quantitative safeguards against incidental changes. In the revised manuscript we will add a human evaluation on a representative sample of rewritten questions, reporting equivalence judgments, inter-annotator agreement, and lexical-overlap statistics between original and rewritten forms. These additions will directly address the possibility of unintended cues or shifts.
Revision: yes

Referee: [Abstract] Abstract: the reported underspecification rates (16% to >50%) and the claim of significantly worse LLM performance on them depend on the accuracy of the LLM classifier, yet no classifier accuracy, precision/recall, or inter-annotator agreement figures are supplied, making it impossible to assess how reliably the rates reflect true underspecification.

Authors: We recognize that the reported rates and performance gaps rest on the classifier's reliability. While the abstract summarizes the main findings, the full paper describes the classifier prompt and its application; however, we did not include validation metrics in the initial submission. In the revision we will add a dedicated section reporting classifier accuracy, precision/recall on a held-out human-annotated set, and inter-annotator agreement, thereby substantiating the 16%–>50% figures and the associated performance differences.
Revision: yes
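
The validation metrics promised in the second response are standard and easy to compute once human labels exist. The sketch below assumes binary labels (True for underspecified) and uses hypothetical function names; it computes precision and recall of the classifier against human annotations, and Cohen's kappa between two annotators. No numbers from the paper are used.

```python
from typing import List, Tuple


def precision_recall(pred: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Precision and recall of predicted flags against human labels."""
    if len(pred) != len(gold):
        raise ValueError("pred and gold must have the same length")
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall


def cohens_kappa(a: List[bool], b: List[bool]) -> float:
    """Chance-corrected agreement between two annotators on binary labels."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        return 1.0  # convention: both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

Low precision would mean the 16% to over 50% rates overstate true underspecification, while low kappa would suggest the category itself is hard for humans to agree on.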

Circularity Check

0 steps flagged

No circularity: empirical classifier and rewriting experiment are self-contained measurements

The paper introduces an LLM-based classifier to detect underspecified questions and performs a rewriting experiment that holds gold answers fixed while measuring QA performance gains. These steps rely on external datasets and direct empirical deltas rather than any self-definitional loop, fitted parameter renamed as prediction, or self-citation that bears the central load. No equations or uniqueness theorems are invoked; the claim that performance improves under rewriting is falsifiable by the reported numbers and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption that underspecified questions possess a recoverable unique interpretation that can be made explicit via rewriting without altering the gold answer or introducing bias.

axioms (1)

Domain assumption: Underspecified questions have a unique intended interpretation that can be made explicit by rewriting while preserving the original gold answer.
Invoked to justify holding gold answers fixed during the rewriting experiment.

pith-pipeline@v0.9.0 · 5473 in / 1222 out tokens · 64624 ms · 2026-05-16T02:43:17.645031+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Role of Ambiguity in Error Prediction via Uncertainty Quantification
cs.CL 2026-06 unverdicted novelty 5.0

Disentangling input ambiguity from uncertainty quantification improves error prediction for LLMs on QA tasks, yielding over 10 PRR point gains across models and datasets.