UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval
Pith reviewed 2026-05-10 07:57 UTC · model grok-4.3
The pith
Classic similarity-based information retrieval favors relevance over usefulness for answering queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Classic similarity-based information retrieval aligns more strongly with relevance than with usefulness. LLM-based systems can counteract this bias, yet domain-specific problems still require a high degree of expertise that current LLMs do not fully incorporate.
What carries the argument
UsefulBench, a dataset of domain-specific query-text pairs labeled by three professional analysts for relevance (text connected to the query) versus usefulness (practical value in responding to the query).
If this is right
- Traditional IR systems retrieve texts that match queries in wording more reliably than texts that aid practical responses.
- LLM-based retrieval narrows the gap between relevance and usefulness but remains limited in expert domains.
- Retrieval targets should shift toward decision-useful information rather than similarity alone.
- UsefulBench provides a benchmark for testing systems that prioritize practical value over lexical match.
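The alignment claims in these bullets can be made concrete with a toy evaluation: score one similarity-based ranking against relevance labels and against usefulness labels separately, then compare. The labels, text ids, and precision@k metric below are hypothetical illustrations, not UsefulBench's actual data or protocol:

```python
def precision_at_k(ranking, positives, k=3):
    """Fraction of the top-k retrieved ids whose label is positive."""
    return sum(1 for doc_id in ranking[:k] if doc_id in positives) / k

# Hypothetical labels for five candidate texts (t1..t5) on one query.
relevant = {"t1", "t2", "t3", "t4"}        # connected to the query
useful = {"t3", "t5"}                      # practically help answer it
ranking = ["t1", "t2", "t3", "t4", "t5"]   # a similarity-based ordering

gap = precision_at_k(ranking, relevant) - precision_at_k(ranking, useful)
# A positive gap means the ranker tracks relevance more than usefulness.
```

On this invented example the gap is 2/3: the top-3 results are all relevant, but only one is useful.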
Where Pith is reading between the lines
- Evaluation of future IR systems could track downstream decision accuracy rather than label agreement alone.
- Integrating domain-specific knowledge sources might close the expertise gap that limits current LLMs on usefulness.
- Testing the same queries with outcome-linked data could show whether usefulness labels predict real improvements in answers.
Load-bearing premise
Usefulness is a stable property that three professional analysts can label consistently without external validation against decision outcomes.
What would settle it
A larger annotation study or direct measurement of whether usefulness-labeled texts improve accuracy on the original decision tasks.
Original abstract
Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.
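The Paris/Berlin failure mode in the abstract can be reproduced with a minimal bag-of-words retriever. The texts below are invented stand-ins, and real systems use far stronger representations, but the mechanism is the same: lexical overlap rewards the relevant-but-useless text over the useful one.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two lowercased texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

query = "is paris larger than berlin"
relevant_not_useful = "paris is a city in france and berlin is a city in germany"
useful = "paris has about 2.1 million residents while berlin has about 3.7 million"

# The lexical-overlap retriever ranks the merely relevant text higher.
lexical_bias = cosine(query, relevant_not_useful) > cosine(query, useful)
```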
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UsefulBench, a domain-specific dataset in which three professional analysts label texts as relevant (connected to the query via similarity) or useful (holding practical value for responding to the query). It reports that classic similarity-based IR aligns more strongly with relevance labels, while LLM-based retrievers can partially counteract this bias but still fail to incorporate the high domain expertise needed for the benchmark's problems. The work explores mitigation strategies and presents UsefulBench as a new challenge dataset for decision-useful IR.
Significance. If the usefulness/relevance distinction can be shown to be stable and the reported alignment gaps hold under larger-scale validation, the paper would usefully highlight a gap between conventional relevance and decision utility in IR. It supplies a concrete motivating example (Paris vs. Berlin) and initial evidence that both traditional and LLM systems fall short on usefulness, which could steer future work toward new training signals or evaluation targets. The absence of any machine-checked proofs or parameter-free derivations is unsurprising for an empirical benchmark paper, but the small annotation scale limits immediate generalizability.
Major comments (2)
- [Dataset curation and labeling description] The central empirical claims—that classic similarity-based IR tracks relevance more than usefulness and that LLMs lack domain expertise—rest entirely on the UsefulBench labels produced by exactly three professional analysts. No inter-annotator agreement statistic, adjudication procedure, dataset size, annotation protocol, statistical tests, or correlation with external decision outcomes is described. If the usefulness/relevance distinction is unstable or annotator-specific, the reported alignment gaps become uninterpretable.
- [Abstract and experimental results] The abstract and results sections state clear directional findings without reporting the number of queries, texts per query, domains covered, or any controls for query difficulty. This makes it impossible to assess whether the claimed superiority of classic IR on relevance (or LLM shortcomings on usefulness) generalizes beyond the specific instances labeled by the three analysts.
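The missing agreement statistic flagged in the first major comment has a standard remedy: Fleiss' kappa over the three analysts' binary useful/not-useful votes. A stdlib-only sketch, with an invented ratings matrix for illustration:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for items rated by a fixed number of raters.
    ratings: one row per item; each row holds per-category counts
    (e.g. [useful, not_useful]) summing to the number of raters."""
    n = len(ratings)          # number of items
    r = sum(ratings[0])       # raters per item (here, 3 analysts)
    k = len(ratings[0])       # number of categories
    # Mean per-item agreement: agreeing rater pairs, normalized.
    p_bar = sum((sum(c * c for c in row) - r) / (r * (r - 1))
                for row in ratings) / n
    # Chance agreement from marginal category proportions.
    p_j = [sum(row[j] for row in ratings) / (n * r) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Hypothetical usefulness votes from three analysts on five query-text pairs.
votes = [[3, 0], [0, 3], [2, 1], [3, 0], [1, 2]]
kappa = fleiss_kappa(votes)
```

Reporting this single number per label type (relevance and usefulness) would directly answer whether the usefulness distinction is annotator-stable.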
Minor comments (1)
- [Abstract] The abstract refers to 'domain-specific problems' and 'high degree of expertise' without naming the domains or providing additional concrete examples beyond the single Paris/Berlin query, which reduces clarity for readers unfamiliar with the benchmark.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below. We agree that additional documentation on the dataset and experimental setup is warranted and will revise the paper accordingly to enhance transparency and reproducibility.
Point-by-point responses
Referee: [Dataset curation and labeling description] The central empirical claims—that classic similarity-based IR tracks relevance more than usefulness and that LLMs lack domain expertise—rest entirely on the UsefulBench labels produced by exactly three professional analysts. No inter-annotator agreement statistic, adjudication procedure, dataset size, annotation protocol, statistical tests, or correlation with external decision outcomes is described. If the usefulness/relevance distinction is unstable or annotator-specific, the reported alignment gaps become uninterpretable.
Authors: We appreciate the referee highlighting the need for more rigorous documentation of the labeling process. While the manuscript introduces the dataset as curated by three professional analysts, we concur that specifics such as inter-annotator agreement, the exact annotation protocol, dataset size, and statistical validation are essential. In the revised manuscript, we will expand the relevant section to include these details, including any measures of agreement among the analysts and the procedures followed. We will also report statistical tests supporting the alignment observations. However, establishing correlation with external decision outcomes would necessitate additional empirical studies outside the scope of this benchmark introduction; we will explicitly discuss this as a limitation and a direction for future work.
Revision: yes
Referee: [Abstract and experimental results] The abstract and results sections state clear directional findings without reporting the number of queries, texts per query, domains covered, or any controls for query difficulty. This makes it impossible to assess whether the claimed superiority of classic IR on relevance (or LLM shortcomings on usefulness) generalizes beyond the specific instances labeled by the three analysts.
Authors: We agree that providing these details is crucial for evaluating the scope and generalizability of our findings. We will revise the abstract to incorporate the number of queries, the average or total texts per query, the domains covered, and any controls implemented for query difficulty. Similarly, the results section will be updated to explicitly state these figures and describe the experimental controls. This will allow readers to better contextualize the reported directional findings regarding the performance of similarity-based IR and LLM-based systems.
Revision: yes
Circularity Check
No circularity: empirical benchmark with independent human labels
Full rationale
The paper defines relevance and usefulness as distinct labeling criteria, collects new annotations from three analysts on a curated domain-specific dataset, and reports direct empirical comparisons of retrieval systems against those labels. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked to justify core premises or uniqueness. The central claims rest on the newly created labels and observed alignment differences rather than any reduction of outputs to inputs by construction. This is a standard empirical benchmark paper whose derivation chain is self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: usefulness is a property that professional analysts can label reliably.
Reference graph
Works this paper leans on
- [1] Language Models (Mostly) Know What They Know. arXiv, 2024.
- [2] DIRAS: Efficient LLM Annotation of Document Relevance for Retrieval Augmented Generation. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5238–5258, Albuquerque, New Mexico. Association for Computational Linguistics.
- [3] ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists.