pith. machine review for the scientific record.

arxiv: 2604.15827 · v2 · submitted 2026-04-17 · 💻 cs.IR · cs.CL

Recognition: unknown

UsefulBench: Towards Decision-Useful Information as a Target for Information Retrieval


Pith reviewed 2026-05-10 07:57 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords: information retrieval · usefulness · relevance · LLM evaluation · benchmark dataset · decision support · domain expertise

The pith

Classic similarity-based information retrieval favors relevance over usefulness for answering queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset to separate whether a text merely matches a query from whether it helps answer it in practice. Traditional retrieval systems turn out to track the first property more closely than the second. LLM-based systems reduce that misalignment but still miss the specialized knowledge needed for many domain queries. The distinction matters because users often want information that supports decisions rather than just similar wording. The work positions usefulness as a new target for retrieval systems.

Core claim

Classic similarity-based information retrieval aligns more strongly with relevance than with usefulness. LLM-based systems can counteract this bias, yet domain-specific problems still require a high degree of expertise that current LLMs do not fully incorporate.
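
That alignment claim is directly measurable: rank the labeled texts by a similarity score and compare the ranking against the relevance labels and the usefulness labels separately, for example with nDCG@10 as in Figure 3. A minimal sketch, with invented scores and labels standing in for real UsefulBench data:

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain for the top-k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(scores, labels, k=10):
    """nDCG@k of a score-induced ranking against graded labels (0-2)."""
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal = sorted(labels, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0

# Toy query: retriever scores plus analyst labels for five candidate texts.
# All values are made up for illustration; UsefulBench's real labels differ.
retriever_scores  = [0.92, 0.85, 0.40, 0.35, 0.10]   # e.g. cosine similarity
relevance_labels  = [2, 2, 1, 0, 0]                   # "connected to the query"
usefulness_labels = [0, 1, 2, 2, 0]                   # "helps answer it in practice"

print("nDCG@10 vs relevance: ", round(ndcg_at_k(retriever_scores, relevance_labels), 3))
print("nDCG@10 vs usefulness:", round(ndcg_at_k(retriever_scores, usefulness_labels), 3))
```

A retriever that tracks relevance more than usefulness shows a higher first number than second, which is the pattern the paper reports for similarity-based systems.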

What carries the argument

UsefulBench, a dataset of domain-specific query-text pairs labeled by three professional analysts for relevance (text connected to the query) versus usefulness (practical value in responding to the query).
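
A plausible record layout for such query-text pairs is sketched below; the field names and the 0-2 label scale are assumptions for illustration, not the released dataset's actual schema (the Paris/Berlin example is borrowed from the abstract):

```python
from dataclasses import dataclass

@dataclass
class UsefulBenchExample:
    """One analyst-labeled query-text pair (hypothetical schema)."""
    query: str          # domain-specific analyst question
    text: str           # candidate passage retrieved for the query
    relevance: int      # 0-2: is the text connected to the query?
    usefulness: int     # 0-2: does the text help answer the query in practice?
    annotator_id: str   # which of the three analysts produced the labels

example = UsefulBenchExample(
    query="Is Paris larger than Berlin?",
    text="Paris is the capital of France.",
    relevance=2,     # lexically/semantically on-topic
    usefulness=0,    # contains no population or area figures
    annotator_id="analyst_1",
)
```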

If this is right

  • Traditional IR systems retrieve texts that match queries in wording more reliably than texts that aid practical responses.
  • LLM-based retrieval narrows the gap between relevance and usefulness but remains limited on expert domains.
  • Retrieval targets should shift toward decision-useful information rather than similarity alone.
  • UsefulBench provides a benchmark for testing systems that prioritize practical value over lexical match.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Evaluation of future IR systems could track downstream decision accuracy rather than label agreement alone.
  • Integrating domain-specific knowledge sources might close the expertise gap that limits current LLMs on usefulness.
  • Testing the same queries with outcome-linked data could show whether usefulness labels predict real improvements in answers.

Load-bearing premise

Usefulness is a stable property that three professional analysts can label consistently without external validation against decision outcomes.

What would settle it

A larger annotation study or direct measurement of whether usefulness-labeled texts improve accuracy on the original decision tasks.
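
One concrete measurement such a study would report is inter-annotator agreement over the analysts' usefulness labels. A minimal sketch of Fleiss' kappa for three raters on a 0-2 scale; the label matrix below is invented purely for illustration:

```python
from collections import Counter

def fleiss_kappa(label_matrix, categories=(0, 1, 2)):
    """Fleiss' kappa for N items, each labeled by the same number of raters."""
    n_items = len(label_matrix)
    n_raters = len(label_matrix[0])
    # How many raters chose each category for every item.
    counts = [Counter(row) for row in label_matrix]
    # Per-item observed agreement.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the marginal category proportions.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p ** 2 for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical usefulness labels from three analysts for six query-text pairs.
labels = [
    (2, 2, 2),
    (0, 0, 1),
    (1, 1, 1),
    (2, 1, 2),
    (0, 0, 0),
    (1, 2, 1),
]
print("Fleiss' kappa:", round(fleiss_kappa(labels), 3))
```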

Figures

Figures reproduced from arXiv: 2604.15827 by Christian Woerle, Markus Leippold, Nicola Reichenau, Stefanie Lewandowski, Tobias Schimanski, Yauheni Huryn.

Figure 1: Relevance and usefulness examples.
Figure 2: UsefulBench creation pipeline. Three professional analysts search for documents (text passages).
Figure 3: nDCG@10 comparison of embedding and LLM-based rankings. Smaller models, such as gpt-4.1-mini, perform slightly better at distinguishing non-relevant/useful (0) from relevant/useful (1–2) documents; the largest model, gpt-4.1, performs best at identifying highly relevant/useful documents (2 vs. 0–1).
Figure 4: F1 score comparison of the ablations.
Figure 5: F1 score comparison when fine-tuning Ministral models.
Figure 6: Prompt for rating the relevance of a document.
Figure 7: Prompt for rating the usefulness of a document.
Figure 8: Instructions for the misclassification analysis.
Figure 9: Prompt for rating the relevance and usefulness.
Figure 10: ECE scores comparison of the ablations using one prompt for relevance and usefulness.
Figure 11: ECE scores comparison when fine-tuning Ministral 3b, 8b, and 14b models.
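
Figures 6, 7, and 9 describe prompts for eliciting relevance and usefulness ratings from an LLM. A minimal sketch of such a judging call, assuming an OpenAI-style chat API; the prompt wording and the "reply with a digit" format are placeholders, not the paper's actual prompts:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_usefulness(query: str, text: str, model: str = "gpt-4.1-mini") -> int:
    """Ask an LLM judge for a 0-2 usefulness rating (placeholder prompt)."""
    prompt = (
        "You are a domain analyst. Rate how useful the passage is for answering "
        "the query on a scale of 0 (not useful), 1 (partially useful), "
        "2 (highly useful). Reply with the digit only.\n\n"
        f"Query: {query}\nPassage: {text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(rate_usefulness("Is Paris larger than Berlin?",
                      "Paris is the capital of France."))
```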
Original abstract

Conventional information retrieval is concerned with identifying the relevance of texts for a given query. Yet, the conventional definition of relevance is dominated by aspects of similarity in texts, leaving unobserved whether the text is truly useful for addressing the query. For instance, when answering whether Paris is larger than Berlin, texts about Paris being in France are relevant (lexical/semantic similarity), but not useful. In this paper, we introduce UsefulBench, a domain-specific dataset curated by three professional analysts labeling whether a text is connected to a query (relevance) or holds practical value in responding to it (usefulness). We show that classic similarity-based information retrieval aligns more strongly with relevance. While LLM-based systems can counteract this bias, we find that domain-specific problems require a high degree of expertise, which current LLMs do not fully incorporate. We explore approaches to (partially) overcome this challenge. However, UsefulBench presents a dataset challenge for targeted information retrieval systems.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces UsefulBench, a domain-specific dataset in which three professional analysts label texts as relevant (connected to the query via similarity) or useful (holding practical value for responding to the query). It reports that classic similarity-based IR aligns more strongly with relevance labels, while LLM-based retrievers can partially counteract this bias but still fail to incorporate the high domain expertise needed for the benchmark's problems. The work explores mitigation strategies and presents UsefulBench as a new challenge dataset for decision-useful IR.

Significance. If the usefulness/relevance distinction can be shown to be stable and the reported alignment gaps hold under larger-scale validation, the paper would usefully highlight a gap between conventional relevance and decision utility in IR. It supplies a concrete motivating example (Paris vs. Berlin) and initial evidence that both traditional and LLM systems fall short on usefulness, which could steer future work toward new training signals or evaluation targets. The absence of any machine-checked proofs or parameter-free derivations is unsurprising for an empirical benchmark paper, but the small annotation scale limits immediate generalizability.

major comments (2)
  1. [Dataset curation and labeling description] The central empirical claims—that classic similarity-based IR tracks relevance more than usefulness and that LLMs lack domain expertise—rest entirely on the UsefulBench labels produced by exactly three professional analysts. No inter-annotator agreement statistic, adjudication procedure, dataset size, annotation protocol, statistical tests, or correlation with external decision outcomes is described. If the usefulness/relevance distinction is unstable or annotator-specific, the reported alignment gaps become uninterpretable.
  2. [Abstract and experimental results] The abstract and results sections state clear directional findings without reporting the number of queries, texts per query, domains covered, or any controls for query difficulty. This makes it impossible to assess whether the claimed superiority of classic IR on relevance (or LLM shortcomings on usefulness) generalizes beyond the specific instances labeled by the three analysts.
minor comments (1)
  1. [Abstract] The abstract refers to 'domain-specific problems' and 'high degree of expertise' without naming the domains or providing additional concrete examples beyond the single Paris/Berlin query, which reduces clarity for readers unfamiliar with the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments in detail below. We agree that additional documentation on the dataset and experimental setup is warranted and will revise the paper accordingly to enhance transparency and reproducibility.

Point-by-point responses
  1. Referee: [Dataset curation and labeling description] The central empirical claims—that classic similarity-based IR tracks relevance more than usefulness and that LLMs lack domain expertise—rest entirely on the UsefulBench labels produced by exactly three professional analysts. No inter-annotator agreement statistic, adjudication procedure, dataset size, annotation protocol, statistical tests, or correlation with external decision outcomes is described. If the usefulness/relevance distinction is unstable or annotator-specific, the reported alignment gaps become uninterpretable.

    Authors: We appreciate the referee highlighting the need for more rigorous documentation of the labeling process. While the manuscript introduces the dataset as curated by three professional analysts, we concur that specifics such as inter-annotator agreement, the exact annotation protocol, dataset size, and statistical validation are essential. In the revised manuscript, we will expand the relevant section to include these details, including any measures of agreement among the analysts and the procedures followed. We will also report statistical tests supporting the alignment observations. However, establishing correlation with external decision outcomes would necessitate additional empirical studies outside the scope of this benchmark introduction; we will explicitly discuss this as a limitation and a direction for future work. revision: yes

  2. Referee: [Abstract and experimental results] The abstract and results sections state clear directional findings without reporting the number of queries, texts per query, domains covered, or any controls for query difficulty. This makes it impossible to assess whether the claimed superiority of classic IR on relevance (or LLM shortcomings on usefulness) generalizes beyond the specific instances labeled by the three analysts.

    Authors: We agree that providing these details is crucial for evaluating the scope and generalizability of our findings. We will revise the abstract to incorporate the number of queries, the average or total texts per query, the domains covered, and any controls implemented for query difficulty. Similarly, the results section will be updated to explicitly state these figures and describe the experimental controls. This will allow readers to better contextualize the reported directional findings regarding the performance of similarity-based IR and LLM-based systems. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent human labels

full rationale

The paper defines relevance and usefulness as distinct labeling criteria, collects new annotations from three analysts on a curated domain-specific dataset, and reports direct empirical comparisons of retrieval systems against those labels. No equations, fitted parameters, or predictions appear in the provided text. No self-citations are invoked to justify core premises or uniqueness. The central claims rest on the newly created labels and observed alignment differences rather than any reduction of outputs to inputs by construction. This is a standard empirical benchmark paper whose derivation chain is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that usefulness is a labelable property distinct from relevance and that the three-analyst curation provides a valid ground truth; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Usefulness is a property that professional analysts can label reliably.
    The entire dataset and all downstream comparisons depend on this assumption for their validity.

pith-pipeline@v0.9.0 · 5481 in / 1189 out tokens · 41649 ms · 2026-05-10T07:57:02.893711+00:00 · methodology

