The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection

Akash Vishwakarma; Ameya Gawde; Cien Zhang; Harshvardhan Singh; Michael Wyatt Thot; Svetlana Karslioglu; Tony Joseph; Vyzantinos Repantis

arXiv: 2605.18857 · v1 · pith:PCIWNIXInew · submitted 2026-05-14 · 💻 cs.IR · cs.AI · cs.LG

Pith reviewed 2026-05-20 21:46 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG

keywords retrieval selectivity · chance-corrected metrics · BoR · hypergeometric baseline · RAG evaluation · coverage paradox · LLM tool selection · information retrieval

The pith

Near-perfect retrieval success often matches random selection once enough relevant items exist in the pool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bits-over-Random (BoR) as a way to measure genuine selectivity beyond raw success rates. On the 20 Newsgroups collection, both BM25 and SPLADE reach over 99 percent coverage (at least one relevant document inside the top 100 results), yet BoR sits near zero because a random draw already succeeds at that depth. Selectivity vanishes once the expected number of relevant documents inside the retrieved slice exceeds roughly three to five. The same pattern appears in retrieval-augmented generation, where LLM accuracy can degrade substantially at K=100, and in LLM agent tool selection when catalog sizes are small. The authors therefore propose reporting BoR together with conventional metrics and choosing retrieval depths that preserve actual selectivity rather than merely inflating coverage.

Core claim

On the 20 Newsgroups dataset, BM25 and SPLADE both report greater than 99 percent success at K=100 under the coverage rule, yet BoR equals approximately zero, showing performance no better than a random baseline. BoR is defined as the base-two logarithm of observed success probability divided by the success probability under a hypergeometric null model for drawing at least one relevant document. The collapse occurs whenever the expected coverage ratio, K times average relevant documents per query divided by collection size, exceeds three to five; the same boundary explains degraded RAG accuracy at K=100 and vanishing selectivity in small-catalog LLM tool use.

What carries the argument

Bits-over-Random (BoR), defined as log base 2 of observed success probability over the hypergeometric random baseline for the coverage success rule of at least one relevant item inside the top K.
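
A minimal Python sketch of this definition, assuming the collection size N, the number of relevant documents R for a query, and the retrieval depth K are known. The function names are illustrative and do not come from the paper. The miss probability C(N-R, K)/C(N, K) is evaluated as a product in log space so it stays stable for large collections.

```python
import math


def p_rand_coverage(N: int, R: int, K: int) -> float:
    """Hypergeometric chance that K documents drawn uniformly without
    replacement from N contain at least one of the R relevant ones."""
    if R <= 0 or K <= 0:
        return 0.0
    if K > N - R:
        return 1.0  # not enough non-relevant documents to fill the slice
    # log P(miss) = sum_i log(1 - R / (N - i)) for i = 0 .. K-1
    log_miss = sum(math.log1p(-R / (N - i)) for i in range(K))
    return -math.expm1(log_miss)  # 1 - P(miss)


def bits_over_random(p_obs: float, p_rand: float) -> float:
    """BoR = log2(P_obs / P_rand); minus infinity if the system never succeeds."""
    if p_rand <= 0.0:
        raise ValueError("random baseline must be positive")
    return math.log2(p_obs / p_rand) if p_obs > 0.0 else float("-inf")


# Hypothetical numbers for illustration only (not the paper's statistics).
N, R, K = 10_000, 300, 100
baseline = p_rand_coverage(N, R, K)
print(f"P_rand = {baseline:.4f}, BoR at P_obs = 0.99: {bits_over_random(0.99, baseline):.4f} bits")
```

Because P_obs cannot exceed 1, the largest BoR any system can reach at a given depth is -log2(P_rand), which shrinks toward zero as the baseline approaches 1.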

If this is right

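LLM accuracy in retrieval-augmented generation can degrade substantially at K=100, consistent with the near-zero BoR ceiling at that depth.
BoR stays positive on BEIR/SciFact and on MS MARCO, where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap.
The selectivity collapse also appears in LLM agent tool selection: with a small catalog the expected coverage ratio crosses the boundary, and selectivity vanishes even for a perfect selector.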
Reporting BoR alongside traditional metrics prevents over-reliance on high success rates that carry no extra selectivity.
Retrieval depth should be limited once additional items add negligible selectivity but raise compute cost.
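
The tool-selection consequence can be illustrated with a small sketch. It assumes, purely for illustration, that exactly one tool in the catalog is correct for a task and that the agent shortlists k tools; under that assumption a random shortlist contains the correct tool with probability k / catalog_size, so a perfect selector (P_obs = 1) earns at most log2(catalog_size / k) bits. The function name is hypothetical.

```python
import math


def tool_selection_ceiling_bits(catalog_size: int, k: int) -> float:
    """Upper bound on BoR for a perfect selector, assuming exactly one
    correct tool in the catalog and a shortlist of k tools."""
    if not 1 <= k <= catalog_size:
        raise ValueError("need 1 <= k <= catalog_size")
    # P_rand = k / catalog_size, P_obs = 1  =>  BoR = log2(catalog_size / k)
    return math.log2(catalog_size / k)


# Illustrative catalog sizes (assumed, not taken from the paper).
for size in (8, 64, 1024):
    print(size, tool_selection_ceiling_bits(size, k=4))
```

With a shortlist as large as the catalog the ceiling is exactly zero, which is the small-catalog collapse in its simplest form.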

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Systems that maximize K without checking the coverage ratio may waste computation on results that add no real information.
The same boundary likely applies to any selection task where the fraction of acceptable items is moderate and K is not tiny.
Re-ranking or filtering steps could restore selectivity after an initial broad retrieval that already saturates the random baseline.
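
One way to check the coverage ratio before fixing K is sketched below. It relies on the bound P_rand >= 1 - exp(-K * R / N), so the BoR ceiling is at most -log2(1 - exp(-K * R / N)), and it treats every query as having the average number of relevant documents, which is a simplification. The function name `depth_budget` and the example numbers are assumptions made for illustration, not something the paper defines.

```python
import math


def depth_budget(N: int, r_bar: float, min_bits: float) -> int:
    """Largest K for which the ceiling bound -log2(1 - exp(-K * r_bar / N))
    is still at least min_bits. Beyond this depth, even a perfect retriever
    cannot exceed min_bits over random (under the stated simplification)."""
    if N <= 0 or r_bar <= 0 or min_bits <= 0:
        raise ValueError("N, r_bar and min_bits must be positive")
    # bound >= min_bits  <=>  K * r_bar / N <= -ln(1 - 2**(-min_bits))
    max_ratio = -math.log1p(-(2.0 ** (-min_bits)))
    return min(N, math.floor(max_ratio * N / r_bar))


# Hypothetical collection: 1000 documents, 10 relevant per query on average.
for bits in (1.0, 0.5, 0.1):
    print(f"ceiling >= {bits} bits requires K <= {depth_budget(1000, 10, bits)}")
```

The design choice here is to budget depth by the selectivity that remains achievable rather than by coverage, which keeps rising with K regardless.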

Load-bearing premise

The hypergeometric distribution accurately represents the null model of random document selection for the chosen success rule across the tested datasets.

What would settle it

Empirically sampling documents uniformly at random without replacement on the 20 Newsgroups collection at K=100 and checking whether the observed coverage rate matches the hypergeometric prediction to within a few percent.
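
The check described above can be sketched as a small Monte Carlo experiment. The collection size, relevant count, and depth below are placeholders, not 20 Newsgroups statistics, and the function names are illustrative; a real check would plug in the dataset's own per-query relevance counts.

```python
import math
import random


def hypergeom_coverage(N: int, R: int, K: int) -> float:
    """Predicted P(at least one relevant in a random K-subset of N)."""
    if K > N - R:
        return 1.0
    return -math.expm1(sum(math.log1p(-R / (N - i)) for i in range(K)))


def simulated_coverage(N: int, R: int, K: int, trials: int = 5000, seed: int = 0) -> float:
    """Empirical coverage rate of uniform sampling without replacement.
    Document ids 0 .. R-1 are treated as the relevant ones (ids are arbitrary)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        draw = rng.sample(range(N), K)
        if any(doc_id < R for doc_id in draw):
            hits += 1
    return hits / trials


# Placeholder sizes chosen so the expected coverage ratio K * R / N is 1.
N, R, K = 2000, 20, 100
print("predicted:", round(hypergeom_coverage(N, R, K), 4))
print("simulated:", round(simulated_coverage(N, R, K), 4))
```

If the two numbers disagree by more than sampling noise on real relevance counts, the premise that the hypergeometric null describes random selection would be in doubt.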

Figures

Figures reproduced from arXiv: 2605.18857 by Akash Vishwakarma, Ameya Gawde, Cien Zhang, Harshvardhan Singh, Michael Wyatt Thot, Svetlana Karslioglu, Tony Joseph, Vyzantinos Repantis.

**Figure 2.** BoR analysis on 20 Newsgroups demonstrates the 99% success paradox. Both BM25 …

For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as $BoR = \log_{2}\left(\frac{\mathrm{P}_{obs}}{\mathrm{P}_{rand}}\right)$, where $\mathrm{P}_{rand}$ is the hypergeometric baseline for the chosen success rule (here, coverage: $ \geq1 $ relevant in top-$K$). On the 20 Newsgroups dataset, BM25 and SPLADE both report $>99$% success at $K=100$ (coverage), yet $BoR \approx 0$, indicating random-level selectivity at that depth. When the expected coverage ratio $\left(\frac{K \cdot \bar{R}_{q}}{N}\right)$ exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at $K=100$, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

High coverage in retrieval often masks random-level selectivity once K grows relative to relevant items, and this paper gives a clean hypergeometric-based metric to spot it.

The main thing to know is that on datasets like 20 Newsgroups, standard retrievers can post 99%+ coverage at K=100 while their selectivity over random drops to zero. The paper exposes this with Bits-over-Random, defined as log2 of the observed success probability over the hypergeometric baseline for getting at least one relevant document in the top K. When the expected coverage ratio exceeds roughly 3-5, the random baseline takes over and extra depth buys nothing useful. That threshold follows from the hypergeometric baseline itself, and they show the same pattern in RAG accuracy drops and in small-catalog LLM tool selection. On BEIR/SciFact and MS MARCO the metric stays positive, and on MS MARCO the systems cluster near the theoretical ceiling, which lines up with the model predictions.

The hypergeometric null is the correct one for without-replacement sampling, so the saturation claim is not an artifact. What they do well is apply the correction directly to practical IR and RAG settings instead of leaving it as abstract theory. The downstream links to LLM accuracy and agent efficiency are straightforward and worth noting. A minor soft spot is that the abstract leaves the exact aggregation across queries and any post-processing details implicit, though nothing in the central argument looks circular or fitted. The idea is not revolutionary, since chance correction appears elsewhere, but the targeted use for retrieval depth decisions is practical.

This is for IR and RAG practitioners who need to decide when deeper retrieval stops adding signal and starts adding cost. Readers working on evaluation standards or LLM pipelines will get concrete guidance on metric choices and K tuning. It has enough formal grounding and real-dataset checks to deserve serious referee time rather than a quick pass. I would send it for peer review.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Bits-over-Random (BoR), a chance-corrected selectivity metric defined as BoR = log₂(P_obs / P_rand), where P_rand is the probability of covering at least one relevant document in the top-K under a hypergeometric null model. It reports that on the 20 Newsgroups dataset both BM25 and SPLADE exceed 99% coverage at K=100 yet yield BoR ≈ 0, indicating random-level selectivity. The paper shows that selectivity collapses when the expected coverage ratio K · R_q / N exceeds 3-5, links this to degraded RAG accuracy at large K, and extends the saturation logic to LLM tool selection with small catalogs, recommending that BoR be reported alongside conventional metrics.

Significance. If the empirical results hold, the work supplies a theoretically grounded, parameter-free way to detect when high nominal success rates convey no additional selectivity. The hypergeometric derivation is exact for the stated success rule and the saturation threshold follows directly from the CDF, providing a falsifiable prediction that is confirmed across 20 Newsgroups, BEIR/SciFact, and MS MARCO. This is especially timely for RAG and agent settings where downstream consumers cannot filter noise.

minor comments (3)

[Abstract] The notation shifts from $R_q$ to $\bar{R}_q$ when defining the coverage ratio; explicitly state whether the bar denotes the query-wise average and provide the numerical value of $\bar{R}_q$ for 20 Newsgroups so readers can reproduce the BoR ≈ 0 claim.
[Results] The manuscript states that 41 systems on MS MARCO cluster within 0.2 bits of the theoretical ceiling; include a brief table or figure showing the range of recall values and corresponding BoR scores to substantiate the claim that recall gaps do not translate into selectivity gaps.
[Methods] Clarify in the methods section how P_obs is estimated from finite runs (e.g., number of queries, tie-breaking, or smoothing) so that the reported BoR values can be independently verified.
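
To make the last comment concrete, here is one plausible aggregation convention, offered only as an assumption since the abstract does not specify it: P_obs is the fraction of queries with at least one relevant document in the top K, and P_rand is the mean of the per-query hypergeometric baselines. The function names and the toy inputs are hypothetical.

```python
import math
from typing import Sequence, Tuple


def p_rand_coverage(N: int, R: int, K: int) -> float:
    """Per-query hypergeometric baseline for the coverage rule."""
    if R <= 0 or K <= 0:
        return 0.0
    if K > N - R:
        return 1.0
    return -math.expm1(sum(math.log1p(-R / (N - i)) for i in range(K)))


def bor_from_runs(hits: Sequence[bool], relevant_counts: Sequence[int],
                  N: int, K: int) -> Tuple[float, float, float]:
    """Return (P_obs, P_rand, BoR) under the ASSUMED convention:
    P_obs = share of queries that succeed, P_rand = mean per-query baseline."""
    if not hits or len(hits) != len(relevant_counts):
        raise ValueError("need one hit flag and one relevant count per query")
    p_obs = sum(hits) / len(hits)
    p_rand = sum(p_rand_coverage(N, r, K) for r in relevant_counts) / len(hits)
    if p_rand == 0.0:
        raise ValueError("no query has a relevant document")
    bor = math.log2(p_obs / p_rand) if p_obs > 0.0 else float("-inf")
    return p_obs, p_rand, bor


# Toy run: four queries over a ten-document collection at K = 1.
print(bor_from_runs([True, True, True, False], [1, 1, 1, 1], N=10, K=1))
```

Other conventions, such as averaging per-query log ratios, would give different numbers, which is why the aggregation needs to be stated explicitly.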

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, as well as for recognizing its timeliness for RAG and LLM agent applications. We appreciate the recommendation for minor revision and the confirmation that the hypergeometric derivation and saturation predictions are exact and falsifiable.

Circularity Check

0 steps flagged

No significant circularity identified

The paper introduces BoR as an explicit definition BoR = log2(P_obs / P_rand) using the standard hypergeometric null model for the coverage rule (at least one relevant in top-K). The saturation claim when K · R_q / N exceeds 3-5 follows directly from the hypergeometric CDF 1 - C(N-R, K)/C(N, K) approaching 1 under the finite-population sampling model; this is a mathematical consequence of the chosen null rather than a fitted parameter or self-referential loop. Empirical results on 20 Newsgroups, BEIR, and MS MARCO are presented as observations consistent with the baseline, not as predictions that reduce to the same inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems are invoked to justify the core measure or its implications. The derivation chain is therefore self-contained against external statistical benchmarks.
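
The saturation can be made concrete with a short worked bound. This is a sketch that assumes a single query with R relevant documents (the abstract's ratio uses the query average $\bar{R}_{q}$ instead) and introduces $\lambda$ as shorthand for the expected coverage ratio; neither simplification is taken from the paper. Each factor satisfies $1 - \frac{R}{N-i} \leq 1 - \frac{R}{N} \leq e^{-R/N}$, which gives the first inequality, and the second uses $P_{obs} \leq 1$.

```latex
\begin{align*}
P_{rand} &= 1 - \frac{\binom{N-R}{K}}{\binom{N}{K}}
          = 1 - \prod_{i=0}^{K-1}\left(1 - \frac{R}{N-i}\right)
          \;\geq\; 1 - e^{-\lambda},
          \qquad \lambda = \frac{K \cdot R}{N}, \\
BoR &= \log_{2}\left(\frac{P_{obs}}{P_{rand}}\right)
     \;\leq\; -\log_{2} P_{rand}
     \;\leq\; -\log_{2}\left(1 - e^{-\lambda}\right), \\
\lambda = 3 &: \quad -\log_{2}\left(1 - e^{-3}\right) \approx 0.074 \text{ bits}, \\
\lambda = 5 &: \quad -\log_{2}\left(1 - e^{-5}\right) \approx 0.0098 \text{ bits}.
\end{align*}
```

Under these assumptions, at ratios of 3 to 5 even a perfect retriever can score at most about 0.07 to 0.01 bits over random, which matches the near-zero BoR described for the saturated regime.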

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating the hypergeometric distribution as the appropriate random baseline for coverage-based success and on the log-ratio definition of BoR itself. No free parameters fitted to target data are mentioned; the new metric is the primary addition.

axioms (1)

domain assumption: The hypergeometric distribution models the probability of retrieving at least one relevant document under random selection for the coverage success rule.
Invoked to compute P_rand in the BoR definition as the chance baseline.

invented entities (1)

Bits-over-Random (BoR) metric · no independent evidence
purpose: Quantify chance-corrected retrieval selectivity in bits
Newly defined quantity introduced to reveal when high success rates equal random performance.

pith-pipeline@v0.9.0 · 5935 in / 1399 out tokens · 96153 ms · 2026-05-20T21:46:43.293809+00:00 · methodology

