The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection
Pith reviewed 2026-05-20 21:46 UTC · model grok-4.3
The pith
Near-perfect retrieval success often matches random selection once enough relevant items exist in the pool.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On the 20 Newsgroups dataset, BM25 and SPLADE both report greater than 99 percent success at K=100 under the coverage rule, yet BoR equals approximately zero, showing performance no better than a random baseline. BoR is defined as the base-two logarithm of observed success probability divided by the success probability under a hypergeometric null model for drawing at least one relevant document. The collapse occurs whenever the expected coverage ratio, K times average relevant documents per query divided by collection size, exceeds three to five; the same boundary explains degraded RAG accuracy at K=100 and vanishing selectivity in small-catalog LLM tool use.
What carries the argument
Bits-over-Random (BoR), defined as log base 2 of observed success probability over the hypergeometric random baseline for the coverage success rule of at least one relevant item inside the top K.
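The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's code: the function names `p_rand_coverage` and `bits_over_random` are hypothetical, and the collection sizes in the usage note are made-up values, not figures from the paper.

```python
from math import comb, log2


def p_rand_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Hypergeometric baseline: P(at least one relevant item in K uniform
    draws without replacement from n_docs items, n_relevant of them relevant)."""
    if n_relevant <= 0 or k <= 0:
        return 0.0
    k = min(k, n_docs)
    # math.comb returns 0 when k > n_docs - n_relevant, so coverage becomes certain.
    return 1.0 - comb(n_docs - n_relevant, k) / comb(n_docs, k)


def bits_over_random(p_obs: float, n_docs: int, n_relevant: int, k: int) -> float:
    """BoR = log2(P_obs / P_rand) for the coverage success rule."""
    return log2(p_obs / p_rand_coverage(n_docs, n_relevant, k))


# Illustrative (hypothetical) settings, not values from the paper:
sparse = bits_over_random(0.90, n_docs=10_000, n_relevant=1, k=100)
dense = bits_over_random(0.99, n_docs=10_000, n_relevant=500, k=100)
print(f"sparse: {sparse:.2f} bits, dense: {dense:.3f} bits")
```

With one relevant document in 10,000 and K=100, the baseline is exactly K/N = 0.01, so a 90% success rate carries several bits of selectivity. With 500 relevant documents the baseline alone exceeds 0.99, so a 99% success rate leaves BoR near zero.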
If this is right
- LLM accuracy in retrieval-augmented generation degrades at K=100 in line with near-zero BoR.
- BoR stays positive on BEIR/SciFact and on MS MARCO, where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap.
- The selectivity collapse appears in LLM agent tool selection once the catalog is small enough that the expected coverage ratio exceeds the boundary, even with a perfect selector.
- Reporting BoR alongside traditional metrics prevents over-reliance on high success rates that carry no extra selectivity.
- Retrieval depth should be limited once additional items add negligible selectivity but raise compute cost.
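The tool-selection point in the list above can be made concrete with a small sketch. Since P_obs cannot exceed 1, the largest BoR any selector can reach is -log2(P_rand). The helper name `bor_ceiling` and the catalog sizes below are hypothetical, chosen only to show how the ceiling shrinks as the catalog gets smaller.

```python
from math import comb, log2


def bor_ceiling(catalog_size: int, n_acceptable: int, k: int) -> float:
    """Upper bound on BoR for a perfect selector (P_obs = 1) under the
    coverage rule: -log2 of the hypergeometric random baseline."""
    p_rand = 1.0 - comb(catalog_size - n_acceptable, k) / comb(catalog_size, k)
    return -log2(p_rand)


# Hypothetical setting: 2 acceptable tools, 5 tools surfaced to the model.
for catalog_size in (10, 100, 1000):
    ceiling = bor_ceiling(catalog_size, n_acceptable=2, k=5)
    print(f"catalog={catalog_size:5d}  ceiling={ceiling:.3f} bits")
```

Under these assumed numbers the ceiling falls from several bits for a 1,000-tool catalog to well under one bit for a 10-tool catalog, which is the sense in which selectivity vanishes even when the selector never misses.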
Where Pith is reading between the lines
- Systems that maximize K without checking the coverage ratio may waste computation on results that add no real information.
- The same boundary likely applies to any selection task where the fraction of acceptable items is moderate and K is not tiny.
- Re-ranking or filtering steps could restore selectivity after an initial broad retrieval that already saturates the random baseline.
Load-bearing premise
The hypergeometric distribution accurately represents the null model of random document selection for the chosen success rule across the tested datasets.
What would settle it
Empirically sampling documents uniformly at random without replacement on the 20 Newsgroups collection at K=100 and checking whether the observed coverage rate matches the hypergeometric prediction to within a few percent.
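A check of this kind could look like the following Monte Carlo sketch. It uses a synthetic collection rather than 20 Newsgroups, and the sizes (2,000 documents, 20 relevant, K=100), the trial count, and the function names are all assumptions made for illustration.

```python
import random
from math import comb


def hypergeom_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Closed-form P(at least one relevant in K draws without replacement)."""
    return 1.0 - comb(n_docs - n_relevant, k) / comb(n_docs, k)


def simulated_coverage(n_docs: int, n_relevant: int, k: int,
                       trials: int = 10_000, seed: int = 0) -> float:
    """Empirical coverage rate of uniform random top-K selections."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Document ids 0..n_docs-1; ids below n_relevant are treated as relevant.
        sample = rng.sample(range(n_docs), k)
        if any(doc_id < n_relevant for doc_id in sample):
            hits += 1
    return hits / trials


predicted = hypergeom_coverage(2_000, 20, 100)
observed = simulated_coverage(2_000, 20, 100)
print(f"predicted={predicted:.3f}  observed={observed:.3f}")
```

Running the same comparison on the real collection would additionally require the per-query relevant counts, which this summary does not report.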
Original abstract
For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as $BoR = \log_{2}\left(\frac{\mathrm{P}_{obs}}{\mathrm{P}_{rand}}\right)$, where $\mathrm{P}_{rand}$ is the hypergeometric baseline for the chosen success rule (here, coverage: $ \geq1 $ relevant in top-$K$). On the 20 Newsgroups dataset, BM25 and SPLADE both report $>99$% success at $K=100$ (coverage), yet $BoR \approx 0$, indicating random-level selectivity at that depth. When the expected coverage ratio $\left(\frac{K \cdot \bar{R}_{q}}{N}\right)$ exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at $K=100$, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Bits-over-Random (BoR), a chance-corrected selectivity metric defined as BoR = log₂(P_obs / P_rand), where P_rand is the probability of covering at least one relevant document in the top-K under a hypergeometric null model. It reports that on the 20 Newsgroups dataset both BM25 and SPLADE exceed 99% coverage at K=100 yet yield BoR ≈ 0, indicating random-level selectivity. The paper shows that selectivity collapses when the expected coverage ratio K · R_q / N exceeds 3-5, links this to degraded RAG accuracy at large K, and extends the saturation logic to LLM tool selection with small catalogs, recommending that BoR be reported alongside conventional metrics.
Significance. If the empirical results hold, the work supplies a theoretically grounded, parameter-free way to detect when high nominal success rates convey no additional selectivity. The hypergeometric derivation is exact for the stated success rule and the saturation threshold follows directly from the CDF, providing a falsifiable prediction that is confirmed across 20 Newsgroups, BEIR/SciFact, and MS MARCO. This is especially timely for RAG and agent settings where downstream consumers cannot filter noise.
minor comments (3)
- [Abstract] The notation shifts from R_q to \bar{R}_q when defining the coverage ratio; explicitly state whether the bar denotes the query-wise average and provide the numerical value of \bar{R}_q for 20 Newsgroups so readers can reproduce the BoR ≈ 0 claim.
- [Results] The manuscript states that 41 systems on MS MARCO cluster within 0.2 bits of the theoretical ceiling; include a brief table or figure showing the range of recall values and corresponding BoR scores to substantiate the claim that recall gaps do not translate into selectivity gaps.
- [Methods] Clarify in the methods section how P_obs is estimated from finite runs (e.g., number of queries, tie-breaking, or smoothing) so that the reported BoR values can be independently verified.
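To illustrate what the last comment is asking for, here is one plausible way to estimate P_obs and BoR from a finite set of query runs. It is a sketch under stated assumptions, not the paper's procedure: the function names are hypothetical, P_obs is taken as the fraction of queries with a hit, and P_rand is averaged per query, which may differ from a baseline computed from the average relevant count.

```python
from math import comb, log2


def coverage_hit(ranked_ids, relevant_ids, k):
    """True if at least one relevant id appears in the top-k of the ranking."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def estimate_bor(runs, n_docs, k):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per query.
    Returns (p_obs, p_rand, bor). Averaging P_rand per query is an assumption."""
    n_queries = len(runs)
    p_obs = sum(coverage_hit(ranked, rel, k) for ranked, rel in runs) / n_queries
    p_rand = sum(
        1.0 - comb(n_docs - len(rel), k) / comb(n_docs, k) for _, rel in runs
    ) / n_queries
    # log2(0) is undefined; report -inf when no query succeeds.
    bor = log2(p_obs / p_rand) if p_obs > 0 else float("-inf")
    return p_obs, p_rand, bor


# Toy data (made up): 10 documents, 2 queries with one relevant document each.
toy_runs = [
    ([3, 1, 7, 0], {1}),  # relevant doc at rank 2: hit at k=2
    ([0, 2, 4, 6], {9}),  # relevant doc never retrieved: miss
]
p_obs, p_rand, bor = estimate_bor(toy_runs, n_docs=10, k=2)
print(f"P_obs={p_obs:.2f}  P_rand={p_rand:.2f}  BoR={bor:.2f} bits")
```

Tie-breaking and smoothing are left out here because the summary gives no information about how the authors handle them.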
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of our work, as well as for recognizing its timeliness for RAG and LLM agent applications. We appreciate the recommendation for minor revision and the confirmation that the hypergeometric derivation and saturation predictions are exact and falsifiable.
Circularity Check
No significant circularity identified
The paper introduces BoR as an explicit definition BoR = log2(P_obs / P_rand) using the standard hypergeometric null model for the coverage rule (at least one relevant in top-K). The saturation claim when K · R_q / N exceeds 3-5 follows directly from the hypergeometric CDF 1 - C(N-R, K)/C(N, K) approaching 1 under the finite-population sampling model; this is a mathematical consequence of the chosen null rather than a fitted parameter or self-referential loop. Empirical results on 20 Newsgroups, BEIR, and MS MARCO are presented as observations consistent with the baseline, not as predictions that reduce to the same inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems are invoked to justify the core measure or its implications. The derivation chain is therefore self-contained against external statistical benchmarks.
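The saturation step can be written out as a short worked bound. This is a sketch under simplifying assumptions: a single relevant count R per query stands in for the paper's average, and K ≤ N − R. It uses only the elementary inequality 1 − x ≤ e^{−x} together with R/(N − i) ≥ R/N.

```latex
\begin{aligned}
P_{rand} &= 1 - \frac{\binom{N-R}{K}}{\binom{N}{K}}
          = 1 - \prod_{i=0}^{K-1}\left(1 - \frac{R}{N-i}\right)
          \;\geq\; 1 - e^{-KR/N}, \\
BoR &= \log_{2}\frac{P_{obs}}{P_{rand}}
     \;\leq\; -\log_{2} P_{rand}
     \;\leq\; -\log_{2}\left(1 - e^{-KR/N}\right).
\end{aligned}
```

At a coverage ratio KR/N = 3 the baseline is at least 1 − e^{−3} ≈ 0.950, so BoR cannot exceed about 0.074 bits; at KR/N = 5 the baseline is at least 0.993 and the ceiling drops to about 0.010 bits. This is consistent with the stated 3-5 boundary.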
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The hypergeometric distribution models the probability of retrieving at least one relevant document under random selection for the coverage success rule.
invented entities (1)
- Bits-over-Random (BoR) metric: no independent evidence