
arxiv: 2605.18857 · v1 · pith:PCIWNIXInew · submitted 2026-05-14 · 💻 cs.IR · cs.AI · cs.LG

The 99% Success Paradox: When Near-Perfect Retrieval Equals Random Selection

Pith reviewed 2026-05-20 21:46 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.LG
keywords retrieval selectivity · chance-corrected metrics · BoR · hypergeometric baseline · RAG evaluation · coverage paradox · LLM tool selection · information retrieval

The pith

Near-perfect retrieval success often matches random selection once enough relevant items exist in the pool.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Bits-over-Random (BoR) as a way to measure genuine selectivity beyond raw success rates. On the 20 Newsgroups collection, both BM25 and SPLADE reach over 99 percent coverage (at least one relevant document inside the top 100 results), yet BoR sits near zero because a random draw already succeeds at that depth. Selectivity vanishes once the expected coverage ratio, that is, the expected number of relevant documents in a random draw of K, exceeds roughly three to five. The same pattern appears in retrieval-augmented generation, where LLM accuracy can degrade substantially at K=100, and in LLM agent tool selection when catalog sizes are small. The authors therefore propose reporting BoR together with conventional metrics and choosing retrieval depths that preserve actual selectivity rather than merely inflating coverage.

Core claim

On the 20 Newsgroups dataset, BM25 and SPLADE both report greater than 99 percent success at K=100 under the coverage rule, yet BoR equals approximately zero, showing performance no better than a random baseline. BoR is defined as the base-two logarithm of observed success probability divided by the success probability under a hypergeometric null model for drawing at least one relevant document. The collapse occurs whenever the expected coverage ratio, K times average relevant documents per query divided by collection size, exceeds three to five; the same boundary explains degraded RAG accuracy at K=100 and vanishing selectivity in small-catalog LLM tool use.

What carries the argument

Bits-over-Random (BoR), defined as log base 2 of observed success probability over the hypergeometric random baseline for the coverage success rule of at least one relevant item inside the top K.
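
The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `p_rand_coverage` and `bits_over_random`, and the collection size, relevant count, and observed success rate in the demo loop, are all assumptions chosen for readability.

```python
import math


def p_rand_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Chance of drawing >= 1 relevant item when k of n_docs items are
    sampled uniformly at random without replacement (hypergeometric null)."""
    if not 0 <= n_relevant <= n_docs or not 0 <= k <= n_docs:
        raise ValueError("need 0 <= n_relevant <= n_docs and 0 <= k <= n_docs")
    p_miss = 1.0
    # P(no relevant item) = prod_{i<k} (N - R - i) / (N - i), which equals
    # C(N-R, K) / C(N, K) without building huge binomial coefficients.
    for i in range(k):
        p_miss *= (n_docs - n_relevant - i) / (n_docs - i)
        if p_miss == 0.0:
            break  # every possible draw now contains a relevant item
    return 1.0 - p_miss


def bits_over_random(p_obs: float, p_rand: float) -> float:
    """BoR = log2(P_obs / P_rand); both probabilities must be positive."""
    if p_obs <= 0.0 or p_rand <= 0.0:
        raise ValueError("BoR is undefined for non-positive probabilities")
    return math.log2(p_obs / p_rand)


if __name__ == "__main__":
    # Hypothetical collection (not taken from the paper): 10,000 documents,
    # 50 relevant per query, and a fixed observed success rate of 0.99.
    for k in (1, 10, 100, 1000):
        p_rand = p_rand_coverage(10_000, 50, k)
        print(k, round(p_rand, 4), round(bits_over_random(0.99, p_rand), 3))
```

Because the observed success rate is held fixed in the demo, the printed BoR shrinks purely because the random baseline rises with K, which is the effect the metric is meant to expose.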

If this is right

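  • LLM accuracy in retrieval-augmented generation can degrade substantially at K=100, consistent with the near-zero BoR ceiling at that depth.
  • BoR stays positive on BEIR/SciFact and on MS MARCO, where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap.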
  • The selectivity collapse appears in LLM agent tool selection once catalog size makes the expected ratio exceed the boundary.
  • Reporting BoR alongside traditional metrics prevents over-reliance on high success rates that carry no extra selectivity.
  • Retrieval depth should be limited once additional items add negligible selectivity but raise compute cost.
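
One way to act on the last two points is to compute the BoR ceiling, the value BoR would take for a perfect selector with P_obs = 1, and cap the depth where that ceiling falls below a chosen floor. The sketch below assumes this reading; the helper names `bor_ceiling` and `max_selective_depth` and the `min_bits` threshold are hypothetical and do not come from the paper.

```python
import math


def bor_ceiling(n_items: int, n_relevant: int, k: int) -> float:
    """Upper bound on BoR at depth k: log2(1 / P_rand), reached when P_obs = 1."""
    p_miss = 1.0
    for i in range(k):
        # max(..., 0) keeps the product at zero once k exceeds N - R.
        p_miss *= max(n_items - n_relevant - i, 0) / (n_items - i)
    p_rand = 1.0 - p_miss
    if p_rand <= 0.0:
        return math.inf  # random selection can never succeed
    return math.log2(1.0 / p_rand)


def max_selective_depth(n_items: int, n_relevant: int,
                        min_bits: float, k_max: int) -> int:
    """Largest k <= k_max whose ceiling is still >= min_bits (0 if none)."""
    best = 0
    for k in range(1, k_max + 1):
        if bor_ceiling(n_items, n_relevant, k) >= min_bits:
            best = k
        else:
            break  # P_rand grows with k, so the ceiling never recovers
    return best


if __name__ == "__main__":
    # Hypothetical tool catalog: 20 tools, exactly one correct per request.
    for k in (1, 5, 10, 20):
        print(k, round(bor_ceiling(20, 1, k), 3))
    print(max_selective_depth(20, 1, min_bits=1.5, k_max=20))
```

With one correct tool among 20, chance success at K=5 is 5/20, so even a perfect selector cannot exceed 2 bits there, and the ceiling reaches zero once the whole catalog is returned.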

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Systems that maximize K without checking the coverage ratio may waste computation on results that add no real information.
  • The same boundary likely applies to any selection task where the fraction of acceptable items is moderate and K is not tiny.
  • Re-ranking or filtering steps could restore selectivity after an initial broad retrieval that already saturates the random baseline.

Load-bearing premise

The hypergeometric distribution accurately represents the null model of random document selection for the chosen success rule across the tested datasets.

What would settle it

Empirically sampling documents uniformly at random without replacement on the 20 Newsgroups collection at K=100 and checking whether the observed coverage rate matches the hypergeometric prediction to within a few percent.
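
The check described above could be run along the following lines. This is a sketch under stated assumptions: the collection size, relevant count, depth, and trial count are toy values rather than 20 Newsgroups statistics, and `hypergeom_coverage` and `simulated_coverage` are illustrative names.

```python
import random


def hypergeom_coverage(n_docs: int, n_relevant: int, k: int) -> float:
    """Closed-form P(>= 1 relevant in k uniform draws without replacement)."""
    p_miss = 1.0
    for i in range(k):
        p_miss *= max(n_docs - n_relevant - i, 0) / (n_docs - i)
    return 1.0 - p_miss


def simulated_coverage(n_docs: int, n_relevant: int, k: int,
                       trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    # Under uniform sampling it does not matter which ids are relevant,
    # so the first n_relevant ids stand in for the relevant set.
    relevant = set(range(n_relevant))
    hits = 0
    for _ in range(trials):
        draw = rng.sample(range(n_docs), k)  # without replacement
        hits += any(doc_id in relevant for doc_id in draw)
    return hits / trials


if __name__ == "__main__":
    n_docs, n_relevant, k = 1000, 50, 20  # toy values, not from the paper
    predicted = hypergeom_coverage(n_docs, n_relevant, k)
    observed = simulated_coverage(n_docs, n_relevant, k, trials=20_000)
    print(f"predicted={predicted:.4f} observed={observed:.4f}")
```

On a real collection the same comparison would use the actual collection size and the per-query relevant counts from the relevance judgments.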

Figures

Figures reproduced from arXiv: 2605.18857 by Akash Vishwakarma, Ameya Gawde, Cien Zhang, Harshvardhan Singh, Michael Wyatt Thot, Svetlana Karslioglu, Tony Joseph, Vyzantinos Repantis.

Figure 1: BoR analysis on SciFact dataset shows sustained selectivity across retrieval depths. Both …
Figure 2: BoR analysis on 20 Newsgroups demonstrates the 99% success paradox. Both BM25 …
Original abstract

For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as $BoR = \log_{2}\left(\frac{\mathrm{P}_{obs}}{\mathrm{P}_{rand}}\right)$, where $\mathrm{P}_{rand}$ is the hypergeometric baseline for the chosen success rule (here, coverage: $ \geq1 $ relevant in top-$K$). On the 20 Newsgroups dataset, BM25 and SPLADE both report $>99$% success at $K=100$ (coverage), yet $BoR \approx 0$, indicating random-level selectivity at that depth. When the expected coverage ratio $\left(\frac{K \cdot \bar{R}_{q}}{N}\right)$ exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at $K=100$, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript introduces Bits-over-Random (BoR), a chance-corrected selectivity metric defined as BoR = log₂(P_obs / P_rand), where P_rand is the probability of covering at least one relevant document in the top-K under a hypergeometric null model. It reports that on the 20 Newsgroups dataset both BM25 and SPLADE exceed 99% coverage at K=100 yet yield BoR ≈ 0, indicating random-level selectivity. The paper shows that selectivity collapses when the expected coverage ratio K · R_q / N exceeds 3-5, links this to degraded RAG accuracy at large K, and extends the saturation logic to LLM tool selection with small catalogs, recommending that BoR be reported alongside conventional metrics.

Significance. If the empirical results hold, the work supplies a theoretically grounded, parameter-free way to detect when high nominal success rates convey no additional selectivity. The hypergeometric derivation is exact for the stated success rule and the saturation threshold follows directly from the CDF, providing a falsifiable prediction that is confirmed across 20 Newsgroups, BEIR/SciFact, and MS MARCO. This is especially timely for RAG and agent settings where downstream consumers cannot filter noise.

minor comments (3)
  1. [Abstract] The notation shifts from R_q to \bar{R}_q when defining the coverage ratio; explicitly state whether the bar denotes the query-wise average and provide the numerical value of \bar{R}_q for 20 Newsgroups so readers can reproduce the BoR ≈ 0 claim.
  2. [Results] The manuscript states that 41 systems on MS MARCO cluster within 0.2 bits of the theoretical ceiling; include a brief table or figure showing the range of recall values and corresponding BoR scores to substantiate the claim that recall gaps do not translate into selectivity gaps.
  3. [Methods] Clarify in the methods section how P_obs is estimated from finite runs (e.g., number of queries, tie-breaking, or smoothing) so that the reported BoR values can be independently verified.
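
To make comments 1 and 3 concrete, the sketch below shows one plausible convention: P_obs as the fraction of judged queries with at least one relevant document in the top K, and P_rand as the per-query hypergeometric baseline averaged over the same queries. The page does not say which convention the paper uses, so this is an assumption, as are the function names and the `runs`/`qrels` dictionary layout.

```python
def coverage_hit(ranked_ids, relevant_ids, k):
    """True if any of the top-k ranked ids is relevant."""
    return any(doc_id in relevant_ids for doc_id in ranked_ids[:k])


def estimate_p_obs(runs, qrels, k):
    """runs: {query_id: ranked list of ids}; qrels: {query_id: set of relevant ids}.
    Queries without any relevant document are skipped (an assumed convention)."""
    qids = [q for q, rel in qrels.items() if rel]
    if not qids:
        raise ValueError("no queries with relevance judgments")
    hits = sum(coverage_hit(runs.get(q, []), qrels[q], k) for q in qids)
    return hits / len(qids)


def mean_p_rand(qrels, n_docs, k):
    """Average the per-query hypergeometric baseline instead of plugging
    the mean relevant count into a single formula."""
    qids = [q for q, rel in qrels.items() if rel]
    if not qids:
        raise ValueError("no queries with relevance judgments")
    total = 0.0
    for q in qids:
        p_miss = 1.0
        for i in range(k):
            p_miss *= max(n_docs - len(qrels[q]) - i, 0) / (n_docs - i)
        total += 1.0 - p_miss
    return total / len(qids)


if __name__ == "__main__":
    # Toy data, purely illustrative.
    runs = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
    qrels = {"q1": {"c"}, "q2": {"z"}}
    print(estimate_p_obs(runs, qrels, k=3), mean_p_rand(qrels, n_docs=10, k=3))
```

Averaging per-query baselines and using a single baseline built from \bar{R}_q can give different numbers when relevant counts vary widely across queries, which is one reason the referee's request for the exact procedure matters.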

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive and accurate summary of our work, as well as for recognizing its timeliness for RAG and LLM agent applications. We appreciate the recommendation for minor revision and the confirmation that the hypergeometric derivation and saturation predictions are exact and falsifiable.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces BoR as an explicit definition BoR = log2(P_obs / P_rand) using the standard hypergeometric null model for the coverage rule (at least one relevant in top-K). The saturation claim when K · R_q / N exceeds 3-5 follows directly from the hypergeometric CDF 1 - C(N-R, K)/C(N, K) approaching 1 under the finite-population sampling model; this is a mathematical consequence of the chosen null rather than a fitted parameter or self-referential loop. Empirical results on 20 Newsgroups, BEIR, and MS MARCO are presented as observations consistent with the baseline, not as predictions that reduce to the same inputs by construction. No self-citation chains, ansatzes smuggled via prior work, or uniqueness theorems are invoked to justify the core measure or its implications. The derivation chain is therefore self-contained against external statistical benchmarks.
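
The saturation step can be written out explicitly. The bound below is a standard consequence of the hypergeometric formula quoted above, not a derivation taken from the paper; it treats R as the relevant count for a single query and writes λ for the coverage ratio K · R / N, so substituting the average \bar{R}_q is an approximation.

```latex
\begin{aligned}
P_{rand} &= 1 - \frac{\binom{N-R}{K}}{\binom{N}{K}}
          = 1 - \prod_{i=0}^{K-1}\left(1 - \frac{R}{N-i}\right) \\
         &\geq 1 - \left(1 - \frac{R}{N}\right)^{K}
          \geq 1 - e^{-\lambda},
          \qquad \lambda = \frac{K \cdot R}{N}, \\
BoR &= \log_{2}\left(\frac{P_{obs}}{P_{rand}}\right)
     \leq \log_{2}\left(\frac{1}{P_{rand}}\right)
     \leq -\log_{2}\left(1 - e^{-\lambda}\right).
\end{aligned}
```

At λ = 3 the right-hand side is about 0.07 bits and at λ = 5 about 0.01 bits, so within the stated 3-5 range even a perfect selector has almost no room above the random baseline.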

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating the hypergeometric distribution as the appropriate random baseline for coverage-based success and on the log-ratio definition of BoR itself. No free parameters fitted to target data are mentioned; the new metric is the primary addition.

axioms (1)
  • [domain assumption] The hypergeometric distribution models the probability of retrieving at least one relevant document under random selection for the coverage success rule.
    Invoked to compute P_rand in the BoR definition as the chance baseline.
invented entities (1)
  • Bits-over-Random (BoR) metric [no independent evidence]
    purpose: Quantify chance-corrected retrieval selectivity in bits
    Newly defined quantity introduced to reveal when high success rates equal random performance.


