arxiv: 2604.05766 · v1 · submitted 2026-04-07 · 💻 cs.IR

The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination

Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3

classification 💻 cs.IR
keywords LLM · information retrieval · benchmarks · data contamination · meta-analysis · nDCG · TREC · reranking

The pith

Recent systems that incorporate large language models post sizable effectiveness gains on two long-standing IR benchmarks, but an adapted contamination check shows that some of those gains may stem from memorization of test data rather than genuine methodological advances.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews results from 143 publications that report on the Robust04 and DL20 passage retrieval collections. It documents an LLM effect in which systems built with LLM components reach 8.8 percent higher nDCG@10 on DL20 than the 2020 TREC best and roughly 20 percent higher on Robust04 after 2023. When an existing contamination-detection technique is adjusted for reranking, measurable overlap appears between benchmark content and pre-training data. Removing the flagged topics lowers reported scores, yet the remaining confidence intervals stay wide enough that the authors cannot decide whether the observed gains represent genuine retrieval advances or simply leakage. The uncertainty matters because benchmark numbers are still used to decide which methods represent progress.

Core claim

The central claim is that an LLM effect appears in the published record—recent LLM-augmented runs outperform the prior best results by the margins stated above—yet the same record contains detectable contamination on both collections. After contaminated topics are excluded, effectiveness drops, but the confidence intervals remain too wide to separate methodological improvement from memorization of pre-training data.
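To make the quantities in this claim concrete, here is a minimal Python sketch of nDCG@10 and of a relative gain over a baseline. It assumes linear gains (some tools use an exponential gain instead) and assumes the reported percentages are relative improvements. The scores 0.500 and 0.544 are hypothetical placeholders chosen only to reproduce an 8.8 percent gain; they are not values from the paper.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, all_judged_gains, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


def relative_gain(new_score, baseline_score):
    """Relative improvement of new_score over baseline_score, in percent."""
    return 100.0 * (new_score - baseline_score) / baseline_score


# Hypothetical scores, NOT taken from the paper.
baseline_ndcg = 0.500
llm_run_ndcg = 0.544
print(round(relative_gain(llm_run_ndcg, baseline_ndcg), 1))
```

A relative gain depends on which baseline is chosen, which is why the strength of the baseline matters as much as the new score.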

What carries the argument

A longitudinal meta-analysis of 143 papers on the Robust04 and DL20 collections, paired with an adaptation of a data-contamination detector applied directly to reranking outputs.

If this is right

  • If the LLM effect survives contamination controls, then LLM components constitute a genuine step change in retrieval effectiveness on these collections.
  • If contamination accounts for most of the lift, then current benchmark scores overestimate real progress and comparisons across papers become unreliable.
  • Excluding contaminated topics already reduces the measured gains, showing that benchmark cleanliness directly affects reported numbers.
  • Wide confidence intervals after exclusion indicate that larger or cleaner test sets are required before firm conclusions can be drawn.
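The interplay between topic exclusion and interval width can be sketched with a percentile bootstrap over per-topic scores. This is an illustrative sketch only: the paper's interval method is not described here, and the topic identifiers, scores, and flagged set below are hypothetical.

```python
import random
import statistics


def bootstrap_ci(per_topic_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean per-topic score."""
    rng = random.Random(seed)
    n = len(per_topic_scores)
    means = sorted(
        statistics.fmean(rng.choices(per_topic_scores, k=n))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mean_with_ci(scores_by_topic, excluded=frozenset()):
    """Mean and bootstrap CI over the topics that are not in `excluded`."""
    kept = [s for topic, s in scores_by_topic.items() if topic not in excluded]
    return statistics.fmean(kept), bootstrap_ci(kept)


# Hypothetical per-topic nDCG@10 scores and a hypothetical flagged set.
scores = {"t1": 0.91, "t2": 0.40, "t3": 0.75, "t4": 0.62, "t5": 0.88, "t6": 0.33}
flagged = {"t1", "t5"}

print(mean_with_ci(scores))
print(mean_with_ci(scores, excluded=flagged))
```

Dropping topics shrinks the sample, so the interval around the remaining mean tends to widen, which is the situation the last bullet describes.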

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • New benchmark collections that are explicitly held out from all public training data would let the field test whether LLM gains persist once leakage is removed.
  • Routine contamination audits could become standard practice when new models are evaluated on older test sets.
  • The choice of which topics to retain after auditing may itself influence which methods appear strongest, suggesting a need for transparent reporting of exclusion criteria.

Load-bearing premise

The adapted contamination detector accurately flags cases of memorization in reranking runs and does so with few false positives.

What would settle it

Re-running the meta-analysis on a newly constructed benchmark whose documents and queries have never appeared in any public pre-training corpus and checking whether the LLM effect size shrinks to zero.

Figures

Figures reproduced from arXiv: 2604.05766 by Allan Hanbury, Moritz Staudinger, Wojciech Kusa.

Figure 1: Robust04 MAP results between 2005 and 2025. Regression lines show trends based on best reported results. Empty ...
Figure 2: Robust04 nDCG@10 results between 2021 and 2025.
Original abstract

Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an LLM effect: recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. This paper performs a meta-analysis of 143 publications reporting results on the TREC Robust04 and TREC Deep Learning 2020 (DL20) passage retrieval benchmarks. It documents an 'LLM effect' in which recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 relative to the best TREC 2020 result and approximately 20% higher on Robust04 since 2023. Adapting an existing data contamination detection method to reranking outputs, the authors identify measurable contamination in both collections; excluding contaminated topics reduces reported effectiveness, yet wide confidence intervals prevent a firm conclusion on whether the observed gains reflect genuine methodological progress or memorization of pretraining data.

Significance. If the central observations hold after addressing the noted methodological gaps, the work is significant for the IR community. It extends prior meta-analyses on benchmark stagnation by quantifying the scale of LLM-driven gains on two canonical collections and by surfacing contamination as a plausible confounder. The explicit qualification that wide confidence intervals preclude attribution to either advance or memorization supplies a falsifiable framing that can guide future controlled experiments on LLM-augmented retrieval.

major comments (1)
  1. The adaptation of the contamination detection method to reranking outputs is presented only at a high level. No calibration on uncontaminated controls, false-positive rate estimates, or threshold-selection procedure is reported. Because the paper's final claim—that it is difficult to separate genuine advances from memorization—rests directly on the reliability of the excluded-topic analysis, this omission is load-bearing for the central conclusion.
minor comments (1)
  1. The selection criteria that reduced the literature to exactly 143 publications are not stated explicitly; a supplementary table listing inclusion/exclusion decisions per paper would improve reproducibility.
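The calibration that the major comment asks for could look roughly like the sketch below: score a set of known-clean control topics, then choose the flagging threshold whose empirical false-positive rate stays at or below a target. This is a generic illustration, not the paper's detector. The notion of a numeric per-topic contamination score, the function names, and the control values are all assumptions made for the example.

```python
def empirical_fpr(control_scores, threshold):
    """Fraction of known-clean controls that would be flagged at `threshold`."""
    flagged = sum(1 for s in control_scores if s >= threshold)
    return flagged / len(control_scores)


def threshold_for_fpr(control_scores, target_fpr=0.05):
    """Smallest observed score whose empirical FPR on the controls is <= target_fpr."""
    for candidate in sorted(set(control_scores)):
        if empirical_fpr(control_scores, candidate) <= target_fpr:
            return candidate
    return float("inf")  # no observed score meets the target: flag nothing


# Hypothetical detector scores on topics assumed to be uncontaminated.
clean_controls = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
threshold = threshold_for_fpr(clean_controls, target_fpr=0.2)
print(threshold, empirical_fpr(clean_controls, threshold))
```

With few controls the achievable false-positive rates are coarse (steps of 1/n), so a small control set limits how tightly the threshold can be calibrated.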

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of the work for the IR community. We address the single major comment below and will revise the manuscript accordingly to improve methodological transparency.

  1. Referee: The adaptation of the contamination detection method to reranking outputs is presented only at a high level. No calibration on uncontaminated controls, false-positive rate estimates, or threshold-selection procedure is reported. Because the paper's final claim—that it is difficult to separate genuine advances from memorization—rests directly on the reliability of the excluded-topic analysis, this omission is load-bearing for the central conclusion.

    Authors: We agree that the current presentation of the adapted contamination detection method is high-level and that additional detail would strengthen the manuscript. The method follows the core procedure from the source paper with targeted modifications to operate on reranking outputs (specifically, scanning the top-k documents returned by each system for contamination signals rather than initial retrieval scores). In the revised version we will expand the relevant section to include: (1) a step-by-step description of the adaptations made for reranking, (2) the precise threshold-selection rule employed (chosen to match the sensitivity level reported in the original work), and (3) an explicit discussion of the method’s known limitations, including the false-positive rates documented by its authors. We did not conduct new calibration experiments on uncontaminated controls, as our study is a meta-analysis of published results rather than a controlled contamination study; we will state this limitation clearly. These additions will make the reliability of the excluded-topic analysis more transparent while preserving the cautious interpretation already present in the paper (wide confidence intervals after exclusion).

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in observational meta-analysis

This paper performs a meta-analysis by aggregating reported nDCG@10 results from 143 publications on two fixed TREC benchmarks and applies an adapted version of a prior contamination detection method. No equations, derivations, or model predictions are defined; the 'LLM effect' is simply the arithmetic difference between recent reported scores and historical TREC baselines. The contamination analysis is presented as an observational finding with explicit caveats about wide confidence intervals and does not reduce any central claim to a fitted parameter or self-referential definition. No load-bearing self-citation chains or ansatzes imported from the authors' prior work are required to support the reported trends.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the representativeness of the 143 selected publications and on the validity of the adapted contamination detection procedure; the abstract introduces no new free parameters, axioms, or invented entities beyond standard statistical aggregation.

pith-pipeline@v0.9.0 · 5500 in / 1220 out tokens · 63509 ms · 2026-05-10T18:52:09.198342+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    We observe what we term an LLM effect: recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
