The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
Pith reviewed 2026-05-10 18:52 UTC · model grok-4.3
The pith
Recent systems that incorporate large language models post sizable effectiveness gains on two long-standing IR benchmarks, but an adapted contamination check shows that some of those gains may stem from memorization of test data rather than genuine methodological advances.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LLM effect appears in the published record: recent LLM-augmented runs outperform the prior best results by 8.8% nDCG@10 on DL20 and by approximately 20% on Robust04, yet the same record contains detectable contamination on both collections. After contaminated topics are excluded, effectiveness drops, but the confidence intervals remain too wide to separate methodological improvement from memorization of pre-training data.
What carries the argument
A longitudinal meta-analysis of 143 papers on the Robust04 and DL20 collections, paired with an adaptation of a data-contamination detector applied directly to reranking outputs.
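The baseline-strength side of a longitudinal meta-analysis can be sketched as follows. This is a minimal illustration, not the paper's procedure: the `records` values, the `weak_baselines` helper, and the rule "a baseline is weak if it scores below the best result reported in any earlier year" are all assumptions made for the example.

```python
# Hypothetical records: (year, reported nDCG@10, baseline nDCG@10 used in that paper).
# These numbers are illustrative only and do not come from the paper.
records = [
    (2020, 0.70, 0.65),
    (2021, 0.72, 0.66),
    (2022, 0.71, 0.72),
    (2023, 0.76, 0.68),
    (2024, 0.78, 0.76),
]

def weak_baselines(records):
    """Return (year, baseline, best_prior) for baselines below the best earlier result."""
    # Best reported score per publication year.
    best_by_year = {}
    for year, score, _ in records:
        best_by_year[year] = max(score, best_by_year.get(year, float("-inf")))

    flagged = []
    for year, _, baseline in records:
        # Only results from strictly earlier years were available to compare against.
        prior = [s for y, s in best_by_year.items() if y < year]
        if prior and baseline < max(prior):
            flagged.append((year, baseline, max(prior)))
    return flagged

for year, baseline, best_prior in weak_baselines(records):
    print(f"{year}: baseline {baseline:.2f} is below best prior result {best_prior:.2f}")
```

A real analysis would also need to align evaluation settings (metric cutoff, judged pool, retrieval versus reranking) before comparing scores across papers.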
If this is right
- If the LLM effect survives contamination controls, then LLM components constitute a genuine step change in retrieval effectiveness on these collections.
- If contamination accounts for most of the lift, then current benchmark scores overestimate real progress and comparisons across papers become unreliable.
- Excluding contaminated topics already reduces the measured gains, showing that benchmark cleanliness directly affects reported numbers.
- Wide confidence intervals after exclusion indicate that larger or cleaner test sets are required before firm conclusions can be drawn.
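The interaction between topic exclusion and interval width can be sketched with a percentile bootstrap over per-topic scores. Everything here is assumed for illustration: the per-topic values, the `flagged` set, and the `bootstrap_ci` helper are hypothetical, and the paper may compute its intervals differently.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-topic scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample topics with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-topic nDCG@10 scores (illustrative, not from the paper).
per_topic = {
    "t01": 0.82, "t02": 0.64, "t03": 0.91, "t04": 0.55, "t05": 0.73,
    "t06": 0.88, "t07": 0.47, "t08": 0.79, "t09": 0.69, "t10": 0.93,
}
# Hypothetical set of topics a contamination detector flagged.
flagged = {"t03", "t06", "t10"}

all_scores = list(per_topic.values())
clean_scores = [s for t, s in per_topic.items() if t not in flagged]

ci_all = bootstrap_ci(all_scores)
ci_clean = bootstrap_ci(clean_scores)
print(f"all topics:   mean={mean(all_scores):.3f}, 95% CI=({ci_all[0]:.3f}, {ci_all[1]:.3f})")
print(f"clean topics: mean={mean(clean_scores):.3f}, 95% CI=({ci_clean[0]:.3f}, {ci_clean[1]:.3f})")
```

With only a few dozen topics per collection, removing flagged topics leaves fewer samples to resample from, which is one reason intervals after exclusion tend to stay wide.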
Where Pith is reading between the lines
- New benchmark collections that are explicitly held out from all public training data would let the field test whether LLM gains persist once leakage is removed.
- Routine contamination audits could become standard practice when new models are evaluated on older test sets.
- The choice of which topics to retain after auditing may itself influence which methods appear strongest, suggesting a need for transparent reporting of exclusion criteria.
Load-bearing premise
The adapted contamination detector accurately flags cases of memorization in reranking runs and does so with few false positives.
What would settle it
Re-running the meta-analysis on a newly constructed benchmark whose documents and queries have never appeared in any public pre-training corpus and checking whether the LLM effect size shrinks to zero.
Original abstract
Benchmark collections have long enabled controlled comparison and cumulative progress in Information Retrieval (IR). However, prior meta-analyses have shown that reported effectiveness gains often fail to accumulate, in part due to the use of weak or outdated baselines. While large language models are increasingly used in retrieval pipelines, their impact on established IR benchmarks has not been systematically analyzed. In this study, we analyze 143 publications reporting results on the TREC Robust04 collection and the TREC Deep Learning 2020 (DL20) passage retrieval benchmark to examine longitudinal trends in retrieval effectiveness and baseline strength. We observe what we term an LLM effect: recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks. While excluding contaminated topics reduces effectiveness, confidence intervals remain wide, making it difficult to determine whether the LLM effect reflects genuine methodological advances or memorization from pretraining data.
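For readers unfamiliar with the metric named in the abstract, nDCG@10 can be sketched as below. This is one common formulation (linear gain, log2 rank discount); some tools use an exponential gain instead, and the abstract does not say which variant the surveyed papers used. The function names and the example relevance grades are assumptions for illustration.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k graded relevance values."""
    # Rank positions are 1-based, so position i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, judged_gains, k=10):
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(judged_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for one topic, in the order the system ranked them.
system_ranking = [3, 2, 0, 1]
# All judged grades for that topic (hypothetical).
judged = [3, 2, 1, 0]

print(f"nDCG@10 = {ndcg_at_k(system_ranking, judged):.4f}")
```

A collection-level score is then the mean of this value over all topics.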
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper performs a meta-analysis of 143 publications reporting results on the TREC Robust04 and TREC Deep Learning 2020 (DL20) passage retrieval benchmarks. It documents an 'LLM effect' in which recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 relative to the best TREC 2020 result and approximately 20% higher on Robust04 since 2023. Adapting an existing data contamination detection method to reranking outputs, the authors identify measurable contamination in both collections; excluding contaminated topics reduces reported effectiveness, yet wide confidence intervals prevent a firm conclusion on whether the observed gains reflect genuine methodological progress or memorization of pretraining data.
Significance. If the central observations hold after addressing the noted methodological gaps, the work is significant for the IR community. It extends prior meta-analyses on benchmark stagnation by quantifying the scale of LLM-driven gains on two canonical collections and by surfacing contamination as a plausible confounder. The explicit qualification that wide confidence intervals preclude attribution to either advance or memorization supplies a falsifiable framing that can guide future controlled experiments on LLM-augmented retrieval.
major comments (1)
- The adaptation of the contamination detection method to reranking outputs is presented only at a high level. No calibration on uncontaminated controls, false-positive rate estimates, or threshold-selection procedure is reported. Because the paper's final claim—that it is difficult to separate genuine advances from memorization—rests directly on the reliability of the excluded-topic analysis, this omission is load-bearing for the central conclusion.
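The kind of calibration the referee asks for can be sketched as follows: run the detector on items known to be uncontaminated, then pick the loosest threshold whose false-positive rate on those controls stays under a target. This is a hypothetical illustration only; `control_scores`, the "higher score means more suspicious" convention, and both helper functions are assumptions and not part of the paper or of the detector it adapts.

```python
def false_positive_rate(control_scores, threshold):
    """Fraction of known-clean items that would be flagged at this threshold."""
    flagged = sum(1 for s in control_scores if s >= threshold)
    return flagged / len(control_scores)

def pick_threshold(control_scores, max_fpr=0.05):
    """Smallest observed score usable as a threshold with FPR <= max_fpr on the controls."""
    for t in sorted(set(control_scores)):
        if false_positive_rate(control_scores, t) <= max_fpr:
            return t
    # No observed score is strict enough: flag nothing.
    return float("inf")

# Hypothetical detector scores on items assumed to be uncontaminated.
control_scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.31, 0.40, 0.55, 0.90]

threshold = pick_threshold(control_scores, max_fpr=0.10)
fpr = false_positive_rate(control_scores, threshold)
print(f"threshold={threshold}, estimated FPR on controls={fpr:.2f}")
```

An estimate from ten controls is very coarse; a usable calibration would need a much larger clean set and an interval around the estimated rate.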
minor comments (1)
- The selection criteria that reduced the literature to exactly 143 publications are not stated explicitly; a supplementary table listing inclusion/exclusion decisions per paper would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for recognizing the potential significance of the work for the IR community. We address the single major comment below and will revise the manuscript accordingly to improve methodological transparency.
Referee: The adaptation of the contamination detection method to reranking outputs is presented only at a high level. No calibration on uncontaminated controls, false-positive rate estimates, or threshold-selection procedure is reported. Because the paper's final claim—that it is difficult to separate genuine advances from memorization—rests directly on the reliability of the excluded-topic analysis, this omission is load-bearing for the central conclusion.
Authors: We agree that the current presentation of the adapted contamination detection method is high-level and that additional detail would strengthen the manuscript. The method follows the core procedure from the source paper with targeted modifications to operate on reranking outputs (specifically, scanning the top-k documents returned by each system for contamination signals rather than initial retrieval scores). In the revised version we will expand the relevant section to include: (1) a step-by-step description of the adaptations made for reranking, (2) the precise threshold-selection rule employed (chosen to match the sensitivity level reported in the original work), and (3) an explicit discussion of the method’s known limitations, including the false-positive rates documented by its authors. We did not conduct new calibration experiments on uncontaminated controls, as our study is a meta-analysis of published results rather than a controlled contamination study; we will state this limitation clearly. These additions will make the reliability of the excluded-topic analysis more transparent while preserving the cautious interpretation already present in the paper (wide confidence intervals after exclusion).
Revision: yes
Circularity Check
No significant circularity in observational meta-analysis
This paper performs a meta-analysis by aggregating reported nDCG@10 results from 143 publications on two fixed TREC benchmarks and applies an adapted version of a prior contamination detection method. No equations, derivations, or model predictions are defined; the 'LLM effect' is simply the arithmetic difference between recent reported scores and historical TREC baselines. The contamination analysis is presented as an observational finding with explicit caveats about wide confidence intervals and does not reduce any central claim to a fitted parameter or self-referential definition. No load-bearing self-citation chains or ansatzes imported from the authors' prior work are required to support the reported trends.
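The arithmetic behind a reported effect of this kind is small enough to show directly. The sketch below uses made-up scores, and it distinguishes relative from absolute gain because the passage above does not state which of the two the paper's percentages refer to; the function names are illustrative.

```python
def absolute_gain(new_score, baseline_score):
    """Difference in metric points between a new result and a baseline."""
    return new_score - baseline_score

def relative_gain(new_score, baseline_score):
    """Improvement expressed as a fraction of the baseline score."""
    return (new_score - baseline_score) / baseline_score

# Hypothetical nDCG@10 values (not taken from the paper).
baseline = 0.80
recent = 0.87

print(f"absolute gain: {absolute_gain(recent, baseline):.3f}")
print(f"relative gain: {relative_gain(recent, baseline):.1%}")
```

The two readings diverge more as the baseline score gets lower, so a meta-analysis needs to apply one convention consistently across papers.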
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem. The passage:
We observe what we term an LLM effect: recent systems incorporating LLM components achieve 8.8% higher nDCG@10 on DL20 compared to the best result from TREC 2020 and approximately 20% higher on Robust04 since 2023. However, adapting a data contamination detection approach to reranking reveals measurable contamination in both benchmarks.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.