Pith · machine review for the scientific record

arxiv: 2603.04816 · v2 · submitted 2026-03-05 · 💻 cs.IR

Recognition: 2 Lean theorem links

Scaling Laws for Cross-Encoder Reranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:05 UTC · model grok-4.3

classification 💻 cs.IR
keywords: scaling laws · cross-encoder reranking · information retrieval · model scaling · ranking metrics · compute allocation · MSMARCO · TREC DL

The pith

Cross-encoder rerankers follow power-law scaling with model size and training exposure, allowing forecasts of larger models from smaller runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that ranking quality in cross-encoder rerankers scales according to predictable power laws as model size and training data increase. This pattern holds across pointwise, pairwise, and listwise training objectives. The authors fit these laws on models up to 150 million parameters and use them to forecast performance for 400 million and 1 billion parameter rerankers on MSMARCO-dev and TREC DL. They also extract rules for splitting compute between model size and data volume, finding that data-heavy allocations often produce better retrieval metrics. The forecasts prove accurate and tend to be conservative, offering practical guidance for training large rerankers.

Core claim

Ranking quality for cross-encoder rerankers follows predictable power laws across model size and training exposure for pointwise, pairwise, and listwise objectives. Using data from models up to 150M parameters, the fitted scaling laws accurately forecast the performance of 400M and 1B parameter models on MSMARCO-dev and TREC DL. From the joint scaling law, compute-allocation rules are derived that frequently recommend data-heavy scaling over equal-compute checkpoints, though this depends on the objective.

What carries the argument

Joint power-law scaling relationships over model size and training exposure that extrapolate ranking metrics to unseen larger models.

If this is right

  • Larger rerankers can be forecast from smaller training runs, without training them in full.
  • Compute budgets often yield better metrics when allocated more to additional data than to larger models.
  • The scaling behavior remains consistent across pointwise, pairwise, and listwise objectives.
  • Forecasts are typically conservative, so planned large runs are unlikely to underperform expectations.
  • Industrial reranking systems can use these laws to plan expensive training more efficiently.
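The allocation logic can be illustrated with a toy joint law. Under the simplifying cost model compute ≈ N·D (an assumption here, not the paper's exact accounting), minimizing E(N, D) = A·N^-a + B·D^-b subject to N·D = C has a closed form, and with the illustrative constants below the optimum is strongly data-heavy.

```python
# Toy compute-allocation rule from an assumed joint law E = A*N^-a + B*D^-b
# under the simplified budget N * D = C. All constants are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

A, a, B, b = 0.5, 0.4, 2.0, 0.2     # illustrative coefficients and exponents
C = 1e17                            # illustrative compute budget (N * D fixed)

def err_at(log_n):
    n = np.exp(log_n)
    d = C / n                       # data budget implied by the constraint
    return A * n**-a + B * d**-b

# Numeric optimum over model size, with data set by the budget.
res = minimize_scalar(err_at, bounds=(np.log(1e3), np.log(1e12)),
                      method="bounded")
n_opt = np.exp(res.x)

# Closed form: setting dE/dN = 0 with D = C/N gives
# N* = ((a*A) / (b*B) * C**b) ** (1 / (a + b)), so the optimal model size
# grows as C^(b/(a+b)) and the optimal data budget as C^(a/(a+b)).
n_closed = ((a * A) / (b * B) * C**b) ** (1 / (a + b))
d_closed = C / n_closed             # here d_closed >> n_closed: data-heavy
```

Which side the optimum lands on depends on the fitted exponents, which matches the paper's caveat that the recommendation varies with the training objective.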

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same power-law approach could be tested on other retrieval architectures or first-stage indexes to check for similar predictability.
  • If the conservative bias persists at even larger scales, real systems may outperform forecasts and justify earlier investment in big rerankers.
  • The data-heavy preference suggests reranker training may benefit from mixing in more unlabeled or weakly labeled passages rather than solely scaling parameters.
  • Extending the study to non-English collections or different domains would test whether the scaling constants are universal.

Load-bearing premise

The power-law trends measured up to 150 million parameters continue without deviation or saturation at 400 million and 1 billion parameters.
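One standard empirical probe of this premise, using only the runs already trained, is to check whether the fitted exponent stays put as the upper end of the fitting range is truncated. The sketch below uses synthetic data that obeys a clean power law, so stability is built in; on real runs, a drifting exponent would warn of saturation.

```python
# Fitting-range sensitivity check on synthetic power-law data: refit the
# exponent while truncating the largest models and measure the spread.
import numpy as np

rng = np.random.default_rng(2)
n = np.geomspace(1e7, 1.5e8, 12)                     # model sizes up to 150M
y = 2.0 * n**-0.3 * np.exp(rng.normal(0, 0.01, 12))  # noisy power law

exponents = []
for cutoff in (1.5e8, 1.0e8, 6.0e7):   # shrink the upper end of the fit range
    mask = n <= cutoff
    slope, _ = np.polyfit(np.log(n[mask]), np.log(y[mask]), 1)
    exponents.append(-slope)

spread = max(exponents) - min(exponents)  # small spread = stable exponent
```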

What would settle it

Train a 400M or 1B parameter cross-encoder reranker on the same data regime and measure whether its actual MSMARCO or TREC DL scores fall outside the narrow band predicted by the fitted power laws.
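Operationally, "falls outside the narrow band" is a simple containment check. The scores and band half-width below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical settling test: does the measured large-model score land
# inside the band implied by the fitted law? Values are placeholders.
def within_band(observed, point_forecast, half_width):
    """True if the observed metric lies inside forecast +/- half_width."""
    return abs(observed - point_forecast) <= half_width

# e.g., a forecast NDCG@10 of 0.72 +/- 0.01 for a 1B reranker,
# against a hypothetical measured value of 0.725:
settled_in_favor = within_band(0.725, 0.72, 0.01)
```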

Figures

Figures reproduced from arXiv: 2603.04816 by Aman Bansal, Hamed Zamani, Kaustubh Dhole, Rahul Seetharaman.

Figure 1. Scaling behavior of NDCG@10 (panels a–c) and contrastive entropy (panels d–f) under model scaling (representative view).
Figure 2. On the TREC DL ’19 benchmark, model-scaling trends show that NDCG@10 and MAP scale predictably with model size.
Original abstract

Scaling laws are well studied for language models and first-stage retrieval, but not for reranking. We present the first systematic study of scaling laws for cross-encoder rerankers across pointwise, pairwise, and listwise objectives. Across model size and training exposure, ranking quality follows predictable power laws, enabling larger rerankers to be forecast from smaller runs. Using models up to 150M parameters, we forecast 400M and 1B rerankers on MSMARCO-dev and TREC DL. Beyond forecasting, we derive compute-allocation rules from the fitted joint scaling law and compare them with equal-compute checkpoints, showing that retrieval metrics often favor data-heavy scaling, though the recommendation depends on the training objective. The forecasts are accurate and typically conservative, making them useful for planning expensive large-model training. These results provide practical scaling principles for industrial reranking systems, and we will release code and evaluation protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents the first systematic study of scaling laws for cross-encoder rerankers across pointwise, pairwise, and listwise objectives. It demonstrates that ranking quality follows predictable power-law relationships with model size and training exposure, fits these laws to models up to 150M parameters, and uses them to forecast performance for 400M and 1B parameter models on MSMARCO-dev and TREC DL. The work also derives compute-allocation rules from the joint scaling law, compares them to equal-compute checkpoints, and reports that data-heavy scaling is often favored depending on the objective, with forecasts described as accurate and conservative.

Significance. If the power-law relationships hold, the results offer practical value for planning large reranker training runs by enabling performance forecasting and compute-optimal allocation decisions. This extends prior scaling-law work on language models and first-stage retrieval to the reranking setting and includes a commitment to release code and evaluation protocols, which supports reproducibility.

major comments (3)
  1. [§4] §4 (Scaling Laws and Fitting): The manuscript reports fitted power-law exponents and coefficients but provides no details on the fitting procedure (e.g., optimization method, regularization, data exclusion rules, or handling of multiple random seeds). This information is load-bearing for the central extrapolation claim to 400M/1B models and for assessing whether the reported accuracy and conservatism of the forecasts can be reproduced.
  2. [§5.2] §5.2 (Forecast Validation): The comparison of derived forecasts against equal-compute checkpoints is a useful independent check, but the paper does not report error bars, confidence intervals on the fitted parameters, or quantitative measures of forecast error (e.g., mean absolute percentage error) for the 400M and 1B extrapolations. Without these, the claim that forecasts are 'accurate and typically conservative' cannot be fully evaluated.
  3. [§3.3] §3.3 (Model Training): The assumption that the observed power-law regime up to 150M parameters continues without saturation or deviation at 400M–1B parameters is central to the forecasting results, yet no additional diagnostic runs, theoretical justification for the functional form, or sensitivity analysis to the fitting range is provided.
minor comments (3)
  1. [§4.1] The notation for the joint scaling law (model size and training exposure) should be introduced with an explicit equation in §4.1 to improve clarity for readers unfamiliar with the exact functional form used.
  2. [Figures 2–4] Figure captions for the scaling plots should include the exact number of data points used in each fit and any excluded runs to allow direct visual assessment of the power-law adherence.
  3. [§2] A brief comparison to existing scaling-law results for first-stage retrieval (e.g., in the related-work section) would help situate the reranker-specific exponents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address the concerns on fitting procedures, validation metrics, and scaling assumptions. Below we respond point-by-point to each major comment.

Point-by-point responses
  1. Referee: [§4] §4 (Scaling Laws and Fitting): The manuscript reports fitted power-law exponents and coefficients but provides no details on the fitting procedure (e.g., optimization method, regularization, data exclusion rules, or handling of multiple random seeds). This information is load-bearing for the central extrapolation claim to 400M/1B models and for assessing whether the reported accuracy and conservatism of the forecasts can be reproduced.

    Authors: We agree that the fitting procedure details are critical for reproducibility. In the revised manuscript, we have expanded §4 with a dedicated paragraph describing: the optimization method (non-linear least-squares on log-log scale via scipy.optimize.curve_fit with default tolerances), absence of regularization, data exclusion rules (removal of non-monotonic loss points and outliers >2σ from the initial fit), and handling of random seeds (all scaling curves are means over three independent runs; we now report standard deviations). These additions directly support evaluation of the extrapolation claims. revision: yes

  2. Referee: [§5.2] §5.2 (Forecast Validation): The comparison of derived forecasts against equal-compute checkpoints is a useful independent check, but the paper does not report error bars, confidence intervals on the fitted parameters, or quantitative measures of forecast error (e.g., mean absolute percentage error) for the 400M and 1B extrapolations. Without these, the claim that forecasts are 'accurate and typically conservative' cannot be fully evaluated.

    Authors: We have revised §5.2 to include the requested quantitative measures. We now report mean absolute percentage error (MAPE) between forecasts and equal-compute checkpoints (MAPE < 4.8% across all settings, confirming accuracy and conservatism). Error bars on all scaling plots reflect ±1 standard deviation across the three random seeds. We also added 95% confidence intervals on the fitted exponents and coefficients, computed via 1000 bootstrap resamples of the observed data points. These changes allow full evaluation of forecast reliability. revision: yes

  3. Referee: [§3.3] §3.3 (Model Training): The assumption that the observed power-law regime up to 150M parameters continues without saturation or deviation at 400M–1B parameters is central to the forecasting results, yet no additional diagnostic runs, theoretical justification for the functional form, or sensitivity analysis to the fitting range is provided.

    Authors: Direct diagnostic runs at 400M–1B scales were not performed due to prohibitive compute costs, which is an inherent limitation of forecasting studies. However, we have added an appendix with sensitivity analysis showing that power-law exponents remain stable (variation < 0.05) when the fitting range is varied from 10M–150M parameters. Theoretical justification is provided by referencing the same functional form's empirical success in language-model scaling (Kaplan et al., 2020) and first-stage retrieval, with a new paragraph in §3.3 discussing why saturation is not expected before 1B parameters based on the observed trends and prior literature. revision: partial
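The fitting recipe the simulated rebuttal describes (least squares on log-log scale, exclusion of points beyond 2σ from an initial fit, then a refit) is mechanically simple. A synthetic sketch with illustrative data and one planted outlier:

```python
# Toy reproduction of the described recipe: log-log least squares, drop
# points whose residual exceeds 2 sigma, refit on the kept points.
# Data and constants are synthetic; only the recipe mirrors the rebuttal.
import numpy as np
from scipy.optimize import curve_fit

def log_model(log_n, log_a, alpha):
    return log_a - alpha * log_n    # log y = log a - alpha * log n

rng = np.random.default_rng(1)
n = np.geomspace(1e7, 1.5e8, 10)                      # model sizes up to 150M
y = 3.0 * n**-0.28 * np.exp(rng.normal(0, 0.02, 10))  # noisy power law
y[4] *= 1.8                                           # one outlier run

# Initial fit, residual-based exclusion, refit on the kept points.
popt, _ = curve_fit(log_model, np.log(n), np.log(y), p0=[1.0, 0.3])
resid = np.log(y) - log_model(np.log(n), *popt)
keep = np.abs(resid) < 2 * resid.std()
popt2, _ = curve_fit(log_model, np.log(n[keep]), np.log(y[keep]), p0=popt)
```

Averaging over seeds, as the rebuttal adds, would replace each `y` value with the mean of independent runs before this step.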

Circularity Check

0 steps flagged

Fitted scaling exponents enable extrapolation but are cross-checked against independent equal-compute checkpoints

Full rationale

The paper fits power-law relationships to ranking metrics obtained from models up to 150M parameters and uses the resulting functional form to forecast performance at 400M and 1B scales. These forecasts are then compared against actual equal-compute training checkpoints, supplying an external benchmark that is not itself part of the fitting procedure. No self-definitional equations, load-bearing self-citations, or uniqueness theorems imported from prior author work appear in the abstract or described derivation. The central claim therefore retains independent empirical content beyond the fitted inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on empirical power-law fits to experimental data collected from models up to 150M parameters; the extrapolation step assumes the functional form remains valid at larger scales.

free parameters (1)
  • power-law exponents and coefficients
    Fitted jointly to model-size and data-volume runs for each objective and benchmark.
axioms (1)
  • domain assumption: Ranking quality obeys a joint power-law dependence on model size and training tokens.
    Invoked to enable forecasting beyond the largest trained model.

pith-pipeline@v0.9.0 · 5463 in / 1190 out tokens · 48871 ms · 2026-05-15T16:05:09.127408+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. 2023. Scaling laws for generative mixed-modal language models. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML ’23). JMLR.org, Article 13, 15 pages.

  2. [2]

    Bing Image Search Relevance Team. 2018. Internet-Scale Deep Learning for Bing Image Search. Bing Blogs: Search Quality Insights (2018). https://blogs.bing.com/search-quality-insights/May-2018/Internet-Scale-Deep-Learning-for-Bing-Image-Search

  3. [4]

    Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (Bonn, Germany) (ICML ’05). Association for Computing Machinery, New York, NY, USA, 89–96. doi:10.1145/1102351.1102363

  4. [5]

    Z. Cai et al. 2025. Exploring Training and Inference Scaling Laws in Generative Retrieval. arXiv preprint (2025).

  5. [6]

    Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (Corvalis, Oregon, USA) (ICML ’07). Association for Computing Machinery, New York, NY, USA, 129–136. doi:10.1145/1273496.1273513

  6. [7]

    Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji. 2025. Scaling Laws for Predicting Downstream Performance in LLMs. arXiv:2410.08527 [cs.CL] https://arxiv.org/abs/2410.08527

  7. [8]

    Corinna Cortes, L. D. Jackel, Sara Solla, Vladimir Vapnik, and John Denker. 1993. Learning Curves: Asymptotic Values and Rate of Convergence. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), Vol. 6. Morgan-Kaufmann. https://proceedings.neurips.cc/paper_files/paper/1993/file/1aa48fc4880bb0c9b8a3bf979d3b91...

  8. [9]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin

  9. [10]

    Overview of the TREC 2021 Deep Learning Track. https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf

  10. [11]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf

  11. [12]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv:2003.07820 [cs.IR] https://arxiv.org/abs/2003.07820

  12. [13]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2025. Overview of the TREC 2023 deep learning track. arXiv:2507.08890 [cs.IR] https://arxiv.org/abs/2507.08890

  13. [14]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL] https://arxiv.org/abs/1810.04805

  14. [15]

    Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, and Yiqun Liu. 2024. Scaling Laws For Dense Retrieval. arXiv:2403.18684 [cs.IR] https://arxiv.org/abs/2403.18684

  15. [16]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

  16. [17]

    Sebastian Hofstätter, Bhaskar Mitra, Hamed Zamani, Nick Craswell, and Allan Hanbury. 2021. Intra-document cascading: Learning to select passages for neural document ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1349–1358.

  17. [18]

    Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. arXiv:2104.06967 [cs.IR] https://arxiv.org/abs/2104.06967

  18. [19]

    Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. Retrieve, read, rerank: Towards end-to-end multi-document reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2285–2295.

  19. [20]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. doi:10.1145/582415.582418

  20. [21]

    Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Alberta, Canada) (KDD ’02). Association for Computing Machinery, New York, NY, USA, 133–142. doi:10.1145/775047.775067

  21. [22]

    Caleb Johnson. 2025. Building the next generation of job search at LinkedIn. LinkedIn Engineering Blog (2025). https://www.linkedin.com/blog/engineering/ai/building-the-next-generation-of-job-search-at-linkedin

  22. [23]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https://arxiv.org/abs/2001.08361

  23. [24]

    Julian Killingback, Mahta Rafiee, Madine Manas, and Hamed Zamani

  24. [25]

    Scaling Laws for Embedding Dimension in Information Retrieval. arXiv:2602.05062 [cs.IR] https://arxiv.org/abs/2602.05062

  25. [26]

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. 2025. Pre-training under infinite compute. arXiv:2509.14786 [cs.LG] https://arxiv.org/abs/2509.14786

  26. [27]

    Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (March 2009), 225–331. doi:10.1561/1500000016

  27. [28]

    Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2020.emnlp-main.341

  28. [29]

    Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How Deep is Your Learning: The DL-HARD Annotated Deep Learning Dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). ACM, 2335–2341. doi:10.1145/3404835.3463262

  29. [30]

    Divya Nagar, Zheng Liu, Jiasen Xu, Bo Ling, and Haoyang Chen. 2025. Evolution and Scale of Uber’s Delivery Search Platform. Uber Engineering Blog (2025). https://www.uber.com/blog/evolution-and-scale-of-ubers-delivery-search-platform/

  30. [31]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. 2021. Large Dual Encoders Are Generalizable Retrievers. arXiv:2112.07899 [cs.IR] https://arxiv.org/abs/2112.07899

  31. [32]

    Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).

  32. [33]

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. arXiv:2003.06713 [cs.IR] https://arxiv.org/abs/2003.06713

  33. [34]

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. British Library Research and Development Department.

  34. [35]

    Y. Shao et al. 2024. Scaling Retrieval Augmented Language Models with a Trillion Token Datastore. arXiv preprint arXiv:2407.12854 (2024).

  35. [36]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 [cs.IR] https://arxiv.org/abs/2104.08663

  36. [37]

    Vladislav Vorotilov and Ilnur Shugaepov. 2023. Scaling the Instagram Explore recommendations system. Meta Engineering Blog (2023). https://engineering.fb.com/2023/08/09/ml-applications/scaling-instagram-explore-recommendations-system/

  37. [38]

    Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 105–114.

  38. [39]

    Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, and Benjamin Van Durme. 2025. Seq vs Seq: An Open Suite of Paired Encoders and Decoders. arXiv:2507.11412 [cs.CL] https://arxiv.org/abs/2507.11412

  39. [40]

    X Engineering Blog. 2023. Twitter’s Recommendation Algorithm. https://blog.x.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm

  40. [41]

    Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) (ICML ’08). Association for Computing Machinery, New York, NY, USA, 1192–1199. doi:10.1145/1390156.1390306

  41. [42]

    Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, and Chenggang Li. 2025. Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective. arXiv:2502.17262 [cs.CL] https://arxiv.org/abs/2502.17262

  42. [43]

    X. Zeng et al. 2025. Scaling Sparse and Dense Retrieval in Decoder Only Language Models. arXiv preprint (2025).

  43. [44]

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12104–12113.

A Appendix. Table 8 reports the observed values, point forecasts, and 95% bootstrap confidence intervals for the final-checkpoint joint-law predict…