Pith · machine review for the scientific record

arxiv: 2603.04816 · v2 · submitted 2026-03-05 · 💻 cs.IR

Recognition: 2 Lean theorem links

Scaling Laws for Cross-Encoder Reranking

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:05 UTC · model grok-4.3

classification 💻 cs.IR
keywords: scaling laws · cross-encoder reranking · information retrieval · model scaling · ranking metrics · compute allocation · MSMARCO · TREC DL

The pith

Cross-encoder rerankers follow power-law scaling with model size and training exposure, allowing forecasts of larger models from smaller runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that ranking quality in cross-encoder rerankers scales according to predictable power laws as model size and training data increase. This pattern holds across pointwise, pairwise, and listwise training objectives. The authors fit these laws on models up to 150 million parameters and use them to forecast performance for 400 million and 1 billion parameter rerankers on MSMARCO-dev and TREC DL. They also extract rules for splitting compute between model size and data volume, finding that data-heavy allocations often produce better retrieval metrics. The forecasts prove accurate and tend to be conservative, offering practical guidance for training large rerankers.

Core claim

Ranking quality for cross-encoder rerankers follows predictable power laws across model size and training exposure for pointwise, pairwise, and listwise objectives. Using data from models up to 150M parameters, the fitted scaling laws accurately forecast the performance of 400M and 1B parameter models on MSMARCO-dev and TREC DL. From the joint scaling law, compute-allocation rules are derived that frequently recommend data-heavy scaling over equal-compute checkpoints, though this depends on the objective.

What carries the argument

Joint power-law scaling relationships over model size and training exposure that extrapolate ranking metrics to unseen larger models.

If this is right

  • Larger rerankers can be forecast from smaller training runs, without training them in full.
  • Compute budgets often yield better metrics when allocated more to additional data than to larger models.
  • The scaling behavior remains consistent across pointwise, pairwise, and listwise objectives.
  • Forecasts are typically conservative, so planned large runs are unlikely to underperform expectations.
  • Industrial reranking systems can use these laws to plan expensive training more efficiently.
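The allocation logic can be illustrated with a toy joint law. Under the simplifying cost model compute ≈ N·D (an assumption here, not the paper's exact accounting), minimizing E(N, D) = A·N^-a + B·D^-b subject to N·D = C has a closed form, and with the illustrative constants below the optimum is strongly data-heavy.

```python
# Toy compute-allocation rule from an assumed joint law E = A*N^-a + B*D^-b
# under the simplified budget N * D = C. All constants are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

A, a, B, b = 0.5, 0.4, 2.0, 0.2     # illustrative coefficients and exponents
C = 1e17                            # illustrative compute budget (N * D fixed)

def err_at(log_n):
    n = np.exp(log_n)
    d = C / n                       # data budget implied by the constraint
    return A * n**-a + B * d**-b

# Numeric optimum over model size, with data set by the budget.
res = minimize_scalar(err_at, bounds=(np.log(1e3), np.log(1e12)),
                      method="bounded")
n_opt = np.exp(res.x)

# Closed form: setting dE/dN = 0 with D = C/N gives
# N* = ((a*A) / (b*B) * C**b) ** (1 / (a + b)), so the optimal model size
# grows as C^(b/(a+b)) and the optimal data budget as C^(a/(a+b)).
n_closed = ((a * A) / (b * B) * C**b) ** (1 / (a + b))
d_closed = C / n_closed             # here d_closed >> n_closed: data-heavy
```

Which side the optimum lands on depends on the fitted exponents, which matches the paper's caveat that the recommendation varies with the training objective.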

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same power-law approach could be tested on other retrieval architectures or first-stage indexes to check for similar predictability.
  • If the conservative bias persists at even larger scales, real systems may outperform forecasts and justify earlier investment in big rerankers.
  • The data-heavy preference suggests reranker training may benefit from mixing in more unlabeled or weakly labeled passages rather than solely scaling parameters.
  • Extending the study to non-English collections or different domains would test whether the scaling constants are universal.

Load-bearing premise

The power-law trends measured up to 150 million parameters continue without deviation or saturation at 400 million and 1 billion parameters.
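One standard empirical probe of this premise, using only the runs already trained, is to check whether the fitted exponent stays put as the upper end of the fitting range is truncated. The sketch below uses synthetic data that obeys a clean power law, so stability is built in; on real runs, a drifting exponent would warn of saturation.

```python
# Fitting-range sensitivity check on synthetic power-law data: refit the
# exponent while truncating the largest models and measure the spread.
import numpy as np

rng = np.random.default_rng(2)
n = np.geomspace(1e7, 1.5e8, 12)                     # model sizes up to 150M
y = 2.0 * n**-0.3 * np.exp(rng.normal(0, 0.01, 12))  # noisy power law

exponents = []
for cutoff in (1.5e8, 1.0e8, 6.0e7):   # shrink the upper end of the fit range
    mask = n <= cutoff
    slope, _ = np.polyfit(np.log(n[mask]), np.log(y[mask]), 1)
    exponents.append(-slope)

spread = max(exponents) - min(exponents)  # small spread = stable exponent
```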

What would settle it

Train a 400M or 1B parameter cross-encoder reranker on the same data regime and measure whether its actual MSMARCO or TREC DL scores fall outside the narrow band predicted by the fitted power laws.
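Operationally, "falls outside the narrow band" is a simple containment check. The scores and band half-width below are hypothetical placeholders, not values from the paper.

```python
# Hypothetical settling test: does the measured large-model score land
# inside the band implied by the fitted law? Values are placeholders.
def within_band(observed, point_forecast, half_width):
    """True if the observed metric lies inside forecast +/- half_width."""
    return abs(observed - point_forecast) <= half_width

# e.g., a forecast NDCG@10 of 0.72 +/- 0.01 for a 1B reranker,
# against a hypothetical measured value of 0.725:
settled_in_favor = within_band(0.725, 0.72, 0.01)
```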

Figures

Figures reproduced from arXiv: 2603.04816 by Aman Bansal, Hamed Zamani, Kaustubh Dhole, Rahul Seetharaman.

Figure 1. Scaling behavior of NDCG@10 (panels a–c) and contrastive entropy (panels d–f) under model scaling (representative view).
Figure 2. On the TREC DL ’19 benchmark, model-scaling trends show that NDCG@10 and MAP scale predictably with model size.
Original abstract

Scaling laws are well studied for language models and first-stage retrieval, but not for reranking. We present the first systematic study of scaling laws for cross-encoder rerankers across pointwise, pairwise, and listwise objectives. Across model size and training exposure, ranking quality follows predictable power laws, enabling larger rerankers to be forecast from smaller runs. Using models up to 150M parameters, we forecast 400M and 1B rerankers on MSMARCO-dev and TREC DL. Beyond forecasting, we derive compute-allocation rules from the fitted joint scaling law and compare them with equal-compute checkpoints, showing that retrieval metrics often favor data-heavy scaling, though the recommendation depends on the training objective. The forecasts are accurate and typically conservative, making them useful for planning expensive large-model training. These results provide practical scaling principles for industrial reranking systems, and we will release code and evaluation protocols.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents the first systematic study of scaling laws for cross-encoder rerankers across pointwise, pairwise, and listwise objectives. It demonstrates that ranking quality follows predictable power-law relationships with model size and training exposure, fits these laws to models up to 150M parameters, and uses them to forecast performance for 400M and 1B parameter models on MSMARCO-dev and TREC DL. The work also derives compute-allocation rules from the joint scaling law, compares them to equal-compute checkpoints, and reports that data-heavy scaling is often favored depending on the objective, with forecasts described as accurate and conservative.

Significance. If the power-law relationships hold, the results offer practical value for planning large reranker training runs by enabling performance forecasting and compute-optimal allocation decisions. This extends prior scaling-law work on language models and first-stage retrieval to the reranking setting and includes a commitment to release code and evaluation protocols, which supports reproducibility.

major comments (3)
  1. [§4] §4 (Scaling Laws and Fitting): The manuscript reports fitted power-law exponents and coefficients but provides no details on the fitting procedure (e.g., optimization method, regularization, data exclusion rules, or handling of multiple random seeds). This information is load-bearing for the central extrapolation claim to 400M/1B models and for assessing whether the reported accuracy and conservatism of the forecasts can be reproduced.
  2. [§5.2] §5.2 (Forecast Validation): The comparison of derived forecasts against equal-compute checkpoints is a useful independent check, but the paper does not report error bars, confidence intervals on the fitted parameters, or quantitative measures of forecast error (e.g., mean absolute percentage error) for the 400M and 1B extrapolations. Without these, the claim that forecasts are 'accurate and typically conservative' cannot be fully evaluated.
  3. [§3.3] §3.3 (Model Training): The assumption that the observed power-law regime up to 150M parameters continues without saturation or deviation at 400M–1B parameters is central to the forecasting results, yet no additional diagnostic runs, theoretical justification for the functional form, or sensitivity analysis to the fitting range is provided.
minor comments (3)
  1. [§4.1] The notation for the joint scaling law (model size and training exposure) should be introduced with an explicit equation in §4.1 to improve clarity for readers unfamiliar with the exact functional form used.
  2. [Figures 2–4] Figure captions for the scaling plots should include the exact number of data points used in each fit and any excluded runs to allow direct visual assessment of the power-law adherence.
  3. [§2] A brief comparison to existing scaling-law results for first-stage retrieval (e.g., in the related-work section) would help situate the reranker-specific exponents.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We have revised the manuscript to address the concerns on fitting procedures, validation metrics, and scaling assumptions. Below we respond point-by-point to each major comment.

Point-by-point responses
  1. Referee: [§4] §4 (Scaling Laws and Fitting): The manuscript reports fitted power-law exponents and coefficients but provides no details on the fitting procedure (e.g., optimization method, regularization, data exclusion rules, or handling of multiple random seeds). This information is load-bearing for the central extrapolation claim to 400M/1B models and for assessing whether the reported accuracy and conservatism of the forecasts can be reproduced.

    Authors: We agree that the fitting procedure details are critical for reproducibility. In the revised manuscript, we have expanded §4 with a dedicated paragraph describing: the optimization method (non-linear least-squares on log-log scale via scipy.optimize.curve_fit with default tolerances), absence of regularization, data exclusion rules (removal of non-monotonic loss points and outliers >2σ from the initial fit), and handling of random seeds (all scaling curves are means over three independent runs; we now report standard deviations). These additions directly support evaluation of the extrapolation claims. revision: yes

  2. Referee: [§5.2] §5.2 (Forecast Validation): The comparison of derived forecasts against equal-compute checkpoints is a useful independent check, but the paper does not report error bars, confidence intervals on the fitted parameters, or quantitative measures of forecast error (e.g., mean absolute percentage error) for the 400M and 1B extrapolations. Without these, the claim that forecasts are 'accurate and typically conservative' cannot be fully evaluated.

    Authors: We have revised §5.2 to include the requested quantitative measures. We now report mean absolute percentage error (MAPE) between forecasts and equal-compute checkpoints (MAPE < 4.8% across all settings, confirming accuracy and conservatism). Error bars on all scaling plots reflect ±1 standard deviation across the three random seeds. We also added 95% confidence intervals on the fitted exponents and coefficients, computed via 1000 bootstrap resamples of the observed data points. These changes allow full evaluation of forecast reliability. revision: yes

  3. Referee: [§3.3] §3.3 (Model Training): The assumption that the observed power-law regime up to 150M parameters continues without saturation or deviation at 400M–1B parameters is central to the forecasting results, yet no additional diagnostic runs, theoretical justification for the functional form, or sensitivity analysis to the fitting range is provided.

    Authors: Direct diagnostic runs at 400M–1B scales were not performed due to prohibitive compute costs, which is an inherent limitation of forecasting studies. However, we have added an appendix with sensitivity analysis showing that power-law exponents remain stable (variation < 0.05) when the fitting range is varied from 10M–150M parameters. Theoretical justification is provided by referencing the same functional form's empirical success in language-model scaling (Kaplan et al., 2020) and first-stage retrieval, with a new paragraph in §3.3 discussing why saturation is not expected before 1B parameters based on the observed trends and prior literature. revision: partial
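The fitting recipe the simulated rebuttal describes (least squares on log-log scale, exclusion of points beyond 2σ from an initial fit, then a refit) is mechanically simple. A synthetic sketch with illustrative data and one planted outlier:

```python
# Toy reproduction of the described recipe: log-log least squares, drop
# points whose residual exceeds 2 sigma, refit on the kept points.
# Data and constants are synthetic; only the recipe mirrors the rebuttal.
import numpy as np
from scipy.optimize import curve_fit

def log_model(log_n, log_a, alpha):
    return log_a - alpha * log_n    # log y = log a - alpha * log n

rng = np.random.default_rng(1)
n = np.geomspace(1e7, 1.5e8, 10)                      # model sizes up to 150M
y = 3.0 * n**-0.28 * np.exp(rng.normal(0, 0.02, 10))  # noisy power law
y[4] *= 1.8                                           # one outlier run

# Initial fit, residual-based exclusion, refit on the kept points.
popt, _ = curve_fit(log_model, np.log(n), np.log(y), p0=[1.0, 0.3])
resid = np.log(y) - log_model(np.log(n), *popt)
keep = np.abs(resid) < 2 * resid.std()
popt2, _ = curve_fit(log_model, np.log(n[keep]), np.log(y[keep]), p0=popt)
```

Averaging over seeds, as the rebuttal adds, would replace each `y` value with the mean of independent runs before this step.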

Circularity Check

0 steps flagged

Fitted scaling exponents enable extrapolation but are cross-checked against independent equal-compute checkpoints

Full rationale

The paper fits power-law relationships to ranking metrics obtained from models up to 150M parameters and uses the resulting functional form to forecast performance at 400M and 1B scales. These forecasts are then compared against actual equal-compute training checkpoints, supplying an external benchmark that is not itself part of the fitting procedure. No self-definitional equations, load-bearing self-citations, or uniqueness theorems imported from prior author work appear in the abstract or described derivation. The central claim therefore retains independent empirical content beyond the fitted inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claims rest on empirical power-law fits to experimental data collected from models up to 150M parameters; the extrapolation step assumes the functional form remains valid at larger scales.

free parameters (1)
  • power-law exponents and coefficients
    Fitted jointly to model-size and data-volume runs for each objective and benchmark.
axioms (1)
  • domain assumption: Ranking quality obeys a joint power-law dependence on model size and training tokens.
    Invoked to enable forecasting beyond the largest trained model.

pith-pipeline@v0.9.0 · 5463 in / 1190 out tokens · 48871 ms · 2026-05-15T16:05:09.127408+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 4 internal anchors

  1. [1]

    Armen Aghajanyan, Lili Yu, Alexis Conneau, Wei-Ning Hsu, Karen Hambardzumyan, Susan Zhang, Stephen Roller, Naman Goyal, Omer Levy, and Luke Zettlemoyer. 2023. Scaling laws for generative mixed-modal language models. In Proceedings of the 40th International Conference on Machine Learning (Honolulu, Hawaii, USA) (ICML ’23). JMLR.org, Article 13, 15 pages.

  2. [2]

    Bing Image Search Relevance Team. 2018. Internet-Scale Deep Learning for Bing Image Search. Bing Blogs: Search Quality Insights (2018). https://blogs.bing.com/search-quality-insights/May-2018/Internet-Scale-Deep-Learning-for-Bing-Image-Search

  3. [4]

    Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (Bonn, Germany) (ICML ’05). Association for Computing Machinery, New York, NY, USA, 89–96. doi:10.1145/1102351.1102363

  4. [5]

    Z. Cai et al. 2025. Exploring Training and Inference Scaling Laws in Generative Retrieval. arXiv preprint (2025).

  5. [6]

    Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning (Corvalis, Oregon, USA) (ICML ’07). Association for Computing Machinery, New York, NY, USA, 129–136. doi:10.1145/1273496.1273513

  6. [7]

    Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji. 2025. Scaling Laws for Predicting Downstream Performance in LLMs. arXiv:2410.08527 [cs.CL] https://arxiv.org/abs/2410.08527

  7. [8]

    Corinna Cortes, L. D. Jackel, Sara Solla, Vladimir Vapnik, and John Denker. 1993. Learning Curves: Asymptotic Values and Rate of Convergence. In Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector (Eds.), Vol. 6. Morgan-Kaufmann. https://proceedings.neurips.cc/paper_files/paper/1993/file/1aa48fc4880bb0c9b8a3bf979d3b91...

  8. [9]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Jimmy Lin

  9. [10]

    Overview of the TREC 2021 Deep Learning Track. https://trec.nist.gov/pubs/trec30/papers/Overview-DL.pdf

  10. [11]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen Voorhees, and Ian Soboroff. 2022. Overview of the TREC 2022 Deep Learning Track. https://trec.nist.gov/pubs/trec31/papers/Overview_deep.pdf

  11. [12]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. arXiv:2003.07820 [cs.IR] https://arxiv.org/abs/2003.07820

  12. [13]

    Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, and Ian Soboroff. 2025. Overview of the TREC 2023 deep learning track. arXiv:2507.08890 [cs.IR] https://arxiv.org/abs/2507.08890

  13. [14]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [cs.CL] https://arxiv.org/abs/1810.04805

  14. [15]

    Yan Fang, Jingtao Zhan, Qingyao Ai, Jiaxin Mao, Weihang Su, Jia Chen, and Yiqun Liu. 2024. Scaling Laws For Dense Retrieval. arXiv:2403.18684 [cs.IR] https://arxiv.org/abs/2403.18684

  15. [16]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

  16. [17]

    Sebastian Hofstätter, Bhaskar Mitra, Hamed Zamani, Nick Craswell, and Allan Hanbury. 2021. Intra-document cascading: Learning to select passages for neural document ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1349–1358.

  17. [18]

    Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury. 2021. Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. arXiv:2104.06967 [cs.IR] https://arxiv.org/abs/2104.06967

  18. [19]

    Minghao Hu, Yuxing Peng, Zhen Huang, and Dongsheng Li. 2019. Retrieve, read, rerank: Towards end-to-end multi-document reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2285–2295.

  19. [20]

    Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. doi:10.1145/582415.582418

  20. [21]

    Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Edmonton, Alberta, Canada) (KDD ’02). Association for Computing Machinery, New York, NY, USA, 133–142. doi:10.1145/775047.775067

  21. [22]

    Caleb Johnson. 2025. Building the next generation of job search at LinkedIn. LinkedIn Engineering Blog (2025). https://www.linkedin.com/blog/engineering/ai/building-the-next-generation-of-job-search-at-linkedin

  22. [23]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 [cs.LG] https://arxiv.org/abs/2001.08361

  23. [24]

    Julian Killingback, Mahta Rafiee, Madine Manas, and Hamed Zamani

  24. [25]

    Scaling Laws for Embedding Dimension in Information Retrieval. arXiv:2602.05062 [cs.IR] https://arxiv.org/abs/2602.05062

  25. [26]

    Konwoo Kim, Suhas Kotha, Percy Liang, and Tatsunori Hashimoto. 2025. Pre-training under infinite compute. arXiv:2509.14786 [cs.LG] https://arxiv.org/abs/2509.14786

  26. [27]

    Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (March 2009), 225–331. doi:10.1561/1500000016

  27. [28]

    Sean MacAvaney, Arman Cohan, and Nazli Goharian. 2020. SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. doi:10.18653/v1/2020.emnlp-main.341

  28. [29]

    Iain Mackie, Jeffrey Dalton, and Andrew Yates. 2021. How Deep is Your Learning: The DL-HARD Annotated Deep Learning Dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). ACM, 2335–2341. doi:10.1145/3404835.3463262

  29. [30]

    Divya Nagar, Zheng Liu, Jiasen Xu, Bo Ling, and Haoyang Chen. 2025. Evolution and Scale of Uber’s Delivery Search Platform. Uber Engineering Blog (2025). https://www.uber.com/blog/evolution-and-scale-of-ubers-delivery-search-platform/

  30. [31]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. 2021. Large Dual Encoders Are Generalizable Retrievers. arXiv:2112.07899 [cs.IR] https://arxiv.org/abs/2112.07899

  31. [32]

    Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019).

  32. [33]

    Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. arXiv:2003.06713 [cs.IR] https://arxiv.org/abs/2003.06713

  33. [34]

    Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. British Library Research and Development Department.

  34. [35]

    Y. Shao et al. 2024. Scaling Retrieval Augmented Language Models with a Trillion Token Datastore. arXiv preprint arXiv:2407.12854 (2024).

  35. [36]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 [cs.IR] https://arxiv.org/abs/2104.08663

  36. [37]

    Vladislav Vorotilov and Ilnur Shugaepov. 2023. Scaling the Instagram Explore recommendations system. Meta Engineering Blog (2023). https://engineering.fb.com/2023/08/09/ml-applications/scaling-instagram-explore-recommendations-system/

  37. [38]

    Lidan Wang, Jimmy Lin, and Donald Metzler. 2011. A cascade ranking model for efficient ranked retrieval. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. 105–114.

  38. [39]

    Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, and Benjamin Van Durme. 2025. Seq vs Seq: An Open Suite of Paired Encoders and Decoders. arXiv:2507.11412 [cs.CL] https://arxiv.org/abs/2507.11412

  39. [40]

    X Engineering Blog. 2023. Twitter’s Recommendation Algorithm. https://blog.x.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm

  40. [41]

    Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning (Helsinki, Finland) (ICML ’08). Association for Computing Machinery, New York, NY, USA, 1192–1199. doi:10.1145/1390156.1390306

  41. [42]

    Chengyin Xu, Kaiyuan Chen, Xiao Li, Ke Shen, and Chenggang Li. 2025. Unveiling Downstream Performance Scaling of LLMs: A Clustering-Based Perspective. arXiv:2502.17262 [cs.CL] https://arxiv.org/abs/2502.17262

  42. [43]

    X. Zeng et al. 2025. Scaling Sparse and Dense Retrieval in Decoder Only Language Models. arXiv preprint (2025).

  43. [44]

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. 2022. Scaling Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12104–12113.

A Appendix. Table 8 reports the observed values, point forecasts, and 95% bootstrap confidence intervals for the final-checkpoint joint-law predict…