arxiv: 2605.18607 · v1 · pith:XYKCYTCWnew · submitted 2026-05-18 · 💻 cs.CL · cs.LG

Forecasting Downstream Performance of LLMs With Proxy Metrics

Pith reviewed 2026-05-20 10:26 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords LLM evaluation · performance forecasting · proxy metrics · token statistics · model selection · pretraining data · expert trajectories

The pith

Proxy metrics from token statistics on expert solutions forecast LLM downstream performance more reliably than loss or compute baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that simple aggregates of token-level prediction statistics can serve as effective proxies for how well a language model will perform on downstream tasks. These statistics come from running the model on expert-written solutions to problems and include measures like the entropy of its output distribution, how often the correct token appears in its top predictions, and the rank it assigns to the expert's choice. The authors demonstrate that these proxies give stronger rankings than cross-entropy loss in three practical settings: choosing among different model families, selecting pretraining data, and predicting accuracy gains as training continues. This approach matters because it supports cheaper and earlier decisions during model development without needing to run full expensive evaluations on every candidate.

Core claim

Proxy metrics built by aggregating token-level statistics such as entropy, top-k accuracy, and expert token rank from a candidate model's next-token distribution over expert-written solutions provide more accurate forecasts of downstream performance than loss- or compute-based alternatives. This holds for ranking heterogeneous reasoning models, selecting among pretraining corpora at greatly reduced cost, and extrapolating accuracy over extended training horizons.

What carries the argument

Proxy metrics formed by aggregating entropy, top-k accuracy, and expert token rank computed from next-token predictions over expert-written solutions.
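
As a rough illustration of the three statistics named above, the following Python sketch computes per-token entropy, top-k hits, and expert-token rank from a matrix of next-token logits, then averages them over one expert solution. The function names, the value of k, the log-rank transform, and the plain mean used for aggregation are assumptions made for this example; this page does not state the paper's exact feature set or aggregation rule.

```python
import numpy as np

def token_statistics(logits, expert_ids, k=5):
    """Per-token statistics of a model's next-token distribution.

    logits:     (T, V) array, one row of vocabulary logits per position
                of an expert-written solution.
    expert_ids: (T,) array, the token the expert actually wrote there.
    """
    logits = np.asarray(logits, dtype=np.float64)
    expert_ids = np.asarray(expert_ids)
    positions = np.arange(len(expert_ids))

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)

    entropy = -(probs * log_probs).sum(axis=-1)
    expert_logit = logits[positions, expert_ids]
    # Rank 1 means the expert token is the model's top prediction.
    # Ties are resolved in the model's favour (assumption).
    rank = 1 + (logits > expert_logit[:, None]).sum(axis=-1)
    topk_hit = (rank <= k).astype(np.float64)
    nll = -log_probs[positions, expert_ids]  # per-token cross-entropy
    return {"entropy": entropy, "topk_hit": topk_hit, "rank": rank, "nll": nll}

def aggregate(stats):
    """Collapse per-token statistics into scalar features (plain means, an assumption)."""
    return {
        "mean_entropy": float(stats["entropy"].mean()),
        "topk_accuracy": float(stats["topk_hit"].mean()),
        "mean_log_rank": float(np.log(stats["rank"]).mean()),
        "cross_entropy": float(stats["nll"].mean()),
    }

# Toy input: three positions, vocabulary of four tokens (hypothetical values).
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [0.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 1.0, 2.0]])
expert_ids = np.array([0, 2, 3])
features = aggregate(token_statistics(logits, expert_ids, k=2))
print(features)
```

The cross-entropy is returned alongside the other features only so that a loss baseline can be computed on the same expert solutions.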

If this is right

  • Different model families can be ranked for reasoning capability with substantially higher correlation to actual downstream results.
  • Pretraining corpora can be screened for a target model using far less compute than direct evaluation while still identifying strong candidates.
  • Downstream accuracy trends can be projected over large increases in training compute with lower error than prior methods.
  • Model development decisions become feasible at earlier training stages or with smaller evaluation budgets.
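
The extrapolation bullet can be made concrete with a small curve-fitting sketch. It fits a pure power law y = a·x^b in log-log space on an early training window and evaluates the fit at a much later held-out step. The 80K-step window and the 1.4M-step held-out checkpoint echo the figure captions on this page; the synthetic data, the function names, and the pure power-law form are assumptions, since the paper's exact parameterization and fitting procedure are not given here.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space (needs x > 0 and y > 0)."""
    b, log_a = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_a)), float(b)

def rmse(pred, target):
    """Root-mean-square error between predictions and targets."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Synthetic trajectory generated from a known power law (not real measurements).
true_a, true_b = 3.0, -0.25
train_steps = np.array([1e4, 2e4, 4e4, 8e4])   # training window, up to 80K steps
train_metric = true_a * train_steps ** true_b

a, b = fit_power_law(train_steps, train_metric)

heldout_step = 1.4e6                            # roughly 18x past the window
predicted = a * heldout_step ** b
actual = true_a * heldout_step ** true_b
error = rmse([predicted], [actual])
print(a, b, error)
```

Because the toy data are noiseless, the fit recovers the generating parameters almost exactly; with real checkpoints the held-out error is the quantity of interest.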

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the proxies prove stable across more domains, they could reduce reliance on full downstream benchmarks for routine model comparisons.
  • The same token statistics might be adapted to forecast performance in non-language settings such as code generation or multimodal tasks.
  • Layering these proxies with existing signals like loss curves could produce even tighter forecasts for very early training stages.

Load-bearing premise

Token-level statistics on expert-written solutions are representative enough of target downstream tasks that their aggregation reveals capability differences missed by loss.

What would settle it

Running the proxies on a fresh collection of models and tasks and finding Spearman correlations no higher than those from loss, or observing large errors when extrapolating across long training runs.
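
The test described above comes down to comparing two rank correlations against the same ground-truth accuracies. A minimal sketch, assuming five hypothetical models with made-up accuracies, proxy scores, and losses (none of these numbers come from the paper), might look like this:

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

# Hypothetical values for five candidate models.
accuracy = np.array([0.31, 0.42, 0.48, 0.55, 0.63])
proxy_score = np.array([0.20, 0.35, 0.33, 0.50, 0.61])
# Loss is negated so that "higher is better" for every signal.
neg_loss = -np.array([2.10, 2.40, 1.90, 2.30, 2.00])

rho_proxy = spearman(proxy_score, accuracy)
rho_loss = spearman(neg_loss, accuracy)
print(f"proxy rho = {rho_proxy:.2f}, loss rho = {rho_loss:.2f}")
```

On a fresh set of models and tasks, a proxy correlation that is no higher than the loss correlation would count against the paper's claim.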

Figures

Figures reproduced from arXiv: 2605.18607 by Arkil Patel, Dzmitry Bahdanau, Marius Mosbach, Siva Reddy.

Figure 1: Left: Ranking models on held-out challenging reasoning tasks (as measured by mean CV Spearman ρ) using our linear RankSVM proxy. Our proxy uses features of the next-token prediction distributions of candidate models over expert reasoning traces. Right: Ranking 25 pretraining corpora for a target 1B LLM on the DataDecide testbed (Magnusson et al., 2025). Each method trains small proxy models (4M–90M) on eac…
Figure 2: An illustration of our method. We use a candidate model’s next-token prediction distribution…
Figure 3: Extrapolating proxy metrics along the training trajectory. Left: pretraining checkpoints of OLMo-3-7B on four reasoning benchmarks. Right: post-training checkpoints of OLMo-3-7B-Think on four reasoning benchmarks. Filled markers are the training window, stars are held-out checkpoints, solid curves are power-law fits from the training window, and dashed curves are extrapolations. The plots for other benchma…
Figure 4: Extrapolating HellaSwag accuracy along the OLMo-3-7B pretraining trajectory. The proxy power-law fit (RMSE = 0.003) tracks the target far more closely than the CE loss exponential (RMSE = 0.09).
Experimental setup. We use pretraining checkpoints of OLMo-3-7B (Olmo et al., 2026) across ten OLMES benchmarks (Gu et al., 2025). For each benchmark we fit a predictor→accuracy curve on checkpoints up to 80,000 s…
Figure 5: Proxy metric selection frequency (normalized) for univariate (…
Figure 6: Likelihood-based baselines against a learned proxy. Left: the three loss-based baselines (FineWeb cross-entropy loss, uniform expert-trajectory cross-entropy loss, and rBridge) plotted against MMLU-Pro accuracy across 18 language models. Low loss is a weak and non-monotonic indicator of downstream ranking across model families and post-training recipes. Right: the learned RankSVM (linear) proxy, evaluated …
Figure 7: Performance of the linear RankSVM proxy as we vary the number of held-out tasks and the…
Figure 8: Performance of the 3-sparse proxy as we vary the number of held-out tasks and the fraction…
Figure 9: Ranking LLMs with the 3-sparse proxy. Downstream accuracy vs. proxy score for each of the six benchmarks on a randomly sampled held-out fold. Same format as…
Figure 10: Pretraining extrapolation on AIME and SuperGPQA. Same protocol as…
Figure 11: Post-training extrapolation on AIME. Same protocol as…
Figure 12: Proxy metric vs. downstream accuracy at post-training checkpoints. The best selected univariate proxy is plotted against downstream accuracy on USACO (left) and HMMT (right) across post-training checkpoints of OLMo-3-7B-Think. The strong monotonic relationship confirms that the extrapolated proxy tracks the ranking of interest.
Figure 13: Direct sigmoid extrapolation of HellaSwag accuracy. Accuracy is fit as a sigmoid of log10(steps) following Owen (2024). Circles are the training window (up to 80K steps), the star is the held-out checkpoint at 1.4M steps. The fit overshoots the held-out accuracy (RMSE = 0.11).
… downstream ranking, and the mean across all five post-training benchmarks (ρ = 0.84, as reported in §6.1) confirms that this corre…
Figure 14: Extrapolating Winogrande accuracy along the pretraining trajectory of OLMo-3-7B. Circles are the training window (up to 80K steps), the star is the held-out checkpoint at 1.4M steps (∼18× the training compute). Left: accuracy vs. log10(steps), fit with a sigmoid (RMSE = 0.02). Centre: accuracy vs. CE loss on FineWeb, fit with an exponential (RMSE = 0.08). Right: accuracy vs. the best univariate proxy, fit…
Figure 15: Extrapolating ARC Challenge accuracy along the pretraining trajectory of OLMo-3-7B. Circles are the training window (up to 80K steps), the star is the held-out checkpoint at 1.4M steps (∼18× the training compute). Left: accuracy vs. log10(steps), fit with a sigmoid (RMSE = 0.07). Centre: accuracy vs. CE loss on FineWeb, fit with an exponential (RMSE = 0.13). Right: accuracy vs. the best univariate proxy,…
Abstract

Progress in language model development is often driven by comparative decisions: which architecture to adopt, which pretraining corpus to use, or which training recipe to apply. Making these decisions well requires reliable performance forecasts, yet the two commonly used signals are fundamentally limited. Cross-entropy loss is poorly aligned with downstream capabilities, and direct downstream evaluation is expensive, sparse, and often uninformative at early training stages. Instead, we propose to construct proxy metrics by aggregating token-level statistics, such as entropy, top-k accuracy, and expert token rank, from a candidate model's next-token distribution over expert-written solutions. Across three settings, our proxies consistently outperform loss- and compute-based baselines: 1) For cross-family model selection, they rank a heterogeneous population of reasoning models with mean Spearman ρ = 0.81 (vs. ρ = 0.36 for cross-entropy loss); 2) For pretraining data selection, they reliably rank 25 candidate corpora for a target model at roughly 10,000× less compute than direct evaluation, pushing the Pareto frontier beyond existing methods; and 3) For training-time forecasting, they extrapolate downstream accuracy across an 18× compute horizon with roughly half the error of existing alternatives. Together, these results suggest that expert trajectories are a broadly useful source of signal for assessing model capabilities, enabling reliable performance forecasting throughout the model development life cycle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes constructing proxy metrics by aggregating token-level statistics (entropy, top-k accuracy, expert token rank) from a candidate model's next-token distribution over expert-written solutions. It evaluates these proxies across three settings, claiming they outperform loss- and compute-based baselines: mean Spearman ρ = 0.81 (vs. 0.36 for loss) for ranking heterogeneous reasoning models; reliable ranking of 25 pretraining corpora at ~10,000× lower compute than direct evaluation; and extrapolation of downstream accuracy over an 18× compute horizon with roughly half the error of alternatives.

Significance. If the empirical results hold under broader validation, the work could meaningfully advance efficient LLM development by reducing reliance on expensive downstream evaluations or unaligned loss signals. The concrete quantitative gains across model selection, data selection, and training forecasting, together with the focus on expert trajectories as an information source, represent a practical contribution that could influence workflows if the proxies prove robust beyond the reported settings.

major comments (2)
  1. [§3] §3 (Proxy construction): The central claim requires that the aggregated token statistics capture capability differences orthogonal to cross-entropy loss. The manuscript provides no ablation or partial-correlation analysis showing that the combination of entropy, top-k accuracy, and expert token rank supplies signal independent of loss when both are computed on the same expert solutions; without this, the reported outperformance in all three settings could reduce to distributional matching rather than genuine capability forecasting.
  2. [§4.1] §4.1 (Cross-family model selection): The mean Spearman ρ = 0.81 result is load-bearing for the first claim. The description does not specify the exact number of models, the precise downstream tasks used for ground-truth ranking, or any statistical significance test against the loss baseline of ρ = 0.36, making it impossible to judge whether the improvement generalizes or is tied to the particular expert-solution distribution.
minor comments (2)
  1. [Abstract] Abstract: The phrases 'roughly 10,000× less compute' and 'roughly half the error' would benefit from exact values, confidence intervals, or ranges to allow precise comparison with baselines.
  2. [§3] Notation: The aggregation function combining entropy, top-k accuracy, and expert token rank is described only in prose; an explicit equation in §3 would improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions we will make to improve clarity and strengthen the claims.

Point-by-point responses
  1. Referee: [§3] §3 (Proxy construction): The central claim requires that the aggregated token statistics capture capability differences orthogonal to cross-entropy loss. The manuscript provides no ablation or partial-correlation analysis showing that the combination of entropy, top-k accuracy, and expert token rank supplies signal independent of loss when both are computed on the same expert solutions; without this, the reported outperformance in all three settings could reduce to distributional matching rather than genuine capability forecasting.

    Authors: We agree that an explicit analysis demonstrating signal independent of loss would better support the claim that the proxies capture capability differences beyond distributional matching. While the token-level statistics (entropy, top-k accuracy, expert token rank) are constructed to reflect properties such as uncertainty and alignment with expert choices that are not directly captured by aggregate cross-entropy loss, the manuscript does not include a partial-correlation or ablation study on the same expert solutions. We will add this analysis in the revision, including partial Spearman correlations between the proxy scores and downstream performance while controlling for loss, as well as an ablation comparing the combined proxy against loss alone. This will help substantiate that the reported gains reflect orthogonal signal. revision: yes

  2. Referee: [§4.1] §4.1 (Cross-family model selection): The mean Spearman ρ = 0.81 result is load-bearing for the first claim. The description does not specify the exact number of models, the precise downstream tasks used for ground-truth ranking, or any statistical significance test against the loss baseline of ρ = 0.36, making it impossible to judge whether the improvement generalizes or is tied to the particular expert-solution distribution.

    Authors: We acknowledge that the description in §4.1 omits key experimental details needed to fully evaluate the result. In the revised manuscript we will explicitly state the number of models evaluated, list the precise downstream tasks used to establish the ground-truth rankings, and report a statistical significance test (such as bootstrap resampling) comparing the proxy correlation to the loss baseline. These additions will allow readers to better assess generalizability. revision: yes
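
The two analyses promised in these responses, a partial Spearman correlation that controls for loss and a bootstrap comparison of the proxy against the loss baseline, can be sketched in a few lines of Python. Everything below is illustrative: the eight models and their scores are made-up values, the function names are hypothetical, and the percentile bootstrap is one common choice rather than the procedure the authors commit to.

```python
import numpy as np

def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    values = np.asarray(values, dtype=np.float64)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=np.float64)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])

def partial_spearman(x, y, z):
    """Rank correlation of x and y after controlling for z.

    Undefined when z is perfectly rank-correlated with x or y.
    """
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return float((r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2)))

def bootstrap_rho_gap(proxy, baseline, accuracy, n_boot=2000, seed=0):
    """95% percentile interval for rho(proxy) - rho(baseline), resampling models."""
    rng = np.random.default_rng(seed)
    n = len(accuracy)
    gaps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if np.unique(idx).size < 3:
            continue  # too few distinct models for a meaningful correlation
        gaps.append(spearman(proxy[idx], accuracy[idx])
                    - spearman(baseline[idx], accuracy[idx]))
    low, high = np.percentile(gaps, [2.5, 97.5])
    return float(low), float(high)

# Hypothetical scores for eight models (not taken from the paper).
accuracy = np.array([0.22, 0.31, 0.35, 0.44, 0.47, 0.58, 0.61, 0.70])
proxy = np.array([0.18, 0.30, 0.27, 0.41, 0.52, 0.49, 0.66, 0.71])
neg_loss = -np.array([2.3, 1.9, 2.6, 2.1, 2.4, 1.8, 2.2, 2.0])

partial = partial_spearman(proxy, accuracy, neg_loss)
ci_low, ci_high = bootstrap_rho_gap(proxy, neg_loss, accuracy)
print(f"partial rho = {partial:.2f}, gap 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A partial correlation that stays well above zero would indicate that the proxy carries signal beyond loss, and a bootstrap interval for the gap that excludes zero would support the claimed improvement over the loss baseline.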

Circularity Check

0 steps flagged

No significant circularity; empirical comparisons to external baselines

full rationale

The paper constructs proxy metrics from token-level statistics (entropy, top-k accuracy, expert token rank) aggregated over expert-written solutions and evaluates them empirically against loss- and compute-based baselines in three distinct settings: cross-family model ranking, pretraining data selection, and training-time forecasting. Reported gains (Spearman rho of 0.81 vs. 0.36, 10,000x compute reduction, halved extrapolation error) are obtained via direct measurement on separate evaluations rather than any fitted parameter being renamed as a prediction or any equation reducing to its own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing justifications for the central claims. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that expert trajectories supply useful signal for downstream capability; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Expert-written solutions provide a representative basis for computing token-level statistics that correlate with downstream task performance.
    The proxy construction explicitly uses next-token distributions over expert solutions as the source of entropy, top-k accuracy, and expert token rank.

pith-pipeline@v0.9.0 · 5787 in / 1295 out tokens · 53995 ms · 2026-05-20T10:26:23.579666+00:00 · methodology


Reference graph

Works this paper leans on

104 extracted references · 104 canonical work pages · 2 internal anchors

  1. [1] DataDecide: How to Predict Best Pretraining Data with Small Experiments. Forty-second International Conference on Machine Learning.

  2. [2] Revisiting the Scaling Properties of Downstream Metrics in Large Language Model Training. The Fourteenth International Conference on Learning Representations.

  3. [3] Scaling Laws for Downstream Task Performance in Machine Translation. The Thirteenth International Conference on Learning Representations.

  4. [4] Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jay Shin. Predicting... 2026.

  5. [5] Loss-to-Loss Prediction: Scaling Laws for All Datasets. Transactions on Machine Learning Research, 2025.

  6. [6] Observational Scaling Laws and the Predictability of Language Model Performance. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

  7. [7] Understanding Emergent Abilities of Language Models from the Loss Perspective. The Thirty-eighth Annual Conference on Neural Information Processing Systems.

  8. [8] Are Emergent Abilities of Large Language Models a Mirage? Thirty-seventh Conference on Neural Information Processing Systems.

  9. [9] Language models scale reliably with over-training and on downstream tasks. The Thirteenth International Conference on Learning Representations.

  10. [10] Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo. Why Has Predicting Downstream Capabilities of Frontier... 2025.

  11. [11] Fnu Devvrit, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S Dhillon, David Brandfonbrener, Rishabh Agarwal. The Art of Scaling Reinforcement Learning Compute for... 2026.

  12. [12] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, Samuel R. Bowman. 2024.

  13. [13] Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao, Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai, Aleksandra Korolova, Peter Henderson, Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie. LiveCodeBench Pro: How Do Olympiad Medalists Judge...

  14. [14] Can Language Models Solve Olympiad Programming? First Conference on Language Modeling.

  15. [15] Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simon Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Samuel Miserendino, Gildas Chabot, David Li, Patrick Chao, Michael Sharman, Alexandra Barr, Amelia Glaese, Jerry Tworek. ...

  16. [16] Measuring Massive Multitask Language Understanding. International Conference on Learning Representations (ICLR).

  17. [17] Chain of Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems, 2022.

  18. [18] Self-Consistency Improves Chain of Thought Reasoning in Language Models. The Eleventh International Conference on Learning Representations.

  19. [19] Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman. 2022.

  20. [20] Mislav Balunovic, Jasper Dekoninck, Ivo Petrov, Nikola Jovanovi... MathArena: Evaluating... The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  21. [21] Ralf Herbrich, Thore Graepel, Klaus Obermayer. Large Margin Rank Boundaries for Ordinal Regression. 2000. doi:10.7551/mitpress/1113.003.0010

  22. [22] An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 2022.

  23. [23] Scaling Laws for Neural Language Models. 2020.

  24. [24] Emergent Abilities of Large Language Models. Transactions on Machine Learning Research, 2022.

  25. [25] Broken Neural Scaling Laws. The Eleventh International Conference on Learning Representations.

  26. [26] Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 2022.

  27. [27] Language Models (Mostly) Know What They Know. 2022.

  28. [28] Uncertainty Estimation in Autoregressive Structured Prediction. International Conference on Learning Representations.

  29. [29] Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. The Eleventh International Conference on Learning Representations.

  30. [30] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 2024. doi:10.1038/s41586-024-07421-0

  31. [31] Free Process Rewards without Process Labels. Forty-second International Conference on Machine Learning.

  32. [32] GPT-4 Technical Report. 2024.

  33. [33] Scaling Data-Constrained Language Models. Thirty-seventh Conference on Neural Information Processing Systems.

  34. [34] A Hitchhiker's Guide to Scaling Law Estimation. Forty-second International Conference on Machine Learning.

  35. [35] Chinchilla Scaling: A replication attempt. 2024.

  36. [36] Resolving Discrepancies in Compute-Optimal Scaling of Language Models. 2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT@ICML 2024).

  37. [37] Establishing Task Scaling Laws via Compute-Efficient Model Ladders. Second Conference on Language Modeling.

  38. [38] Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, Heng Ji. Scaling Laws for Predicting Downstream Performance in... 2025.

  39. [39] Nicholas Lourie, Michael Y. Hu, Kyunghyun Cho. Scaling Laws Are Unreliable for Downstream Tasks: A Reality Check. Findings of the Association for Computational Linguistics: EMNLP 2025, 2025. doi:10.18653/v1/2025.findings-emnlp.877

  40. [40] Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models. Proceedings of the 40th International Conference on Machine Learning, 2023.

  41. [41] Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Tran, Dani Yogatama, Donald Metzler. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? Findings of the Association for Computational Linguistics: EMNLP 2023, 2023. doi:10.18653/v1/2023.fin...

  42. [42] Predicting Emergent Abilities with Infinite Resolution Evaluation. The Twelfth International Conference on Learning Representations.

  43. [43] Predicting Emergent Capabilities by Finetuning. First Conference on Language Modeling.

  44. [44] Prasanna Mayilvahanan, Thadd... Forty-second International Conference on Machine Learning.

  45. [45] Compression Represents Intelligence Linearly. First Conference on Language Modeling.

  46. [46]

    The Thirteenth International Conference on Learning Representations , year=

    What is Wrong with Perplexity for Long-context Language Modeling? , author=. The Thirteenth International Conference on Learning Representations , year=

  47. [47]

    The Thirteenth International Conference on Learning Representations , year=

    Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models , author=. The Thirteenth International Conference on Learning Representations , year=

  48. [48]

    Demystifying Prompts in Language Models via Perplexity Estimation

    Gonen, Hila and Iyer, Srini and Blevins, Terra and Smith, Noah and Zettlemoyer, Luke. Demystifying Prompts in Language Models via Perplexity Estimation. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.679

  49. [49]

    The Thirteenth International Conference on Learning Representations , year=

    Improving Pretraining Data Using Perplexity Correlations , author=. The Thirteenth International Conference on Learning Representations , year=

  50. [50]

    The Twelfth International Conference on Learning Representations , year=

    Let's Verify Step by Step , author=. The Twelfth International Conference on Learning Representations , year=

  51. [51]

    Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

    Wang, Peiyi and Li, Lei and Shao, Zhihong and Xu, Runxin and Dai, Damai and Li, Yifei and Chen, Deli and Wu, Yu and Sui, Zhifang. Math-Shepherd: Verify and Reinforce LLM s Step-by-step without Human Annotations. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.510

  52. [52]

    2025 , eprint=

    Process Reinforcement through Implicit Rewards , author=. 2025 , eprint=

  53. [53]

    Reasoning with language model is planning with world model

    Hao, Shibo and Gu, Yi and Ma, Haodi and Hong, Joshua and Wang, Zhen and Wang, Daisy and Hu, Zhiting. Reasoning with Language Model is Planning with World Model. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.507

  54. [54]

    tinyBenchmarks: evaluating

    Felipe Maia Polo and Lucas Weber and Leshem Choshen and Yuekai Sun and Gongjun Xu and Mikhail Yurochkin , booktitle=. tinyBenchmarks: evaluating. 2024 , url=

  55. [55]

    Anchor Points: Benchmarking Models with Much Fewer Examples

    Vivek, Rajan and Ethayarajh, Kawin and Yang, Diyi and Kiela, Douwe. Anchor Points: Benchmarking Models with Much Fewer Examples. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.95

  56. [56]

    The Thirteenth International Conference on Learning Representations , year=

    metabench - A Sparse Benchmark of Reasoning and Knowledge in Large Language Models , author=. The Thirteenth International Conference on Learning Representations , year=

  57. [57]

    Second Conference on Language Modeling , year=

    Fluid Language Model Benchmarking , author=. Second Conference on Language Modeling , year=

  58. [58]

    Training Trajectories of Language Models Across Scales

    Xia, Mengzhou and Artetxe, Mikel and Zhou, Chunting and Lin, Xi Victoria and Pasunuru, Ramakanth and Chen, Danqi and Zettlemoyer, Luke and Stoyanov, Veselin. Training Trajectories of Language Models Across Scales. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl...

  59. [59]

    Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

    Chang, Tyler A. and Tu, Zhuowen and Bergen, Benjamin K. Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00708

  60. [60]

    How predictable is language model benchmark performance?

    How predictable is language model benchmark performance? 2024.

  61. [61]

    Sloth: scaling laws for

    Felipe Maia Polo and Seamus Somerstep and Leshem Choshen and Yuekai Sun and Mikhail Yurochkin. Sloth: scaling laws for. 2026.

  62. [62]

    Paloma: A Benchmark for Evaluating Language Model Fit

    Paloma: A Benchmark for Evaluating Language Model Fit. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  63. [63]

    Small-scale proxies for large-scale Transformer training instabilities

    Small-scale proxies for large-scale Transformer training instabilities. The Twelfth International Conference on Learning Representations.

  64. [64]

    Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

    Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer. Advances in Neural Information Processing Systems. 2021.

  65. [65]

    DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

    DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining. Thirty-seventh Conference on Neural Information Processing Systems.

  66. [66]

    RegMix: Data Mixture as Regression for Language Model Pre-training

    RegMix: Data Mixture as Regression for Language Model Pre-training. The Thirteenth International Conference on Learning Representations.

  67. [67]

    Eric Zelikman and Georges Raif Harik and Yijia Shao and Varuna Jayasiri and Nick Haber and Noah Goodman. Quiet-. 2024.

  68. [68]

    Solving math word problems with process- and outcome-based feedback

    Solving math word problems with process- and outcome-based feedback. 2022.

  69. [69]

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision

    Improve Mathematical Reasoning in Language Models by Automated Process Supervision. 2024.

  70. [70]

    Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws

    Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Forty-first International Conference on Machine Learning.

  71. [71]

    Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments

    Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments. 2022.

  72. [72]

    Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency

    Ganguli, Deep and Hernandez, Danny and Lovitt, Liane and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and Dassarma, Nova and Drain, Dawn and Elhage, Nelson and El Showk, Sheer and Fort, Stanislav and Hatfield-Dodds, Zac and Henighan, Tom and Johnston, Scott and Jones, Andy and Joseph, Nicholas and Kernian, Jackson and Kravec, Shauna and ...

  73. [73]

    Unveiling Downstream Performance Scaling of

    Chengyin Xu and Kaiyuan Chen and Xiao Li and Ke Shen and Chenggang Li. Unveiling Downstream Performance Scaling of. 2026.

  74. [74]

    Data Selection for Language Models via Importance Resampling

    Data Selection for Language Models via Importance Resampling. Thirty-seventh Conference on Neural Information Processing Systems.

  75. [75]

    Humanity's Last Exam

    Phan, Long and Gatti, Alice and Li, Nathaniel and Khoja, Adam and Kim, Ryan and Ren, Richard and Hausenloy, Jason and Zhang, Oliver and Mazeika, Mantas and Hendrycks, Dan and Han, Ziwen and Hu, Josephina and Zhang, Hugh and Zhang, Chen Bo Calvin and Shaaban, Mohamed and Ling, John and Shi, Sean and Choi, Michael and Agrawal, Anish and Chopra, Arnav and Na...

  76. [76]

    2025

    Hjalmar Wijk and Tao Roa Lin and Joel Becker and Sami Jawhar and Neev Parikh and Thomas Broadley and Lawrence Chan and Michael Chen and Joshua M Clymer and Jai Dhyani and Elena Ericheva and Katharyn Garcia and Brian Goodrich and Nikola Jurkovic and Megan Kinniment and Aron Lajko and Seraphina Nix and Lucas Jun Koba Sato and William Saunders and Maksym Tar...

  77. [77]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

  78. [78]

    American Invitational Mathematics Examination (AIME) 2025

  79. [79]

    Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen. 2024.

  80. [80]

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

    SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines. 2025.
