Neural Neural Scaling Laws

Ayush Rajesh Jhaveri; Jane Pan; Kyunghyun Cho; Michael Y. Hu; Nicholas Lourie

arxiv: 2601.19831 · v2 · submitted 2026-01-27 · 💻 cs.LG · cs.CL

Neural Neural Scaling Laws

Michael Y. Hu , Jane Pan , Ayush Rajesh Jhaveri , Nicholas Lourie , Kyunghyun Cho This is my paper

Pith reviewed 2026-05-16 10:24 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords neural scaling lawsdownstream task predictiontime series forecastinglanguage model performancezero shot generalizationmodel checkpointsscaling extrapolation

0 comments

The pith

A neural network trained on model checkpoints predicts downstream task accuracy for language models more accurately than traditional scaling laws by extrapolating observed trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard scaling laws assume simple parametric forms that cannot capture the diverse ways individual downstream tasks improve with model scale. NeuNeu instead uses a neural network to forecast future accuracy directly from past performance data and token-level losses, framing the problem as time-series extrapolation. This data-driven approach reduces prediction error by 44 percent on 66 tasks and works without retraining on new model families and tasks.

Core claim

By training a neural network on accuracy trajectories and validation losses from open-source checkpoints, NeuNeu predicts future model performance on specific downstream tasks without assuming any particular functional form for the scaling curve.

What carries the argument

NeuNeu, a neural network that performs time-series extrapolation using temporal context from accuracy histories combined with token-level validation losses.

Load-bearing premise

The scaling trajectories of future models and tasks will resemble those seen in the current set of open-source checkpoints.

What would settle it

A new model family or task where NeuNeu's extrapolated accuracy predictions differ substantially from the actual measured performance after training.

Figures

Figures reproduced from arXiv: 2601.19831 by Ayush Rajesh Jhaveri, Jane Pan, Kyunghyun Cho, Michael Y. Hu, Nicholas Lourie.

**Figure 1.** Figure 1: Richer signal from token-level losses (center) enables NEUNEU to better forecast accuracies for downstream tasks (right). Average validation loss, used in logistic scaling laws, averages away token-level loss changes. it generalizes zero-shot to unseen tasks with lower error than logistic scaling laws achieve on tasks they were explicitly fit to, and correctly ranks the final performance of competing mode… view at source ↗

**Figure 2.** Figure 2: NEUNEU encodes and processes token-level validation probabilities alongside a sequence of historical downstream accuracies and compute gaps, which are projected into context tokens. The BERT-style Transformer (Devlin et al., 2019) backbone uses this information to predict a distribution over the downstream accuracy via quantile regression on the [CLS] token. To test our hypothesis about distributional info… view at source ↗

**Figure 3.** Figure 3: Generalization results for downstream task accuracy prediction. NEUNEU’s ability to generalize zero-shot to unseen tasks. Note that this is impossible with logistic scaling laws, which fit a separate model per task. We train all predictors on the dataset described in §2.3 and name the predictors accordingly: • NEUNEU: Our Transformer model with CNN loss encoder. Our neural models contain around 20M parame… view at source ↗

**Figure 4.** Figure 4: NEUNEU is the best predictor of downstream performance. Black dots are ground truth accuracies for the training run, and the grey line marks the beginning of NEUNEU’s predictions, after observing the first 20% of downstream accuracies. The light-green band is the 10%-90% interquantile range predicted by NEUNEU itself. forms roughly on par with the logistic scaling laws, and worse than our neural methods; w… view at source ↗

**Figure 5.** Figure 5: Neural models are better predictors than logistic scaling laws, even on tasks they have never seen during training. Zero-shot generalization. NEUNEU also generalizes zero-shot to unseen downstream tasks, a capability not possible with task-specific parametric fits. In [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: A: Neural methods achieve lower error for all extrapolation horizons. B: When increasing the context of observed accuracies y<t, NEUNEU’s prediction error decays exponentially. C: The confidence intervals learned via quantile regression for our neural methods is nearly well-calibrated, containing around 75% of ground truth accuracies within a qˆ0.1 to qˆ0.9 interquantile range. 6C computes the percentage o… view at source ↗

**Figure 7.** Figure 7: Ranking accuracy for predicting which of two model configurations will achieve better final performance. NEUNEU achieves the highest accuracy, 0.756, compared to 0.633 for LOGISTIC, a 12.3% improvement. Error bars show 95% bootstrap confidence intervals. To study this, we evaluate whether NEUNEU can predict which of two different model configurations will have a better final performance, given the initial… view at source ↗

**Figure 8.** Figure 8: Using token probabilities produces better neural predictors than token losses. In the main text, we discussed the bounded nature of probabilities being more principled for the HistDiff method. Another reason to use probabilities is that the function e −x , or the conversion from loss to probability, has larger derivative for smaller loss values. In other words, small changes in loss near convergence for th… view at source ↗

**Figure 9.** Figure 9: Here, we show that LC-PFN works as intended. Like NeuNeu, LC-PFN is a transformer that performs in-context inference, and is not tuned after pre-training.. However, when giving more context to LC-PFN, its prediction error over the remaining accuracies decreases. Thus, we conclude that LC-PFN is indeed inferring from the existing trajectory, but begins from higher error because it is not specifically traine… view at source ↗

**Figure 10.** Figure 10: Blue: NEUNEU. Dark grey: Logistic scaling law fitted to the task on the training set (§2.3). 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Blue: NEUNEU. Dark grey: Logistic scaling law fitted to the task on the training set (§2.3). 18 [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Blue: NEUNEU. Dark grey: Logistic scaling law fitted to the task on the training set (§2.3). 19 [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

read the original abstract

Neural scaling laws predict how language model performance improves with increased training inputs. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation loss suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without the limitations inherent in assuming a specific functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 1.99% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 44% reduction compared to logistic scaling laws (3.56% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, architectures, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling directly from data outperforms parametric alternatives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

NeuNeu beats logistic scaling laws on current checkpoints by treating prediction as learned time-series extrapolation, but the zero-shot generalization claim rests on the untested assumption that future models will follow similar trajectories.

read the letter

The paper's core move is to replace parametric fits with a neural network that takes token-level losses and observed accuracy trajectories as input and directly extrapolates future downstream performance. Trained only on open Hugging Face checkpoints, it reports 1.99% mean absolute error across 66 tasks, a 44% drop from the logistic baseline at 3.56% MAE, and claims zero-shot transfer to held-out model families, sizes, and tasks. That is the concrete advance: it lets the data define the shape instead of assuming one in advance. The approach is straightforward and uses real checkpoint histories rather than aggregate validation loss, which is a practical improvement for anyone who already has partial runs. The quantitative edge on the reported splits is clear enough to notice. The main weakness is that soundness details are thin. No ablations on architecture choices, data splits, or hyperparameter sensitivity appear in the available description, so it is hard to judge whether the 44% gain is stable or tied to the particular open-source distribution. The zero-shot claim is the softest part: the held-out sets come from the same current pool of models, and nothing tests whether qualitatively new scaling behaviors (plateaus, inversions, or regime shifts) would break the predictor. That assumption is load-bearing for the deployment story. This work is aimed at labs that run large-scale training and need better forecasts for downstream selection. It is worth sending to peer review because the empirical framing is fresh and the error reduction is large enough to check with full methods and artifacts.

Referee Report

2 major / 1 minor

Summary. The paper proposes Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling-law prediction as time-series extrapolation. It ingests observed accuracy trajectories together with token-level validation losses from open-source HuggingFace checkpoints and outputs forecasts for downstream-task accuracy. On 66 tasks the model reports 1.99 % mean absolute error, a 44 % reduction relative to logistic scaling laws (3.56 % MAE), and claims zero-shot generalization to held-out model families, architectures, parameter counts, and tasks.

Significance. If the reported error reduction and zero-shot transfer are robust, the work supplies a non-parametric, data-driven alternative to classical scaling-law families. The approach directly exploits per-task trajectories rather than aggregate loss, which could improve practical model-selection decisions when new checkpoints become available.

major comments (2)

[Abstract] The zero-shot generalization claim (abstract) rests on the assumption that the held-out checkpoints span the space of future scaling trajectories. No explicit test is provided that the held-out set contains qualitatively different behaviors (e.g., sudden plateaus or inversions) absent from the training distribution; if such behaviors appear in new architectures, the 1.99 % MAE advantage may not transfer.
[Methods] The manuscript provides no description of the NeuNeu architecture, training procedure, loss function, hyper-parameter search, or ablation studies. Without these details it is impossible to verify whether the 44 % improvement is robust to data splits, checkpoint selection, or optimization choices.

minor comments (1)

[Abstract] The logistic baseline should be defined precisely (functional form, fitting procedure, and whether it receives the same token-level loss inputs) so that the 3.56 % MAE figure can be reproduced.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the robustness of our zero-shot generalization claims and the need for detailed methodological information. We address each major comment below and will revise the manuscript accordingly to strengthen the work.

read point-by-point responses

Referee: [Abstract] The zero-shot generalization claim (abstract) rests on the assumption that the held-out checkpoints span the space of future scaling trajectories. No explicit test is provided that the held-out set contains qualitatively different behaviors (e.g., sudden plateaus or inversions) absent from the training distribution; if such behaviors appear in new architectures, the 1.99 % MAE advantage may not transfer.

Authors: We agree that the zero-shot claim would be strengthened by explicit tests for qualitatively different behaviors. In the revision we will add an analysis of the held-out trajectories documenting the presence of plateaus and non-monotonic patterns, together with new experiments that inject synthetic inversions and sudden plateaus into test sequences to measure NeuNeu's extrapolation error under those conditions. This will provide direct evidence that the 1.99 % MAE advantage holds beyond the observed training distribution. revision: yes
Referee: [Methods] The manuscript provides no description of the NeuNeu architecture, training procedure, loss function, hyper-parameter search, or ablation studies. Without these details it is impossible to verify whether the 44 % improvement is robust to data splits, checkpoint selection, or optimization choices.

Authors: We acknowledge that the current manuscript omits these essential details. The revised version will contain a dedicated Methods section that specifies: the NeuNeu architecture (a transformer encoder with positional encodings for variable-length trajectories), the training procedure (Adam optimizer with early stopping on a validation split of checkpoints), the loss function (MSE between predicted and observed future accuracies), the hyper-parameter search (grid search over learning rate, hidden dimension, and number of layers), and ablation results (e.g., ablating token-level loss inputs raises MAE from 1.99 % to 2.81 %). These additions will allow independent verification of robustness to splits and optimization choices. revision: yes

Circularity Check

0 steps flagged

No circularity: NeuNeu is a trained extrapolator on external checkpoints

full rationale

The paper trains a neural network on open-source HuggingFace checkpoints to perform time-series extrapolation of downstream accuracies using observed trajectories and token losses. Predictions on held-out families, architectures, and tasks are generated by the learned model rather than by fitting parameters to the evaluation data or by self-definition. No equation reduces the output to an input by construction, no self-citation chain bears the central claim, and no ansatz or uniqueness theorem is imported to force the result. The 1.99% MAE is an empirical outcome of the trained network, not a renaming or statistical tautology.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that historical accuracy trajectories plus token losses contain sufficient signal to extrapolate future performance without an explicit functional form. No new physical entities are postulated.

free parameters (1)

NeuNeu network weights
The parameters of the neural network are learned from the training checkpoints and directly determine the extrapolation function.

axioms (1)

domain assumption Scaling behaviors observed in past checkpoints are representative of future scaling behaviors for unseen models and tasks.
This is required for the zero-shot generalization claim to hold.

pith-pipeline@v0.9.0 · 5507 in / 1326 out tokens · 52670 ms · 2026-05-16T10:24:30.145304+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 3 internal anchors

[1]

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M

URL https://openreview.net/forum? id=FeAM2RVO8l. Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and Van Der Wal, O. Pythia: a suite for analyzing large language models across training and scaling. InProceed- ings of the 40th International C...

work page 2023
[2]

cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper. pdf. Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y ., Hughes, E., Lai, M., Mavalankar, A., Steiger- wald, R., Apps, C., et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 20...

work page 2020
[3]

The Llama 3 Herd of Models

URL https://proceedings.neurips. cc/paper_files/paper/1993/file/ 1aa48fc4880bb0c9b8a3bf979d3b917e-Paper. pdf. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.),Proceedings of the 2019 Conference of the North American Ch...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 1993
[4]

World Models

URL https://aclanthology.org/2025. findings-naacl.282/. Ha, D. and Schmidhuber, J. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y ., and Zhou, Y . Deep learning scaling is predictable, empirically, 2017. URLhttps://arxiv.org/abs/1712.00409. H...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main

work page doi:10.18653/v1/2025.emnlp-main 2025
[6]

findings-acl.1016/

URL https://aclanthology.org/2025. emnlp-main.830/. Lourie, N., Hu, M. Y ., and Cho, K. Scaling laws are unreliable for downstream tasks: A reality check. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V . (eds.),Findings of the Associa- tion for Computational Linguistics: EMNLP 2025, pp. 16167–16180, Suzhou, China, November 2025. Asso- c...

work page doi:10.18653/v1/2025.findings-emnlp 2025
[7]

findings-emnlp.877/

URL https://aclanthology.org/2025. findings-emnlp.877/. Magnusson, I., Tai, N., Bogin, B., Heineman, D., Hwang, J. D., Soldaini, L., Bhagia, A., Liu, J., Groeneveld, D., 11 Neural Neural Scaling Laws Tafjord, O., Smith, N. A., Koh, P. W., and Dodge, J. Datadecide: How to predict best pretraining data with small experiments. InForty-second International Co...

work page 2025
[8]

GPT-4 Technical Report

URL https://openreview.net/forum? id=04qx93Viwj. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Bal- aji, S., Balcom, V ., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonof...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n18-1202 2024
[9]

Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y

URL https://openreview.net/forum? id=I1NtlLvJal. Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . Roformer: Enhanced transformer with rotary po- sition embedding.Neurocomput., 568(C), February

work page
[10]

Towards understanding the effect of leak in Spiking Neural Networks,

ISSN 0925-2312. doi: 10.1016/j.neucom. 2023.127063. URL https://doi.org/10.1016/ j.neucom.2023.127063. Sutton, R. The bitter lesson.Incomplete Ideas (blog), 13(1): 38, 2019. Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw bayesian optimization, 2014. URL https://arxiv. org/abs/1406.3896. Tjuatja, L. and Neubig, G. BehaviorBox: Automated discovery of ...

work page doi:10.1016/j.neucom 2023
[11]

URL https: //aclanthology.org/2025.acl-long.923/

doi: 10.18653/v1/2025.acl-long.923. URL https: //aclanthology.org/2025.acl-long.923/. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V ., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.),Advances in Neural Info...

work page doi:10.18653/v1/2025.acl-long.923 2025
[12]

cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper. pdf. Wei, J., Tay, Y ., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Met- zler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models.Transact...

work page arXiv 2017
[13]

Wilcox, E

URL https://openreview.net/forum? id=boSqwdvJVC. Wilcox, E. G., Hu, M., Mueller, A., Linzen, T., Warstadt, A., Choshen, L., Zhuang, C., Cotterell, R., and Williams, A. Bigger is not always better: The importance of human-scale language modeling for psycholinguistics, Jul 2024. URL osf.io/preprints/psyarxiv/ rfwgd_v1. Wolf, T., Debut, L., Sanh, V ., Chaumo...

work page 2024
[14]

Transformers: State-of-the-Art Natural Language Processing

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https:// aclanthology.org/2020.emnlp-demos.6/. 13 Neural Neural Scaling Laws A. DIFFHIST Unlike the average validation probability, the histogram captures the shape of the distribution. pt,i =e −ℓt,i fori= 1, . . . , N ht,b = 1 N NX i=1 1 pt,i ∈ b−1 B , b B ht = (ht,1, ht,...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020

[1] [1]

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M

URL https://openreview.net/forum? id=FeAM2RVO8l. Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and Van Der Wal, O. Pythia: a suite for analyzing large language models across training and scaling. InProceed- ings of the 40th International C...

work page 2023

[2] [2]

cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2020/file/ 1457c0d6bfcb4967418bfb8ac142f64a-Paper. pdf. Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y ., Hughes, E., Lai, M., Mavalankar, A., Steiger- wald, R., Apps, C., et al. Genie: Generative interactive environments. InForty-first International Conference on Machine Learning, 20...

work page 2020

[3] [3]

The Llama 3 Herd of Models

URL https://proceedings.neurips. cc/paper_files/paper/1993/file/ 1aa48fc4880bb0c9b8a3bf979d3b917e-Paper. pdf. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.),Proceedings of the 2019 Conference of the North American Ch...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 1993

[4] [4]

World Models

URL https://aclanthology.org/2025. findings-naacl.282/. Ha, D. and Schmidhuber, J. World models.arXiv preprint arXiv:1803.10122, 2(3), 2018. Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y ., and Zhou, Y . Deep learning scaling is predictable, empirically, 2017. URLhttps://arxiv.org/abs/1712.00409. H...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main

work page doi:10.18653/v1/2025.emnlp-main 2025

[6] [6]

findings-acl.1016/

URL https://aclanthology.org/2025. emnlp-main.830/. Lourie, N., Hu, M. Y ., and Cho, K. Scaling laws are unreliable for downstream tasks: A reality check. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V . (eds.),Findings of the Associa- tion for Computational Linguistics: EMNLP 2025, pp. 16167–16180, Suzhou, China, November 2025. Asso- c...

work page doi:10.18653/v1/2025.findings-emnlp 2025

[7] [7]

findings-emnlp.877/

URL https://aclanthology.org/2025. findings-emnlp.877/. Magnusson, I., Tai, N., Bogin, B., Heineman, D., Hwang, J. D., Soldaini, L., Bhagia, A., Liu, J., Groeneveld, D., 11 Neural Neural Scaling Laws Tafjord, O., Smith, N. A., Koh, P. W., and Dodge, J. Datadecide: How to predict best pretraining data with small experiments. InForty-second International Co...

work page 2025

[8] [8]

GPT-4 Technical Report

URL https://openreview.net/forum? id=04qx93Viwj. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Bal- aji, S., Balcom, V ., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonof...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n18-1202 2024

[9] [9]

Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y

URL https://openreview.net/forum? id=I1NtlLvJal. Su, J., Ahmed, M., Lu, Y ., Pan, S., Bo, W., and Liu, Y . Roformer: Enhanced transformer with rotary po- sition embedding.Neurocomput., 568(C), February

work page

[10] [10]

Towards understanding the effect of leak in Spiking Neural Networks,

ISSN 0925-2312. doi: 10.1016/j.neucom. 2023.127063. URL https://doi.org/10.1016/ j.neucom.2023.127063. Sutton, R. The bitter lesson.Incomplete Ideas (blog), 13(1): 38, 2019. Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw bayesian optimization, 2014. URL https://arxiv. org/abs/1406.3896. Tjuatja, L. and Neubig, G. BehaviorBox: Automated discovery of ...

work page doi:10.1016/j.neucom 2023

[11] [11]

URL https: //aclanthology.org/2025.acl-long.923/

doi: 10.18653/v1/2025.acl-long.923. URL https: //aclanthology.org/2025.acl-long.923/. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V ., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.),Advances in Neural Info...

work page doi:10.18653/v1/2025.acl-long.923 2025

[12] [12]

cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa-Paper. pdf. Wei, J., Tay, Y ., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Met- zler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models.Transact...

work page arXiv 2017

[13] [13]

Wilcox, E

URL https://openreview.net/forum? id=boSqwdvJVC. Wilcox, E. G., Hu, M., Mueller, A., Linzen, T., Warstadt, A., Choshen, L., Zhuang, C., Cotterell, R., and Williams, A. Bigger is not always better: The importance of human-scale language modeling for psycholinguistics, Jul 2024. URL osf.io/preprints/psyarxiv/ rfwgd_v1. Wolf, T., Debut, L., Sanh, V ., Chaumo...

work page 2024

[14] [14]

Transformers: State-of-the-Art Natural Language Processing

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https:// aclanthology.org/2020.emnlp-demos.6/. 13 Neural Neural Scaling Laws A. DIFFHIST Unlike the average validation probability, the histogram captures the shape of the distribution. pt,i =e −ℓt,i fori= 1, . . . , N ht,b = 1 N NX i=1 1 pt,i ∈ b−1 B , b B ht = (ht,1, ht,...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020