arXiv: 2502.12120 · v3 · pith:FSBVO47Ynew · submitted 2025-02-17 · 💻 cs.LG · cs.AI · cs.CL

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Pith reviewed 2026-05-23 02:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: scaling laws, large language models, pretraining data, loss-to-loss scaling, model architecture, downstream performance, transformers, state-space models

The pith

Pretraining data determines loss-to-loss scaling trends in LLMs, overriding model size, architecture, and training choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests what controls how pretraining loss relates to performance on later tasks through loss-to-loss scaling laws. Experiments across multiple setups show that the specific pretraining data used fixes the scaling relationship. Model size, optimization details, tokenizers, and even large architectural shifts between transformers like Llama and state-space models like Mamba produce only minor changes when data stays the same. A reader would care because this shifts priority toward dataset selection for predictable downstream gains rather than repeated architecture searches.

Core claim

Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact.

What carries the argument

Loss-to-loss scaling laws relating pretraining loss to downstream task performance, with pretraining data as the primary controlling factor.
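
As a rough illustration of what such a law looks like in practice, the sketch below fits a shifted power law of the form L_y = K * (L_x - E_x)^kappa + E_y, the shape used in the loss-to-loss literature this paper builds on (Brandfonbrener et al., 2024). The function names, the synthetic checkpoint losses, and the irreducible-loss values `E_X` and `E_Y` are assumptions made for illustration. The paper estimates the irreducible terms from separate compute-to-loss fits (Figure 10); this sketch skips that step and takes them as given.

```python
import math


def fit_shifted_power_law(train_loss, test_loss, e_x, e_y):
    """Fit L_y = K * (L_x - e_x)**kappa + e_y with e_x and e_y held fixed.

    With the irreducible losses known, the law is linear in log space:
    log(L_y - e_y) = log(K) + kappa * log(L_x - e_x),
    so ordinary least squares on the logs recovers K and kappa.
    """
    u = [math.log(x - e_x) for x in train_loss]
    v = [math.log(y - e_y) for y in test_loss]
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sxx = sum((a - mean_u) ** 2 for a in u)
    kappa = sxy / sxx
    k = math.exp(mean_v - kappa * mean_u)
    return k, kappa


def predict_test_loss(train_loss, k, kappa, e_x, e_y):
    """Evaluate the fitted law at a single pretraining loss value."""
    return k * (train_loss - e_x) ** kappa + e_y


# Hypothetical constants, not taken from the paper.
E_X, E_Y = 1.8, 2.5
TRUE_K, TRUE_KAPPA = 1.4, 0.9

# Synthetic "checkpoints": validation losses and matching downstream losses.
val_losses = [3.0, 3.3, 3.6, 4.0, 4.5, 5.0]
test_losses = [predict_test_loss(x, TRUE_K, TRUE_KAPPA, E_X, E_Y) for x in val_losses]

k_hat, kappa_hat = fit_shifted_power_law(val_losses, test_losses, E_X, E_Y)
print(f"K = {k_hat:.3f}, kappa = {kappa_hat:.3f}")
```

Fixing E_x and E_y turns the fit into a straight-line regression, which keeps the example dependency-free; fitting all four parameters jointly would need a nonlinear optimizer.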

If this is right

  • Practitioners should prioritize curating pretraining datasets to achieve desired downstream scaling behavior.
  • Model architectures and optimization settings can be chosen mainly for training speed and cost without changing the expected loss-to-loss relationship.
  • Scaling predictions for new tasks can be based primarily on the pretraining data used rather than the specific model details.
  • Different model families will exhibit similar scaling behavior when trained on the same data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Data curation may become the main lever for controlling generalization patterns across many future model designs.
  • The result suggests testing whether particular data properties, such as domain coverage or token statistics, are what actually fix the scaling line.
  • If data dominates, then methods that alter effective training data during pretraining could be used to steer downstream scaling without retraining from scratch.

Load-bearing premise

The range of models, data sources, and tasks tested is representative enough for the data-dominance conclusion to apply beyond these specific cases.

What would settle it

A clear case where two models with different architectures or sizes, trained on identical pretraining data, produce substantially different loss-to-loss scaling slopes on the same downstream tasks.
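
One way to operationalize this test, sketched under stated assumptions: fit the log-log exponent separately for two model families trained on the same data and compare the two values. The helper names, the synthetic losses, the shared irreducible losses, and the `TOLERANCE` threshold are all hypothetical; the paper does not prescribe a specific threshold.

```python
import math


def loglog_slope(x_loss, y_loss, e_x, e_y):
    """Least-squares slope of log(y - e_y) against log(x - e_x)."""
    u = [math.log(x - e_x) for x in x_loss]
    v = [math.log(y - e_y) for y in y_loss]
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sxx = sum((a - mean_u) ** 2 for a in u)
    return sxy / sxx


def slope_gap(family_a, family_b, e_x, e_y):
    """Absolute difference between the fitted exponents of two families.

    Each family is a pair (pretraining_losses, downstream_losses) gathered
    from checkpoints that share the same pretraining data.
    """
    return abs(loglog_slope(*family_a, e_x, e_y) - loglog_slope(*family_b, e_x, e_y))


def synthetic_losses(x_loss, k, kappa, e_x, e_y):
    """Generate noise-free downstream losses from a shifted power law."""
    return [k * (x - e_x) ** kappa + e_y for x in x_loss]


# Hypothetical irreducible losses, assumed equal for both families.
E_X, E_Y = 1.8, 2.5
arch_a_x = [3.0, 3.4, 3.9, 4.5]
arch_b_x = [3.2, 3.7, 4.2, 4.8]

arch_a = (arch_a_x, synthetic_losses(arch_a_x, 1.4, 0.9, E_X, E_Y))
arch_b_same_line = (arch_b_x, synthetic_losses(arch_b_x, 1.4, 0.9, E_X, E_Y))
arch_b_other_line = (arch_b_x, synthetic_losses(arch_b_x, 1.4, 1.3, E_X, E_Y))

TOLERANCE = 0.05  # hypothetical threshold for "substantially different"
gap_same = slope_gap(arch_a, arch_b_same_line, E_X, E_Y)
gap_other = slope_gap(arch_a, arch_b_other_line, E_X, E_Y)
print(gap_same < TOLERANCE, gap_other < TOLERANCE)
```

A gap above the tolerance on matched data would count as evidence against the data-dominance claim; on real checkpoints the threshold should be set relative to run-to-run variation.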

Figures

Figures reproduced from arXiv: 2502.12120 by Matthias Bethge, Prasanna Mayilvahanan, Sayak Mallick, Thaddäus Wiedemer, Wieland Brendel.

Figure 1: LLMs' loss-to-loss scaling follows power laws primarily shaped by the choice of pretraining data. Using Llama trained on FineWeb-Edu as a baseline, we intervene on various factors to assess their impact on train-to-test loss scaling. Changing the pretraining data has the largest effect. Changing the tokenizer, the architecture (e.g., from Llama to Mamba), model size, context length, and optimizer setting…
Figure 2: Loss-to-loss scaling consistently obeys power laws. We extend results from Brandfonbrener et al. (2024) to many architectures, training settings, and validation/test sets. We show illustrative shifted power laws for Mamba trained on FineWeb-Edu here; more configurations and test sets can be found in App. E. For clarity, scatter plots display a random sample of all data points; all points are used to fit th…
Figure 3: Schematic of our causal analysis. Checkpoints of a base model trained on different numbers of tokens and with different seeds lie on the same loss-to-loss line. Better-performing models (typically with higher compute) achieve lower loss (towards the bottom left). We intervene on training settings (e.g., pretraining data, architecture, etc.) and retrain from scratch, yielding new models that again consti…
Figure 4: Pretraining data has a substantial impact on loss-to-loss scaling laws. Models are matched on architecture and tokenizer.
[Adjacent paper text] Tokenizers: We train Llama and Mamba with either a tiktoken tokenizer (128k vocabulary size) or the gpt2 tokenizer (50,257 vocabulary size). Pretrained models from Hugging Face use an almost identical GPT-2 tokenizer, dubbed gpt2-HF. This version does not explicitly pad text with beginn…
Figure 5: The tokenizer has a minor impact on loss-to-loss scaling laws. Models are matched on pretraining data and architecture.
[Adjacent paper text] … models) and Mamba (a state-space model). These results raise an important question: Do current architectures encode distinct inductive biases or converge to similar solutions given the same training data? Further research is needed to understand the implications of this finding. Takeaway …
Figure 6: Architecture has limited impact on loss-to-loss scaling laws. Models are matched on pretraining data and tokenizer.
[Plot text: Average Val. Loss vs. FineWeb-Edu Validation Loss; pretraining data C4, FineWeb-Edu, The Pile UC; scale: Size (1e8); architecture Llama | Mamba; tokenizer tiktoken]
Figure 8: Context length does not affect loss-to-loss scaling. Again, distinct lines correspond to different pretraining distributions (compare …
Figure 9: Optimization settings do not affect loss-to-loss scaling.
[Adjacent paper text] Implications for Balancing Performance: If the aim is not only optimal average downstream performance but also a specific weighting between different tasks, e.g., to ensure a balanced downstream performance, individual train-to-test scaling laws can be used to tune a model's performance. Here, too, the pretraining data has the largest impact and prac…
Figure 10: Example compute-to-loss scaling law fits. Each loss-to-loss scaling law requires fitting two compute-to-loss scaling laws to estimate E_{x|p} and E_{y|p}. The three fits here are used for The Pile UC and HellaSwag curves in …
Figure 11: Loss-to-Loss Scaling for FineWeb-Edu-trained Llama.
Figure 12: Loss-to-Loss Scaling for C4-trained Llama.
[Plot text: panel Train-to-Train (Validation Loss vs. FineWeb-Edu Validation Loss; validation sets The Pile UC, RefineWeb, Slimpajama, C4); panel Train-to-Test (Test Loss vs. FineWeb-Edu Validation Loss; test sets ARC-Challenge, ARC-Easy, COPA, PIQA, Winogrande, HellaSwag, CommonSenseQA, Social IQa, MMLU)]
Figure 13: Loss-to-Loss Scaling for The Pile-trained Llama.
Figure 14: Loss-to-Loss Scaling for FineWeb-Edu-trained Mamba.
[Plot text: panel Train-to-Train (Validation Loss vs. C4 Validation Loss; validation sets The Pile UC, RefineWeb, Slimpajama, FineWeb-Edu); panel Train-to-Test (Test Loss vs. C4 Validation Loss; test sets ARC-Challenge, ARC-Easy, COPA, PIQA, Winogrande, HellaSwag, CommonSenseQA, Social IQa, MMLU)]
Figure 15: Loss-to-Loss Scaling for C4-trained Mamba.
Figure 16: Loss-to-Loss Scaling for The Pile-trained Mamba.
[Plot text: Average Val. Loss and Average Test Loss vs. C4 Val. Loss; panels by architecture and tokenizer: Llama/tiktoken, Llama/gpt2, Llama/gpt2-HF, Mamba/tiktoken, Mamba/gpt2 …]
Figure 17: Pretraining data has a substantial impact on loss-to-loss scaling laws.
Figure 18: The tokenizer has a minor impact on loss-to-loss scaling laws.
[Plot text: Average Val. Loss and Average Test Loss vs. C4 Val. Loss; panels by pretraining data and tokenizer: C4/gpt2, C4/tiktoken, FW-Edu/gpt2, FW-Edu/tiktoken …]
Figure 19: Architecture has limited impact on loss-to-loss scaling laws.
Figure 20: Model size does not affect train-to-test scaling.
[Plot text: Average Val. Loss vs. C4 Validation Loss; pretraining data C4, FineWeb-Edu, The Pile UC; scale: Context Length (1500 to 3000); architecture Llama | Mamba; tokenizer tiktoken]
Figure 21: Context length does not affect train-to-test scaling.
Figure 22: Optimizer settings do not affect train-to-test scaling.
[Plot text: The Pile UC Val. Loss and HellaSwag Test Loss vs. FW-Edu Val. Loss; panels by architecture and tokenizer: Llama/tiktoken, Llama/gpt2, Llama/gpt2-HF …]
Figure 23: Pretraining data has a substantial impact on loss-to-loss scaling laws.
Figure 24: Pretraining data has a substantial impact on loss-to-loss scaling laws.
[Plot text: The Pile UC Val. BPB and HellaSwag Test BPB vs. FW-Edu Val. BPB; panels by architecture and pretraining data: Llama/C4, Llama/FW-Edu, Llama/Pile UC …]
Figure 25: The tokenizer has a minor impact on loss-to-loss scaling laws.
Figure 26: The tokenizer has a minor impact on loss-to-loss scaling laws.
[Plot text: The Pile UC Val. Loss and HellaSwag Test Loss vs. FW-Edu Val. Loss; panels by pretraining data and tokenizer: C4/gpt2, C4/tiktoken, FW-Edu/gpt2 …]
Figure 27: Architecture has limited impact on loss-to-loss scaling laws.
Figure 28: Architecture has limited impact on loss-to-loss scaling laws.
[Plot text: Average Test Loss vs. C4 Validation Loss; pretraining data C4, FineWeb-Edu, The Pile UC; scale: Size (1e8); architecture Llama | Mamba; tokenizer tiktoken]
Figure 29: Model size does not affect train-to-test scaling.
Figure 30: Model size does not affect train-to-test scaling.
[Plot text: Average Test Loss vs. C4 Validation Loss; pretraining data C4, FineWeb-Edu, The Pile UC; scale: Context Length (1500 to 3000); architecture Llama | Mamba; tokenizer tiktoken]
Figure 31: Context length does not affect train-to-test scaling.
Figure 32: Context length does not affect train-to-test scaling.
[Plot text: Average Test Loss vs. C4 Validation Loss; legend (Optimizer | LR | Weight Decay | Scheduler): Adam | 3.0e-4 | 3.3e-2 | Cosine; Adam | 3.0e-4 | 1.0e-1 | Cosine; Adam | 3.0e-3 | 3.3e-2 | Cosine; Adam | 3.0e-3 | 1.0e-1 | Cosine; Adam | 3.0e-4 | 3.3e-2 | WSD; Adam | 3.0e-4 | 1.0e-1 | WSD; Adam | 3.0e-3 | 3.3e-2 | WSD; Adam | 3.0e-3 | 1.0e-1 | WSD …]
Figure 33: Optimizer settings do not affect train-to-test scaling.
Figure 34: Optimizer settings do not affect train-to-test scaling.
Original abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates factors influencing loss-to-loss scaling laws across pretraining datasets and downstream tasks in LLMs. It claims that pretraining data is the dominant determinant of the scaling trends, while model size, optimization hyperparameters, tokenizer choice, and even major architectural differences (e.g., transformer-based Llama vs. state-space Mamba) have limited impact. The authors conclude that practitioners should prioritize data curation over other design choices for optimal downstream performance.

Significance. If the central empirical claim holds after proper controls, the result would meaningfully redirect LLM scaling research and practice toward data-centric approaches rather than architecture or hyperparameter search, with direct implications for training efficiency and generalization. The work builds on recent loss-to-loss scaling literature by attempting to isolate the dominant variable through comparative experiments.

major comments (2)
  1. [Experiments / Results] Experimental comparisons (likely §4 or §5): the claim that architecture has limited impact (including Llama vs. Mamba) requires Llama and Mamba models to be pretrained on identical data distributions; if the runs used different corpora, observed scaling differences are confounded by data rather than architecture, directly undermining the isolation of data as the sole determinant.
  2. [Methods] Methods and experimental design (likely §3): the manuscript provides no details on controls for data overlap, statistical significance testing, data exclusion criteria, or variance across runs, leaving the support for the central claim that 'data determines the scaling trend' difficult to evaluate and potentially non-generalizable.
minor comments (2)
  1. [Introduction] Notation for loss-to-loss relations could be clarified with an explicit equation early in the paper to avoid ambiguity when comparing across sections.
  2. [Figures] Figure captions should explicitly state the exact model/data pairs shown to allow readers to verify the architecture-controlled comparisons.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, agreeing where controls or details are needed and committing to revisions that strengthen the manuscript without altering its central claims.

Point-by-point responses
  1. Referee: [Experiments / Results] Experimental comparisons (likely §4 or §5): the claim that architecture has limited impact (including Llama vs. Mamba) requires Llama and Mamba models to be pretrained on identical data distributions; if the runs used different corpora, observed scaling differences are confounded by data rather than architecture, directly undermining the isolation of data as the sole determinant.

    Authors: We agree that identical pretraining data distributions are required to isolate architecture. In our experiments, Llama and Mamba models were pretrained on the same data distributions precisely to avoid this confound; the observed similarity in loss-to-loss scaling therefore indicates that architecture has limited impact, rather than reflecting a difference in data. We will revise §4 to explicitly state this control, describe the data-matching procedure, and include supporting details on the shared corpora. revision: yes

  2. Referee: [Methods] Methods and experimental design (likely §3): the manuscript provides no details on controls for data overlap, statistical significance testing, data exclusion criteria, or variance across runs, leaving the support for the central claim that 'data determines the scaling trend' difficult to evaluate and potentially non-generalizable.

    Authors: We acknowledge the absence of these methodological details in the current version. The revised manuscript will add a dedicated subsection in §3 covering: (i) controls and checks for data overlap between pretraining and downstream sets, (ii) the statistical significance tests applied to scaling trends, (iii) explicit data exclusion criteria, and (iv) reported variance or standard errors across multiple independent runs. These additions will make the support for the data-dominance claim fully evaluable. revision: yes
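
The variance reporting promised in response 2 could take several forms. The sketch below shows one common option, a percentile bootstrap over checkpoints for the fitted exponent; it is an illustrative stand-in, not the authors' procedure. The function names, the synthetic losses, the noise level, and the irreducible losses `E_X` and `E_Y` are assumptions, and resampling checkpoints treats them as independent, which checkpoints from a single training run are not.

```python
import math
import random


def loglog_slope(u, v):
    """Least-squares slope of v on u (both already in log space)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sxx = sum((a - mean_u) ** 2 for a in u)
    return sxy / sxx


def bootstrap_slope_interval(x_loss, y_loss, e_x, e_y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the exponent kappa.

    Checkpoints are resampled with replacement. Resamples that contain a
    single distinct checkpoint are skipped because the slope is undefined.
    Assumes the pretraining losses in x_loss are pairwise distinct.
    """
    u = [math.log(x - e_x) for x in x_loss]
    v = [math.log(y - e_y) for y in y_loss]
    rng = random.Random(seed)
    n = len(u)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len(set(idx)) < 2:
            continue
        slopes.append(loglog_slope([u[i] for i in idx], [v[i] for i in idx]))
    slopes.sort()
    tail = (1.0 - level) / 2.0
    lo = slopes[int(tail * (len(slopes) - 1))]
    hi = slopes[int((1.0 - tail) * (len(slopes) - 1))]
    return lo, hi


# Hypothetical data: a shifted power law with small multiplicative noise.
E_X, E_Y = 1.8, 2.5
x_loss = [3.0, 3.2, 3.5, 3.8, 4.1, 4.5, 4.8, 5.2]
noise = random.Random(1)
y_loss = [1.4 * (x - E_X) ** 0.9 * math.exp(noise.gauss(0.0, 0.02)) + E_Y for x in x_loss]

lo, hi = bootstrap_slope_interval(x_loss, y_loss, E_X, E_Y)
print(f"95% interval for kappa: [{lo:.3f}, {hi:.3f}]")
```

Two fitted lines whose intervals overlap heavily would be hard to distinguish statistically, which is the kind of evidence the referee asks for when comparing interventions.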

Circularity Check

0 steps flagged

Empirical comparisons with no circular derivation chain

Full rationale

The paper reports experimental results on loss-to-loss scaling across varied pretraining data, model sizes, optimizers, tokenizers, and architectures (Llama vs. Mamba). No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. The central claim rests on direct empirical contrasts rather than self-definitional relations, renamed known results, or load-bearing self-citations. The claim can be checked independently by replicating the reported training runs and loss measurements on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and does not introduce new mathematical axioms or entities; any scaling parameters would be fitted but not central to the claim.

pith-pipeline@v0.9.0 · 5681 in / 1129 out tokens · 33996 ms · 2026-05-23T02:45:31.633552+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 24 internal anchors

  1. [1]

    Exploring the landscape of distributional robustness for question answering models

    Awadalla, A., Wortsman, M., Ilharco, G., Min, S., Magnusson, I., Hajishirzi, H., and Schmidt, L. Exploring the landscape of distributional robustness for question answering models, 2022. URL https://arxiv.org/abs/2210.12517

  2. [2]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

  3. [3]

    PIQA: Reasoning about Physical Commonsense in Natural Language

    Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641

  4. [4]

    Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow

    Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow. 2021. URL https://api.semanticscholar.org/CorpusID:245758737

  5. [5]

    GPT-NeoX-20B: An Open-Source Autoregressive Language Model

    Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745

  6. [6]

    Loss-to-loss prediction: Scaling laws for all datasets

    Brandfonbrener, D., Anand, N., Vyas, N., Malach, E., and Kakade, S. Loss-to-loss prediction: Scaling laws for all datasets, 2024. URL https://arxiv.org/abs/2411.12925

  7. [7]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

  8. [8]

    Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    Dao, T. and Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality, 2024. URL https://arxiv.org/abs/2405.21060

  9. [9]

    Documenting large webtext corpora: A case study on the colossal clean crawled corpus

    Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021. URL https://arxiv.org/abs/2104.08758

  10. [10]

    Understanding emergent abilities of language models from the loss perspective

    Du, Z., Zeng, A., Dong, Y., and Tang, J. Understanding emergent abilities of language models from the loss perspective, 2025. URL https://arxiv.org/abs/2403.15796

  11. [11]

    Data determines distributional robustness in contrastive language image pre-training (clip)

    Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L. Data determines distributional robustness in contrastive language image pre-training (clip), 2022. URL https://arxiv.org/abs/2205.01397

  12. [12]

    Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., Xin, R., Nezhurina, M., Vasiljevic, I., Jitsev, J., Soldaini, L., Dimakis, A. G., Ilharco, G., Koh, P. W., Song, S., Kollar, T., Carmon, Y., Dave, A., Heckel, R., Muennighoff, N., and Schmidt, L. Language models scale reliably with over-t...

  13. [13]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027

  14. [14]

    A framework for few-shot language model evaluation

    Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Noac’h, A. L., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL http...

  15. [15]

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning

    Gordon, A. S., Kozareva, Z., and Roemmele, M. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011. URL https://api.semanticscholar.org/CorpusID:434646

  16. [16]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  17. [17]

    OLMo: Accelerating the science of language models

    Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Sha...

  18. [18]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752

  19. [19]

    Measuring Massive Multitask Language Understanding

    Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

  20. [20]

    Deep learning scaling is predictable, empirically

    Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically, 2017. URL https://arxiv.org/abs/1712.00409

  21. [21]

    Training Compute-Optimal Large Language Models

    Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. UR...

  22. [22]

    MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

    Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. Minicpm: Unveiling the potential of small language models with scalable training strategies, 2024. URL https://a...

  23. [23]

    Scaling laws for downstream task performance of large language models

    Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models, 2024. URL https://arxiv.org/abs/2402.04177

  24. [24]

    Scaling Laws for Neural Language Models

    Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

  25. [25]

    nanoGPT

    Karpathy, A. nanogpt, 2022. URL https://github.com/karpathy/nanoGPT

  26. [26]

    Adam: A method for stochastic optimization

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

  27. [27]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts, 2017. URL https://arxiv.org/abs/1608.03983

  28. [28]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

  29. [29]

    Quantifying variance in evaluation benchmarks

    Madaan, L., Singh, A. K., Schaeffer, R., Poulton, A., Koyejo, S., Stenetorp, P., Narang, S., and Hupkes, D. Quantifying variance in evaluation benchmarks, 2024. URL https://arxiv.org/abs/2406.10229

  30. [30]

    Does clip's generalization performance mainly stem from high train-test similarity?

    Mayilvahanan, P., Wiedemer, T., Rusak, E., Bethge, M., and Brendel, W. Does clip's generalization performance mainly stem from high train-test similarity?, 2024a. URL https://arxiv.org/abs/2310.09562

  31. [31]

    In search of forgotten domain generalization

    Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization, 2024b. URL https://arxiv.org/abs/2410.08258

  32. [32]

    Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization

    Miller, J., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization, 2021. URL https://arxiv.org/abs/2107.04649

  33. [33]

    The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023. URL https://arxiv.org/abs/2306.01116

  34. [34]

    The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

    Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The fineweb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557

  35. [35]

    Resolving discrepancies in compute-optimal scaling of language models

    Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., and Carmon, Y. Resolving discrepancies in compute-optimal scaling of language models, 2025. URL https://arxiv.org/abs/2406.19146

  36. [36]

    Language models are unsupervised multitask learners

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533

  37. [37]

    On linear identifiability of learned representations

    Roeder, G., Metz, L., and Kingma, D. P. On linear identifiability of learned representations, 2020. URL https://arxiv.org/abs/2007.00810

  38. [38]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.10641

  39. [39]

    SocialIQA: Commonsense Reasoning about Social Interactions

    Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. Socialiqa: Commonsense reasoning about social interactions, 2019. URL https://arxiv.org/abs/1904.09728

  40. [40]

    On the inductive bias of stacking towards improving reasoning

    Saunshi, N., Karp, S., Krishnan, S., Miryoosefi, S., Reddi, S. J., and Kumar, S. On the inductive bias of stacking towards improving reasoning, 2024. URL https://arxiv.org/abs/2409.19044

  41. [41]

    Why has predicting downstream capabilities of frontier AI models with scale remained elusive?

    Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S. Why has predicting downstream capabilities of frontier ai models with scale remained elusive?, 2024. URL https://arxiv.org/abs/2406.04391

  42. [42]

    SlimPajama-DC: Understanding data combinations for LLM training

    Shen, Z., Tao, T., Ma, L., Neiswanger, W., Liu, Z., Wang, H., Tan, B., Hestness, J., Vassilieva, N., Soboleva, D., and Xing, E. Slimpajama-dc: Understanding data combinations for llm training, 2024. URL https://arxiv.org/abs/2309.10818

  43. [43]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

  44. [44]

    CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

    Talmor, A., Herzig, J., Lourie, N., and Berant, J. Commonsenseqa: A question answering challenge targeting commonsense knowledge, 2019. URL https://arxiv.org/abs/1811.00937

  45. [45]

    Measuring robustness to natural distribution shifts in image classification

    Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification, 2020. URL https://arxiv.org/abs/2007.00644

  46. [46]

    Scaling laws vs model architectures: How does inductive bias influence scaling?

    Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., and Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022. URL https://arxiv.org/abs/2207.10551

  47. [47]

    Meta Lingua: A minimal PyTorch LLM training library

    Videau, M., Idrissi, B. Y., Haziza, D., Wehrstedt, L., Copet, J., Teytaud, O., and Lopez-Paz, D. Meta Lingua: A minimal PyTorch LLM training library, 2024. URL https://github.com/facebookresearch/lingua

  48. [48]

    Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold,...

  49. [49]

    GPT-J-6B: A 6 billion parameter autoregressive language model

    Wang, B. and Komatsuzaki, A. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021

  50. [50]

    Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models

    Wang, S., Chen, Z., Li, B., He, K., Zhang, M., and Wang, J. Scaling laws across model architectures: A comparative analysis of dense and moe models in large language models, 2024. URL https://arxiv.org/abs/2410.05661

  51. [51]

    Pretraining frequency predicts compositional generalization of CLIP on real-world tasks

    Wiedemer, T., Sharma, Y., Prabhu, A., Bethge, M., and Brendel, W. Pretraining frequency predicts compositional generalization of CLIP on real-world tasks. In NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward, 2024. URL https://openreview.net/forum?id=NDXoM1wYgl

  52. [52]

    HuggingFace's Transformers: State-of-the-art Natural Language Processing

    Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Huggingface's transformers: State-of-the-art natural language processing, 2020. URL https://arxi...

  53. [53]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830
