LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Matthias Bethge; Prasanna Mayilvahanan; Sayak Mallick; Thaddäus Wiedemer; Wieland Brendel

arxiv: 2502.12120 · v3 · pith:FSBVO47Ynew · submitted 2025-02-17 · 💻 cs.LG · cs.AI · cs.CL

Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel

Pith reviewed 2026-05-23 02:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords scaling laws · large language models · pretraining data · loss-to-loss scaling · model architecture · downstream performance · transformers · state-space models

The pith

Pretraining data determines loss-to-loss scaling trends in LLMs, overriding model size, architecture, and training choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests what controls the relationship between pretraining loss and downstream-task loss, as captured by loss-to-loss scaling laws. Experiments across multiple setups show that the specific pretraining data used fixes the scaling relationship. Model size, optimization details, tokenizers, and even large architectural shifts between transformers like Llama and state-space models like Mamba produce only minor changes when the data stays the same. A reader would care because this shifts priority toward dataset selection for predictable downstream gains rather than repeated architecture searches.

Core claim

Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact.

What carries the argument

Loss-to-loss scaling laws relating pretraining loss to downstream task performance, with pretraining data as the primary controlling factor.
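As a minimal sketch of what fitting such a law can look like, the snippet below assumes a shifted power-law form, L_y ≈ K · (L_x − E_x)^κ + E_y, and treats the irreducible losses E_x and E_y as already known, so the fit reduces to a straight line in log-log space. The functional form, the function name `fit_loss_to_loss`, and the synthetic numbers are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np


def fit_loss_to_loss(loss_x, loss_y, e_x, e_y):
    """Fit loss_y ~ K * (loss_x - e_x)**kappa + e_y by log-log least squares.

    e_x and e_y are assumed irreducible losses (hypothetical inputs here);
    every loss must lie strictly above its irreducible value.
    """
    loss_x = np.asarray(loss_x, dtype=float)
    loss_y = np.asarray(loss_y, dtype=float)
    log_x = np.log(loss_x - e_x)
    log_y = np.log(loss_y - e_y)
    # np.polyfit returns [slope, intercept] for a degree-1 fit.
    kappa, log_k = np.polyfit(log_x, log_y, 1)
    return float(np.exp(log_k)), float(kappa)


# Synthetic checkpoints generated from a known law (made-up numbers).
true_k, true_kappa, e_x, e_y = 1.3, 0.8, 2.0, 2.5
loss_x = np.linspace(3.0, 6.0, 20)
loss_y = true_k * (loss_x - e_x) ** true_kappa + e_y

k_hat, kappa_hat = fit_loss_to_loss(loss_x, loss_y, e_x, e_y)
print(f"K = {k_hat:.3f}, kappa = {kappa_hat:.3f}")
```

On noise-free synthetic data the fit recovers the generating constants; with real checkpoints the irreducible losses would themselves have to be estimated first.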

If this is right

Practitioners should prioritize curating pretraining datasets to achieve desired downstream scaling behavior.
Model architectures and optimization settings can be chosen mainly for training speed and cost without changing the expected loss-to-loss relationship.
Scaling predictions for new tasks can be based primarily on the pretraining data used rather than the specific model details.
Different model families will exhibit similar scaling behavior when trained on the same data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Data curation may become the main lever for controlling generalization patterns across many future model designs.
The result suggests testing whether particular data properties, such as domain coverage or token statistics, are what actually fix the scaling line.
If data dominates, then methods that alter effective training data during pretraining could be used to steer downstream scaling without retraining from scratch.

Load-bearing premise

The range of models, data sources, and tasks tested is representative enough for the data-dominance conclusion to apply beyond these specific cases.

What would settle it

A clear case where two models with different architectures or sizes, trained on identical pretraining data, produce substantially different loss-to-loss scaling slopes on the same downstream tasks.
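One way to operationalize this test, sketched under stated assumptions: fit a straight line to each model family's (pretraining loss, downstream loss) points in log-log space and compare the slopes. The threshold `rel_tol`, the function names, and the toy numbers are hypothetical, and for simplicity the sketch ignores any irreducible-loss offsets.

```python
import numpy as np


def loglog_slope(loss_x, loss_y):
    """Slope of log(loss_y) against log(loss_x) via least squares."""
    slope, _ = np.polyfit(np.log(loss_x), np.log(loss_y), 1)
    return float(slope)


def slopes_differ(family_a, family_b, rel_tol=0.1):
    """Return True if the two families' slopes differ by more than rel_tol.

    family_a and family_b are (loss_x, loss_y) pairs of arrays. rel_tol is
    an arbitrary illustrative threshold, not a value from the paper.
    """
    s_a = loglog_slope(*family_a)
    s_b = loglog_slope(*family_b)
    return abs(s_a - s_b) > rel_tol * max(abs(s_a), abs(s_b))


# Toy data: two architectures on the same trend, one on a steeper trend.
x = np.linspace(3.0, 5.0, 10)
arch_1 = (x, 1.2 * x ** 0.9)
arch_2 = (x, 1.2 * x ** 0.9)   # same trend as arch_1
arch_3 = (x, 0.7 * x ** 1.4)   # different trend

print(slopes_differ(arch_1, arch_2))
print(slopes_differ(arch_1, arch_3))
```

A real comparison would also need an uncertainty estimate for each slope, since a fixed tolerance cannot distinguish a genuine difference from run-to-run noise.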

Figures

Figures reproduced from arXiv: 2502.12120 by Matthias Bethge, Prasanna Mayilvahanan, Sayak Mallick, Thaddäus Wiedemer, Wieland Brendel.

**Figure 1.** LLMs’ loss-to-loss scaling follows power laws primarily shaped by the choice of pretraining data. Using Llama trained on FineWeb-Edu as a baseline, we intervene on various factors to assess their impact on train-to-test loss scaling. Changing the pretraining data has the largest effect. Changing the tokenizer, the architecture (e.g., from Llama to Mamba), model size, context length, and optimizer setting…

**Figure 2.** Loss-to-loss scaling consistently obeys power laws. We extend results from Brandfonbrener et al. (2024) to many architectures, training settings, and validation/test sets. We show illustrative shifted power laws for Mamba trained on FineWeb-Edu here; more configurations and test sets can be found in App. E. For clarity, scatter plots display a random sample of all data points; all points are used to fit th…

**Figure 3.** Schematic of our causal analysis. Checkpoints of a base model trained on different numbers of tokens and with different seeds lie on the same loss-to-loss line. Better-performing models (typically with higher compute) achieve lower loss (towards the bottom left). We intervene on training settings (e.g., pretraining data, architecture, etc.) and retrain from scratch, yielding new models that again consti…

**Figure 4.** Pretraining data has a substantial impact on loss-to-loss scaling laws. Models are matched on architecture and tokenizer.

Tokenizers. We train Llama and Mamba with either a tiktoken tokenizer (128k vocabulary size) or the gpt2 tokenizer (50,257 vocabulary size). Pretrained models from Hugging Face use an almost identical GPT-2 tokenizer, dubbed gpt2-HF. This version does not explicitly pad text with beginn…

**Figure 5.** The tokenizer has a minor impact on loss-to-loss scaling laws. Models are matched on pretraining data and architecture.

… models) and Mamba (a state-space model). These results raise an important question: Do current architectures encode distinct inductive biases or converge to similar solutions given the same training data? Further research is needed to understand the implications of this finding. Takeaway …

**Figure 6.** Architecture has limited impact on loss-to-loss scaling laws. Models are matched on pretraining data and tokenizer. [Plot labels: FineWeb-Edu Validation Loss; Average Val. Loss; Pretraining Data: C4, FineWeb-Edu, The Pile UC; Size (1e8); Architecture | Tokenizer: Llama | Mamba, tiktoken]

**Figure 8.** Context length does not affect loss-to-loss scaling. Again, distinct lines correspond to different pretraining distributions (compare …

**Figure 9.** Optimization settings do not affect loss-to-loss scaling.

Implications for Balancing Performance. If the aim is not only optimal average downstream performance but also a specific weighting between different tasks, e.g., to ensure a balanced downstream performance, individual train-to-test scaling laws can be used to tune a model’s performance. Here, too, the pretraining data has the largest impact and prac…

**Figure 10.** Example compute-to-loss scaling law fits. Each loss-to-loss scaling law requires fitting two compute-to-loss scaling laws to estimate $E_{x|p}$, $E_{y|p}$. The three fits here are used for The Pile UC and HellaSwag curves in …

**Figure 11.** Loss-to-Loss Scaling for FineWeb-Edu-trained Llama.

**Figure 12.** Loss-to-Loss Scaling for C4-trained Llama. [Plot labels: FineWeb-Edu Validation Loss; Validation Loss; Test Loss; panels Train-to-Train and Train-to-Test; Validation Set: The Pile UC, RefineWeb, Slimpajama, C4; Test Set: ARC-Challenge, ARC-Easy, COPA, PIQA, Winogrande, HellaSwag, CommonSenseQA, Social IQa, MMLU]

**Figure 13.** Loss-to-Loss Scaling for The Pile-trained Llama.

**Figure 14.** Loss-to-Loss Scaling for FineWeb-Edu-trained Mamba. [Plot labels: C4 Validation Loss; Validation Loss; Test Loss; panels Train-to-Train and Train-to-Test; Validation Set: The Pile UC, RefineWeb, Slimpajama, FineWeb-Edu; Test Set: ARC-Challenge, ARC-Easy, COPA, PIQA, Winogrande, HellaSwag, CommonSenseQA, Social IQa, MMLU]

**Figure 15.** Loss-to-Loss Scaling for C4-trained Mamba.

**Figure 16.** Loss-to-Loss Scaling for The Pile-trained Mamba. [Plot labels: Average Val. Loss; C4 Val. Loss; Average Test Loss; Architecture / Tokenizer: Llama tiktoken, Llama gpt2, Llama gpt2-HF, Mamba tiktoken, Mamba gpt2 …]

**Figure 17.** Pretraining data has a substantial impact on loss-to-loss scaling laws.

**Figure 18.** The tokenizer has a minor impact on loss-to-loss scaling laws. [Plot labels: Average Val. Loss; C4 Val. Loss; Average Test Loss; Pretraining / Tokenizer: C4 gpt2, C4 tiktoken, FW-Edu gpt2, FW-Edu tiktoken, Pretrainin…]

**Figure 19.** Architecture has limited impact on loss-to-loss scaling laws.

**Figure 20.** Model size does not affect train-to-test scaling. [Plot labels: C4 Validation Loss; Average Val. Loss; Pretraining Data: C4, FineWeb-Edu, The Pile UC; Context Length; Architecture | Tokenizer: Llama | Mamba, tiktoken]

**Figure 21.** Context length does not affect train-to-test scaling.

**Figure 22.** Optimizer settings do not affect train-to-test scaling. [Plot labels: The Pile UC Val. Loss; FW-Edu Val. Loss; HellaSwag Test Loss; Architecture / Tokenizer: Llama tiktoken, Llama gpt2, Llama gpt2-HF, Architecture Tokeni…]

**Figure 23.** Pretraining data has a substantial impact on loss-to-loss scaling laws.

**Figure 24.** Pretraining data has a substantial impact on loss-to-loss scaling laws. [Plot labels: The Pile UC Val. BPB; FW-Edu Val. BPB; HellaSwag Test BPB; Architecture / Pretraining: Llama C4, Llama FW-Edu, Llama Pile UC, Architect…]

**Figure 25.** The tokenizer has a minor impact on loss-to-loss scaling laws.

**Figure 26.** The tokenizer has a minor impact on loss-to-loss scaling laws. [Plot labels: The Pile UC Val. Loss; FW-Edu Val. Loss; HellaSwag Test Loss; Pretraining / Tokenizer: C4 gpt2, C4 tiktoken, FW-Edu gpt2, Pretrainin…]

**Figure 27.** Architecture has limited impact on loss-to-loss scaling laws.

**Figure 28.** Architecture has limited impact on loss-to-loss scaling laws. [Plot labels: C4 Validation Loss; Average Test Loss; Pretraining Data: C4, FineWeb-Edu, The Pile UC; Size (1e8); Architecture | Tokenizer: Llama | Mamba, tiktoken]

**Figure 29.** Model size does not affect train-to-test scaling.

**Figure 30.** Model size does not affect train-to-test scaling. [Plot labels: C4 Validation Loss; Average Test Loss; Pretraining Data: C4, FineWeb-Edu, The Pile UC; Context Length; Architecture | Tokenizer: Llama | Mamba, tiktoken]

**Figure 31.** Context length does not affect train-to-test scaling.

**Figure 32.** Context length does not affect train-to-test scaling. [Plot labels: C4 Validation Loss; Average Test Loss; legend Optimizer | LR | Weight Decay | Scheduler: Adam | 3.0e-4 | 3.3e-2 | Cosine; Adam | 3.0e-4 | 1.0e-1 | Cosine; Adam | 3.0e-3 | 3.3e-2 | Cosine; Adam | 3.0e-3 | 1.0e-1 | Cosine; Adam | 3.0e-4 | 3.3e-2 | WSD; Adam | 3.0e-4 | 1.0e-1 | WSD; Adam | 3.0e-3 | 3.3e-2 | WSD; Adam | 3.0e-3 | 1.0e-1 | WSD …]

**Figure 33.** Optimizer settings do not affect train-to-test scaling.

**Figure 34.** Optimizer settings do not affect train-to-test scaling.
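The Figure 10 caption notes that each loss-to-loss law relies on two compute-to-loss fits. A minimal sketch of one such fit follows, assuming a saturating power law L(C) = E + A · C^(−α) in which E plays the role of the irreducible loss. The functional form, the grid-search strategy, the function name `fit_compute_to_loss`, and all constants are assumptions made for illustration; the paper's actual fitting procedure may differ.

```python
import numpy as np


def fit_compute_to_loss(compute, loss, n_grid=2000):
    """Fit loss ~ E + A * compute**(-alpha) by scanning candidate E values.

    For each candidate irreducible loss E below min(loss), the remaining
    parameters follow from a linear fit of log(loss - E) on log(compute).
    Returns (E, A, alpha) for the candidate with the smallest squared error.
    """
    compute = np.asarray(compute, dtype=float)
    loss = np.asarray(loss, dtype=float)
    log_c = np.log(compute)
    best = None
    # Candidates stay strictly below the smallest observed loss.
    for e in np.linspace(0.0, loss.min() - 1e-6, n_grid):
        slope, intercept = np.polyfit(log_c, np.log(loss - e), 1)
        pred = e + np.exp(intercept) * compute ** slope
        err = float(np.sum((pred - loss) ** 2))
        if best is None or err < best[0]:
            best = (err, float(e), float(np.exp(intercept)), float(-slope))
    return best[1:]


# Synthetic run: made-up constants, not values from the paper.
compute = np.logspace(17, 20, 12)  # FLOPs
loss = 2.0 + 400.0 * compute ** (-0.12)

e_hat, a_hat, alpha_hat = fit_compute_to_loss(compute, loss)
print(f"E = {e_hat:.2f}, A = {a_hat:.1f}, alpha = {alpha_hat:.3f}")
```

The grid scan keeps the example dependency-light; a nonlinear least-squares solver would be the more usual choice when fitting noisy measurements.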

Original abstract

Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate suitable pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Data sets the loss-to-loss scaling trend while architecture and size matter less, but the Llama-Mamba comparison needs same-data runs to be convincing.

Colleague, the main result is that pretraining data determines how losses relate across tasks and models, while size, optimizer settings, tokenizer, and even architecture shifts like Llama versus Mamba show limited effect on the scaling line. The experiments test this by varying one factor at a time and tracking the resulting loss-to-loss relationships. That isolation is the useful part: it gives practitioners a reason to focus curation effort on data rather than endlessly tweaking model details.

The paper builds directly on recent loss-to-loss work without obvious circular definitions or fitted-parameter artifacts. The citation pattern looks standard and points back to the relevant scaling-law papers.

The soft spot is the architecture claim. Saying Llama and Mamba have similar scaling requires the two families to have been trained on identical data distributions; otherwise the comparison mixes data effects with architecture effects. The abstract does not detail the exact corpora used for each, so that control needs verification in the methods section. If the runs were on different data, the evidence for architecture having limited impact weakens.

This is aimed at groups working on scaling laws and data selection. A reader who already follows loss-to-loss papers will find the data-dominance result worth checking. It is solid enough on its own terms to deserve referee time, even if the architecture section needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates factors influencing loss-to-loss scaling laws across pretraining datasets and downstream tasks in LLMs. It claims that pretraining data is the dominant determinant of the scaling trends, while model size, optimization hyperparameters, tokenizer choice, and even major architectural differences (e.g., transformer-based Llama vs. state-space Mamba) have limited impact. The authors conclude that practitioners should prioritize data curation over other design choices for optimal downstream performance.

Significance. If the central empirical claim holds after proper controls, the result would meaningfully redirect LLM scaling research and practice toward data-centric approaches rather than architecture or hyperparameter search, with direct implications for training efficiency and generalization. The work builds on recent loss-to-loss scaling literature by attempting to isolate the dominant variable through comparative experiments.

major comments (2)

[Experiments / Results] Experimental comparisons (likely §4 or §5): the claim that architecture has limited impact (including Llama vs. Mamba) requires Llama and Mamba models to be pretrained on identical data distributions; if the runs used different corpora, observed scaling differences are confounded by data rather than architecture, directly undermining the isolation of data as the sole determinant.
[Methods] Methods and experimental design (likely §3): the manuscript provides no details on controls for data overlap, statistical significance testing, data exclusion criteria, or variance across runs, leaving the support for the central claim that 'data determines the scaling trend' difficult to evaluate and potentially non-generalizable.

minor comments (2)

[Introduction] Notation for loss-to-loss relations could be clarified with an explicit equation early in the paper to avoid ambiguity when comparing across sections.
[Figures] Figure captions should explicitly state the exact model/data pairs shown to allow readers to verify the architecture-controlled comparisons.
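For concreteness, an explicit equation of the kind requested in the first minor comment could take the shifted power-law form common in the loss-to-loss literature. The form and symbol choices below are an assumption for illustration, not a quotation from the manuscript: $L_x$ and $L_y$ denote the losses of a model $f_p$ (pretrained on data $p$) on two evaluation sets, $K$ and $\kappa$ are fitted constants, and $E_{x|p}$, $E_{y|p}$ reuse the symbols from the Figure 10 caption, read here as the irreducible losses estimated from compute-to-loss fits.

```latex
L_y(f_p) \approx K \cdot \bigl( L_x(f_p) - E_{x|p} \bigr)^{\kappa} + E_{y|p}
```

Under this reading, an intervention "changes the scaling law" when it changes $K$, $\kappa$, or the offsets, which is the comparison the figure captions would need to make verifiable.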

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, agreeing where controls or details are needed and committing to revisions that strengthen the manuscript without altering its central claims.

Referee: [Experiments / Results] Experimental comparisons (likely §4 or §5): the claim that architecture has limited impact (including Llama vs. Mamba) requires Llama and Mamba models to be pretrained on identical data distributions; if the runs used different corpora, observed scaling differences are confounded by data rather than architecture, directly undermining the isolation of data as the sole determinant.

Authors: We agree that identical pretraining data distributions are required to isolate architecture. In our experiments, Llama and Mamba models were pretrained on the same data distributions precisely to avoid this confound; the observed similarity in loss-to-loss scaling therefore reflects the limited effect of architecture rather than a difference in data. We will revise §4 to state this control explicitly, describe the data-matching procedure, and include supporting details on the shared corpora.

revision: yes

Referee: [Methods] Methods and experimental design (likely §3): the manuscript provides no details on controls for data overlap, statistical significance testing, data exclusion criteria, or variance across runs, leaving the support for the central claim that 'data determines the scaling trend' difficult to evaluate and potentially non-generalizable.

Authors: We acknowledge the absence of these methodological details in the current version. The revised manuscript will add a dedicated subsection in §3 covering: (i) controls and checks for data overlap between pretraining and downstream sets, (ii) the statistical significance tests applied to scaling trends, (iii) explicit data exclusion criteria, and (iv) reported variance or standard errors across multiple independent runs. These additions will make the support for the data-dominance claim fully evaluable.

revision: yes
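To illustrate what reporting uncertainty on a scaling trend could involve, the sketch below computes a percentile bootstrap interval for a log-log slope by resampling checkpoints. It is one possible approach under stated assumptions, not the procedure the authors describe: the function name `bootstrap_slope_ci`, the defaults `n_boot`, `alpha`, and `seed`, and the toy data are all hypothetical, and irreducible-loss offsets are ignored for brevity.

```python
import numpy as np


def bootstrap_slope_ci(loss_x, loss_y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the log-log slope.

    Resamples checkpoints with replacement; n_boot, alpha, and seed are
    illustrative defaults, not settings reported by the authors.
    """
    rng = np.random.default_rng(seed)
    log_x = np.log(np.asarray(loss_x, dtype=float))
    log_y = np.log(np.asarray(loss_y, dtype=float))
    n = len(log_x)
    slopes = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # indices drawn with replacement
        slopes[i] = np.polyfit(log_x[idx], log_y[idx], 1)[0]
    lo, hi = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Toy checkpoints with small multiplicative noise (made-up numbers).
rng = np.random.default_rng(1)
x = np.linspace(3.0, 5.0, 40)
y = 1.2 * x ** 0.9 * np.exp(rng.normal(0.0, 0.01, size=x.size))

lo, hi = bootstrap_slope_ci(x, y)
print(f"95% bootstrap interval for the slope: [{lo:.3f}, {hi:.3f}]")
```

Intervals that overlap across architectures but separate across pretraining datasets would be the pattern consistent with the paper's claim.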

Circularity Check

0 steps flagged

Empirical comparisons with no circular derivation chain

The paper reports experimental results on loss-to-loss scaling across varied pretraining data, model sizes, optimizers, tokenizers, and architectures (Llama vs. Mamba). No equations, fitted parameters, or first-principles derivations are presented that reduce to their own inputs by construction. The central claim rests on direct empirical contrasts rather than self-definitional relations, renamed known results, or load-bearing self-citations. The findings can be checked independently by replicating the reported training runs and loss measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and does not introduce new mathematical axioms or entities; any scaling parameters would be fitted but not central to the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 24 internal anchors

[1]

Exploring the landscape of distributional robustness for question answering models, 2022

Awadalla, A., Wortsman, M., Ilharco, G., Min, S., Magnusson, I., Hajishirzi, H., and Schmidt, L. Exploring the landscape of distributional robustness for question answering models, 2022. URL https://arxiv.org/abs/2210.12517

work page arXiv 2022
[2]

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling, 2023. URL https://arxiv.org/abs/2304.01373

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

PIQA: Reasoning about Physical Commonsense in Natural Language

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language, 2019. URL https://arxiv.org/abs/1911.11641

work page internal anchor Pith review Pith/arXiv arXiv 2019
[4]

Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow

Black, S., Gao, L., Wang, P., Leahy, C., and Biderman, S. Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow. 2021. URL https://api.semanticscholar.org/CorpusID:245758737

work page 2021
[5]

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

Black, S., Biderman, S., Hallahan, E., Anthony, Q., Gao, L., Golding, L., He, H., Leahy, C., McDonell, K., Phang, J., Pieler, M., Prashanth, U. S., Purohit, S., Reynolds, L., Tow, J., Wang, B., and Weinbach, S. Gpt-neox-20b: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Loss-to-loss prediction: Scaling laws for all datasets, 2024

Brandfonbrener, D., Anand, N., Vyas, N., Malach, E., and Kakade, S. Loss-to-loss prediction: Scaling laws for all datasets, 2024. URL https://arxiv.org/abs/2411.12925

work page arXiv 2024
[7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL https://arxiv.org/abs/1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[8]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Dao, T. and Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality, 2024. URL https://arxiv.org/abs/2405.21060

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021

Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., and Gardner, M. Documenting large webtext corpora: A case study on the colossal clean crawled corpus, 2021. URL https://arxiv.org/abs/2104.08758

work page arXiv 2021
[10]

Understanding emergent abilities of language models from the loss perspective, 2025

Du, Z., Zeng, A., Dong, Y., and Tang, J. Understanding emergent abilities of language models from the loss perspective, 2025. URL https://arxiv.org/abs/2403.15796

work page arXiv 2025
[11]

Data determines distributional robustness in contrastive language image pre-training (clip), 2022

Fang, A., Ilharco, G., Wortsman, M., Wan, Y., Shankar, V., Dave, A., and Schmidt, L. Data determines distributional robustness in contrastive language image pre-training (clip), 2022. URL https://arxiv.org/abs/2205.01397

work page arXiv 2022
[12]

Gadre, S. Y., Smyrnis, G., Shankar, V., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., Xin, R., Nezhurina, M., Vasiljevic, I., Jitsev, J., Soldaini, L., Dimakis, A. G., Ilharco, G., Koh, P. W., Song, S., Kollar, T., Carmon, Y., Dave, A., Heckel, R., Muennighoff, N., and Schmidt, L. Language models scale reliably with over-t...

work page arXiv 2024
[13]

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The pile: An 800gb dataset of diverse text for language modeling, 2020. URL https://arxiv.org/abs/2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020
[14]

Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Noac’h, A. L., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. A framework for few-shot language model evaluation, 07 2024. URL http...

work page arXiv 2024
[15]

S., Kozareva, Z., and Roemmele, M

Gordon, A. S., Kozareva, Z., and Roemmele, M. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, 2011. URL https://api.semanticscholar.org/CorpusID:434646

work page 2011
[16]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

OLM o: Accelerating the science of language models

Groeneveld, D., Beltagy, I., Walsh, E., Bhagia, A., Kinney, R., Tafjord, O., Jha, A., Ivison, H., Magnusson, I., Wang, Y., Arora, S., Atkinson, D., Authur, R., Chandu, K., Cohan, A., Dumas, J., Elazar, Y., Gu, Y., Hessel, J., Khot, T., Merrill, W., Morrison, J., Muennighoff, N., Naik, A., Nam, C., Peters, M., Pyatkin, V., Ravichander, A., Schwenk, D., Sha...

work page doi:10.18653/v1/2024.acl-long.841 2024
[18]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces, 2024. URL https://arxiv.org/abs/2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Measuring Massive Multitask Language Understanding

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding, 2021. URL https://arxiv.org/abs/2009.03300

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically, 2017. URL https://arxiv.org/abs/1712.00409

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Training Compute-Optimal Large Language Models

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models, 2022. UR...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Hu, S., Tu, Y., Han, X., He, C., Cui, G., Long, X., Zheng, Z., Fang, Y., Huang, Y., Zhao, W., Zhang, X., Thai, Z. L., Zhang, K., Wang, C., Yao, Y., Zhao, C., Zhou, J., Cai, J., Zhai, Z., Ding, N., Jia, C., Zeng, G., Li, D., Liu, Z., and Sun, M. Minicpm: Unveiling the potential of small language models with scalable training strategies, 2024. URL https://a...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[23]

Scaling laws for downstream task performance of large language models, 2024

Isik, B., Ponomareva, N., Hazimeh, H., Paparas, D., Vassilvitskii, S., and Koyejo, S. Scaling laws for downstream task performance of large language models, 2024. URL https://arxiv.org/abs/2402.04177

work page arXiv 2024
[24]

Scaling Laws for Neural Language Models

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models, 2020. URL https://arxiv.org/abs/2001.08361

work page internal anchor Pith review Pith/arXiv arXiv 2020
[25]

nanogpt, 2022

Karpathy, A. nanogpt, 2022. URL https://github.com/karpathy/nanoGPT

work page 2022
[26]

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2017. URL https://arxiv.org/abs/1412.6980

work page internal anchor Pith review Pith/arXiv arXiv 2017
[27]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts, 2017. URL https://arxiv.org/abs/1608.03983

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization, 2019. URL https://arxiv.org/abs/1711.05101

work page internal anchor Pith review Pith/arXiv arXiv 2019
[29]

K., Schaeffer, R., Poulton, A., Koyejo, S., Stenetorp, P., Narang, S., and Hupkes, D

Madaan, L., Singh, A. K., Schaeffer, R., Poulton, A., Koyejo, S., Stenetorp, P., Narang, S., and Hupkes, D. Quantifying variance in evaluation benchmarks, 2024. URL https://arxiv.org/abs/2406.10229

work page arXiv 2024
[30]

Does clip's generalization performance mainly stem from high train-test similarity?, 2024 a

Mayilvahanan, P., Wiedemer, T., Rusak, E., Bethge, M., and Brendel, W. Does clip's generalization performance mainly stem from high train-test similarity?, 2024 a . URL https://arxiv.org/abs/2310.09562

[31]

In search of forgotten domain generalization, 2024b

Mayilvahanan, P., Zimmermann, R. S., Wiedemer, T., Rusak, E., Juhos, A., Bethge, M., and Brendel, W. In search of forgotten domain generalization, 2024b. URL https://arxiv.org/abs/2410.08258

[32]

Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization, 2021

Miller, J., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., Liang, P., Carmon, Y., and Schmidt, L. Accuracy on the line: On the strong correlation between out-of-distribution and in-distribution generalization, 2021. URL https://arxiv.org/abs/2107.04649

[33]

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only, 2023. URL https://arxiv.org/abs/2306.01116

[34]

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

Penedo, G., Kydlíček, H., Allal, L. B., Lozhkov, A., Mitchell, M., Raffel, C., Werra, L. V., and Wolf, T. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557

[35]

Resolving discrepancies in compute-optimal scaling of language models, 2025

Porian, T., Wortsman, M., Jitsev, J., Schmidt, L., and Carmon, Y. Resolving discrepancies in compute-optimal scaling of language models, 2025. URL https://arxiv.org/abs/2406.19146

[36]

Language models are unsupervised multitask learners

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019. URL https://api.semanticscholar.org/CorpusID:160025533

[37]

Roeder, G., Metz, L., and Kingma, D. P. On linear identifiability of learned representations, 2020. URL https://arxiv.org/abs/2007.00810

[38]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. WinoGrande: An adversarial Winograd schema challenge at scale, 2019. URL https://arxiv.org/abs/1907.10641

[39]

SocialIQA: Commonsense Reasoning about Social Interactions

Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. SocialIQA: Commonsense reasoning about social interactions, 2019. URL https://arxiv.org/abs/1904.09728

[40]

On the inductive bias of stacking towards improving reasoning, 2024

Saunshi, N., Karp, S., Krishnan, S., Miryoosefi, S., Reddi, S. J., and Kumar, S. On the inductive bias of stacking towards improving reasoning, 2024. URL https://arxiv.org/abs/2409.19044

[41]

Why has predicting downstream capabilities of frontier AI models with scale remained elusive?, 2024

Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S. Why has predicting downstream capabilities of frontier AI models with scale remained elusive?, 2024. URL https://arxiv.org/abs/2406.04391

[42]

SlimPajama-DC: Understanding data combinations for LLM training

Shen, Z., Tao, T., Ma, L., Neiswanger, W., Liu, Z., Wang, H., Tan, B., Hestness, J., Vassilieva, N., Soboleva, D., and Xing, E. SlimPajama-DC: Understanding data combinations for LLM training, 2024. URL https://arxiv.org/abs/2309.10818

[43]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

[44]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Talmor, A., Herzig, J., Lourie, N., and Berant, J. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019. URL https://arxiv.org/abs/1811.00937

[45]

Measuring robustness to natural distribution shifts in image classification, 2020

Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification, 2020. URL https://arxiv.org/abs/2007.00644

[46]

Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022

Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., Narang, S., Tran, V. Q., Yogatama, D., and Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling?, 2022. URL https://arxiv.org/abs/2207.10551

[47]

Meta Lingua: A minimal PyTorch LLM training library, 2024

Videau, M., Idrissi, B. Y., Haziza, D., Wehrstedt, L., Copet, J., Teytaud, O., and Lopez-Paz, D. Meta Lingua: A minimal PyTorch LLM training library, 2024. URL https://github.com/facebookresearch/lingua

[48]

SciPy 1.0: Fundamental algorithms for scientific computing in Python

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold,...

doi:10.1038/s41592-019-0686-2, 2020
[49]

GPT-J-6B: A 6 billion parameter autoregressive language model, 2021

Wang, B. and Komatsuzaki, A. GPT-J-6B: A 6 billion parameter autoregressive language model, 2021

[50]

Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models, 2024

Wang, S., Chen, Z., Li, B., He, K., Zhang, M., and Wang, J. Scaling laws across model architectures: A comparative analysis of dense and MoE models in large language models, 2024. URL https://arxiv.org/abs/2410.05661

[51]

Pretraining frequency predicts compositional generalization of CLIP on real-world tasks

Wiedemer, T., Sharma, Y., Prabhu, A., Bethge, M., and Brendel, W. Pretraining frequency predicts compositional generalization of CLIP on real-world tasks. In NeurIPS 2024 Workshop on Compositional Learning: Perspectives, Methods, and Paths Forward, 2024. URL https://openreview.net/forum?id=NDXoM1wYgl

[52]

HuggingFace's Transformers: State-of-the-art Natural Language Processing

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. HuggingFace's Transformers: State-of-the-art natural language processing, 2020. URL https://arxi...

[53]

HellaSwag: Can a Machine Really Finish Your Sentence?

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence?, 2019. URL https://arxiv.org/abs/1905.07830

