pith. sign in

arxiv: 2605.06546 · v2 · pith:HLJ5K5JNnew · submitted 2026-05-07 · 💻 cs.CL

Efficient Pre-Training with Token Superposition

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

classification 💻 cs.CL
keywords token superpositionpre-training efficiencylarge language modelsmulti-hot cross-entropydata throughputmixture of experts
0
0 comments X

The pith

Token-Superposition Training processes bags of tokens with a multi-hot loss in an initial phase then recovers with standard training to raise data throughput per FLOP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Token-Superposition Training as a two-phase procedure that first packs many contiguous tokens into a single bag and optimizes them jointly via multi-hot cross-entropy, then switches back to ordinary next-token prediction. This change is presented as a drop-in replacement that leaves the model architecture, optimizer, tokenizer, data mixture, and parallelism untouched. Experiments at 270 M, 600 M, 3 B, and 10 B scales show the method reaches lower loss and better downstream scores than a matched baseline while, when training is stopped at equal loss, delivering up to a 2.5-fold reduction in wall-clock pre-training time at the largest scale examined.

Core claim

Token-Superposition Training consists of a superposition phase that replaces single-token cross-entropy with a multi-hot objective over bags of contiguous tokens, followed by a recovery phase that restores ordinary single-token training; the combined schedule increases tokens processed per FLOP and, at equal final loss, shortens total pre-training time by up to 2.5 times on a 10 B mixture-of-experts model.

What carries the argument

The multi-hot cross-entropy objective applied to bags of contiguous tokens during the superposition phase, which allows one forward-backward pass to update the model on multiple tokens simultaneously.

Load-bearing premise

The recovery phase restores model quality after the superposition phase without needing enough extra steps to erase the throughput gains of the first phase.

What would settle it

Run a controlled comparison in which a TST schedule and a standard baseline are trained until both reach the same validation loss; if the TST run requires more total steps than the baseline, the claimed time saving disappears.

Figures

Figures reproduced from arXiv: 2605.06546 by Bowen Peng, Jeffrey Quesnelle, Th\'eo Gigant.

Figure 1
Figure 1. Figure 1: Loss curves during the pre-training of two Qwen3-like MoE models (10B-A1B) with view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between standard next token prediction, TST and a few methods that superfi view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between standard next token prediction, TST and a few methods that superfi [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Same constraints comparisons between baseline training and Token Superposition Train view at source ↗
Figure 4
Figure 4. Figure 4: Superposition results with respect to loss at varying superposition bag sizes and superpo view at source ↗
Figure 5
Figure 5. Figure 5: Downstream evals at varying superposition bag sizes and superposition step ratio view at source ↗
Figure 5
Figure 5. Figure 5: Downstream evals at varying superposition bag sizes and superposition step ratio [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Input and Output Superposition ablations, only the recovery phase (ii) is represented. view at source ↗
Figure 6
Figure 6. Figure 6: Input and Output Superposition ablations, only the recovery phase (ii) is represented. [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Learning rate sweeps at varying model sizes, with the optimal learning rate being used for view at source ↗
Figure 7
Figure 7. Figure 7: Learning rate sweeps at varying model sizes, with the optimal learning rate being used for [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison between superposition using uniform output loss and power law output loss at view at source ↗
Figure 8
Figure 8. Figure 8: Comparison between superposition using uniform output loss and power law output loss at [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Training loss after resuming from weighted superposition view at source ↗
Figure 9
Figure 9. Figure 9: Training loss after resuming from weighted superposition [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Mutual information between pairs of tokens in DCLM decays with distance following a view at source ↗
Figure 10
Figure 10. Figure 10: Mutual information between pairs of tokens in DCLM decays with distance following a [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Rows from top to bottom: Hellaswag and ARC-Easy downstream evals at varying super view at source ↗
Figure 11
Figure 11. Figure 11: Rows from top to bottom: Hellaswag and ARC-Easy downstream evals at varying super [PITH_FULL_IMAGE:figures/full_fig_p020_11.png] view at source ↗
read the original abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Token-Superposition Training (TST), a two-phase drop-in method for LLM pre-training. Phase (i) combines contiguous tokens into bags and optimizes a multi-hot cross-entropy objective to raise data throughput per FLOP; phase (ii) reverts to standard next-token training for recovery. Experiments on 270M–10B-scale models (including a 10B A1B MoE) report that TST yields lower loss and better downstream metrics than baseline, and under equal-loss budgets achieves up to 2.5× reduction in wall-clock pre-training time at the 10B scale.

Significance. A verified, architecture- and infrastructure-agnostic efficiency technique would be a practical contribution to lowering the cost of pre-training. The method’s simplicity and reported robustness across scales are positive features. However, the central efficiency claim rests on an empirical comparison whose supporting measurements are not yet reported in sufficient detail to allow independent verification of net savings.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Results): the 2.5× total pre-training time reduction at the 10B A1B scale under equal-loss settings is the central quantitative claim. The manuscript supplies neither the exact step or token budgets allocated to the superposition phase versus the recovery phase nor the loss curves that would demonstrate the sum of both phases is substantially smaller than the baseline step count to the same loss. Without these data it is impossible to rule out that slower recovery from the multi-hot objective erases the throughput advantage.
  2. [§3 and Table 2] §3 (Experimental Setup) and Table 2: positive results are reported across model scales, yet the text gives no concrete baseline configurations, hyper-parameter values, number of runs, or statistical significance tests for the loss and downstream improvements. This leaves the “consistently outperforms” statement weakly supported.
minor comments (2)
  1. [§2] The multi-hot cross-entropy loss is described in prose but would be clearer if written as an explicit equation (e.g., showing how the target distribution is formed from the bag of tokens).
  2. [Figures] Figure captions and axis labels in the loss curves should explicitly state whether the x-axis is steps or tokens and whether the plotted loss is the standard or the multi-hot objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen verifiability without altering the core claims or experimental outcomes.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Results): the 2.5× total pre-training time reduction at the 10B A1B scale under equal-loss settings is the central quantitative claim. The manuscript supplies neither the exact step or token budgets allocated to the superposition phase versus the recovery phase nor the loss curves that would demonstrate the sum of both phases is substantially smaller than the baseline step count to the same loss. Without these data it is impossible to rule out that slower recovery from the multi-hot objective erases the throughput advantage.

    Authors: We agree that explicit phase budgets and loss curves would improve independent verification of the net savings. The reported 2.5× figure reflects measured wall-clock time to equal loss on the 10B A1B model, where the superposition phase delivers higher tokens-per-FLOP and the recovery phase restores standard next-token performance without fully offsetting the gain. In the revision we will add a table listing exact step/token allocations per phase for the 10B run and a figure with overlaid loss curves for TST versus baseline, confirming the combined budget is smaller. revision: yes

  2. Referee: [§3 and Table 2] §3 (Experimental Setup) and Table 2: positive results are reported across model scales, yet the text gives no concrete baseline configurations, hyper-parameter values, number of runs, or statistical significance tests for the loss and downstream improvements. This leaves the “consistently outperforms” statement weakly supported.

    Authors: Section 3 describes the overall setup and data, but we acknowledge that hyper-parameter tables, run counts, and significance metrics are not presented with sufficient granularity. The improvements were observed consistently across scales with the same optimizer and learning-rate schedule as the baseline. In revision we will expand §3 and Table 2 (or add an appendix table) with full hyper-parameter values, state that smaller-scale results used three independent runs with reported standard deviations, and include statistical significance where sample sizes permit. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with direct experimental validation

full rationale

The paper introduces TST as a two-phase empirical training procedure (superposition with multi-hot cross-entropy followed by standard recovery) and supports its claims of loss improvement and up to 2.5x time reduction solely through direct measurements on 270M–10B models. No mathematical derivation, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. Results rest on reported loss curves, downstream evaluations, and wall-clock comparisons rather than any self-referential equation or self-citation chain. The central efficiency claim is therefore an observed outcome, not a tautological restatement of an internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.0 · 5717 in / 1025 out tokens · 44544 ms · 2026-05-20T22:49:59.008782+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 8 internal anchors

  1. [1]

    Agarwal, L

    S. Agarwal, L. Ahmad, J. Ai, S. Altman, A. Applebaum, E. Arbus, R. K. Arora, Y . Bai, B. Baker, H. Bao, and others. gpt-oss-120b & gpt-oss-20b model card. 10

  2. [2]

    Alayrac, J

    J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Mil- licah, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Mon- teiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan. Flamingo: a visual language ...

  3. [3]

    L. B. Allal, A. Lozhkov, E. Bakouch, G. M. Bl ´azquez, G. Penedo, L. Tunstall, A. Marafioti, H. Kydl´ıˇcek, A. P. Lajar´ın, V . Srivastav, J. Lochner, C. Fahlgren, X.-S. Nguyen, C. Fourrier, B. Burtenshaw, H. Larcher, H. Zhao, C. Zakka, M. Morlon, C. Raffel, L. v. Werra, and T. Wolf. SmolLM2: When Smol Goes Big – Data-Centric Training of a Small Language ...

  4. [4]

    SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    URLhttp://arxiv.org/abs/2502.02737. arXiv:2502.02737 [cs]

  5. [5]

    Anagnostidis, G

    S. Anagnostidis, G. Bachmann, I. Schlag, and T. Hofmann. Navigating scaling laws: compute optimality in adaptive model training. InProceedings of the 41st International Conference on Machine Learning, volume 235 ofICML’24, pages 1511–1530, Vienna, Austria, 2024. JMLR.org

  6. [6]

    J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, and others. Qwen technical report

  7. [7]

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu, and others. Deepseek llm: Scaling open-source language models with longtermism

  8. [8]

    Y . Bisk, R. Zellers, J. Gao, Y . Choi, et al. Piqa: Reasoning about physical commonsense in natural language. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439, 2020

  9. [9]

    M. Chen, B. Hui, Z. Cui, J. Yang, D. Liu, J. Sun, J. Lin, and Z. Liu. Parallel Scaling Law for Language Models. Oct. 2025. URLhttps://openreview.net/forum?id=dEi1S731lk

  10. [10]

    Cheng, W

    X. Cheng, W. Zeng, D. Dai, Q. Chen, B. Wang, Z. Xie, K. Huang, X. Yu, Z. Hao, Y . Li, and others. Conditional memory via scalable lookup: A new axis of sparsity for large language models

  11. [11]

    Clark, K

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. Boolq: Ex- ploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 confer- ence of the north American chapter of the association for computational linguistics: Human language technologies, volume 1 (long and short papers), pages 2924–2936, 2019

  12. [12]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

  13. [13]

    Cortes and V

    C. Cortes and V . Vapnik. Support-vector networks.Machine learning, 20(3):273–297, 1995

  14. [14]

    D. Dai, C. Deng, C. Zhao, R. Xu, H. Gao, D. Chen, J. Li, W. Zeng, X. Yu, Y . Wu, Z. Xie, Y . Li, P. Huang, F. Luo, C. Ruan, Z. Sui, and W. Liang. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. In L.-W. Ku, A. Martins, and V . Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Com- pu...

  15. [15]

    doi: 10.18653/v1/2024.acl-long.70

    Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.70. URL https://aclanthology.org/2024.acl-long.70/

  16. [16]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Oct. 2020. URLhttps://openreview.net/forum?id=YicbFdNTTy&utm_ campaign=The%20Batch&utm_source=hs_email&utm_medium=email&_...

  17. [17]

    Ebeling and T

    W. Ebeling and T. P ¨oschel. Entropy and Long-Range Correlations in Literary English.Euro- physics Letters, 26(4):241, May 1994. ISSN 0295-5075. doi: 10.1209/0295-5075/26/4/001. URLhttps://doi.org/10.1209/0295-5075/26/4/001

  18. [18]

    S. Y . Gadre, G. Smyrnis, V . Shankar, S. Gururangan, M. Wortsman, R. Shao, J. Mercat, A. Fang, J. Li, S. Keh, and others. Language models scale reliably with over-training and on downstream tasks

  19. [19]

    L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou. A framework for few-shot language model evaluation, 12 2023. URLhttps: //zenodo.org/records/10256836

  20. [20]

    Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

    T. Gigant, B. Peng, and J. Quesnelle. Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation, Apr. 2026. URLhttp://arxiv.org/ abs/2604.27263. arXiv:2604.27263 [cs]

  21. [21]

    Gisserot-Boukhlef, N

    H. Gisserot-Boukhlef, N. Boizard, M. Faysse, D. M. Alves, E. Malherbe, A. Martins, C. Hude- lot, and P. Colombo. Should We Still Pretrain Encoders with Masked Language Modeling? Oct. 2025. URLhttps://openreview.net/forum?id=jpz7e3jhRq

  22. [22]

    Gloeckle, B

    F. Gloeckle, B. Y . Idrissi, B. Rozi `ere, D. Lopez-Paz, and G. Synnaeve. Better & faster large language models via multi-token prediction. InProceedings of the 41st International Confer- ence on Machine Learning, volume 235 ofICML’24, pages 15706–15734, Vienna, Austria,

  23. [23]

    Grattafiori, A

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, and others. The llama 3 herd of models

  24. [24]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300, 2020

  25. [25]

    M. Y . Hu, J. Petty, C. Shi, W. Merrill, and T. Linzen. Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), pages 9691–9709, Vienna, Aus- ...

  26. [26]

    S. Hu, Y . Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y . Fang, Y . Huang, W. Zhao, X. Zhang, Z. L. Thai, K. Zhang, C. Wang, Y . Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li, Z. Liu, and M. Sun. Minicpm: Unveiling the potential of small language mod- els with scalable training strategies, 2024. URLhttps://arxiv.org/abs/2404.06395

  27. [27]

    Hwang, B

    S. Hwang, B. Wang, and A. Gu. Dynamic Chunking for End-to-End Hierarchical Sequence Modeling, July 2025. URLhttp://arxiv.org/abs/2507.07955. arXiv:2507.07955 [cs]

  28. [28]

    A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M.-A. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mixtral of Experts, Jan. 2024. UR...

  29. [29]

    New Record: Multi-token prediction and Untie LM Head 2/3rds through training (119.76 seconds) by varunneal · Pull Request #178 · KellerJordan/modded-nanogpt

    KellerJordan. New Record: Multi-token prediction and Untie LM Head 2/3rds through training (119.76 seconds) by varunneal · Pull Request #178 · KellerJordan/modded-nanogpt. URL https://github.com/KellerJordan/modded-nanogpt/pull/178

  30. [30]

    K. Kim, S. Kotha, P. Liang, and T. Hashimoto. Pre-training under infinite compute, Sept. 2025. URLhttp://arxiv.org/abs/2509.14786. arXiv:2509.14786 [cs]

  31. [31]

    Kudo and J

    T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. InProceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations, pages 66–71. 12

  32. [32]

    D. Lee, S. Han, A. Kumar, and P. Agrawal. Training Language Models via Neural Cellular Automata, 2026. URLhttps://arxiv.org/abs/2603.10055. Version Number: 1

  33. [33]

    Leviathan, M

    Y . Leviathan, M. Kalman, and Y . Matias. Fast inference from transformers via speculative decoding. InProceedings of the 40th International Conference on Machine Learning, volume 202 ofICML’23, pages 19274–19286, Honolulu, Hawaii, USA, 2023. JMLR.org

  34. [34]

    J. Li, A. Fang, G. Smyrnis, M. Ivgi, M. Jordan, S. Gadre, H. Bansal, E. Guha, S. Keh, K. Arora, S. Garg, R. Xin, N. Muennighoff, R. Heckel, J. Mercat, M. Chen, S. Gururan- gan, M. Wortsman, A. Albalak, Y . Bitton, M. Nezhurina, A. Abbas, C.-Y . Hsieh, D. Ghosh, J. Gardner, M. Kilian, H. Zhang, R. Shao, S. Pratt, S. Sanyal, G. Ilharco, G. Daras, K. Marathe...

  35. [35]

    Liang, T

    W. Liang, T. Liu, L. Wright, W. Constable, A. Gu, C.-C. Huang, I. Zhang, W. Feng, H. Huang, J. Wang, S. Purandare, G. Nadathur, and S. Idreos. TorchTitan: One-stop PyTorch native solution for production ready LLM pretraining. Oct. 2024. URLhttps://openreview. net/forum?id=SFN6Wm7YBI

  36. [36]

    A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, and others. Deepseek-v3 technical report

  37. [37]

    A. Liu, J. Hayase, V . Hofmann, S. Oh, N. A. Smith, and Y . Choi. SuperBPE: Space travel for language models

  38. [38]

    H. Liu, J. Zhang, C. Wang, X. Hu, L. Lyu, J. Sun, X. Yang, B. Wang, F. Li, Y . Qian, and others. Scaling embeddings outperforms scaling experts in language models

  39. [39]

    Y . Liu, Y . Song, Y . Wang, K. Ge, A. Lamb, Q. Guo, K. Chen, B. Zhou, and Z. Lin. Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models, Feb. 2026. URLhttp://arxiv.org/abs/2602.08984. arXiv:2602.08984 [cs]

  40. [40]

    Decoupled Weight Decay Regularization

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  41. [41]

    Lozhkov, L

    A. Lozhkov, L. Ben Allal, L. von Werra, and T. Wolf. Fineweb-edu: the finest collection of educational content, 2024. URLhttps://huggingface.co/datasets/HuggingFaceFW/ fineweb-edu

  42. [42]

    H. P. Luhn. The Automatic Creation of Literature Abstracts.IBM Journal of Research and Development, 2(2):159–165, Apr. 1958. ISSN 0018-8646. doi: 10.1147/rd.22.0159. URL https://ieeexplore.ieee.org/document/5392672. Conference Name: IBM Journal of Research and Development

  43. [43]

    Mahajan, S

    D. Mahajan, S. Goyal, B. Y . Idrissi, M. Pezeshki, I. Mitliagkas, D. Lopez-Paz, and K. Ahuja. Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries, Oct. 2025. URL http://arxiv.org/abs/2510.14751. arXiv:2510.14751 [cs]

  44. [44]

    Mihaylov, P

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2381–2391, 2018

  45. [45]

    Mikolov, K

    T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Repre- sentations in Vector Space. Jan. 2013. URLhttps://www.semanticscholar.org/ paper/Efficient-Estimation-of-Word-Representations-in-Mikolov-Chen/ f6b51c8753a871dc94ff32152c00c01e94f90f09

  46. [46]

    Minixhofer, T

    B. Minixhofer, T. Murray, T. Limisiewicz, A. Korhonen, L. Zettlemoyer, N. A. Smith, E. M. Ponti, L. Soldaini, and V . Hofmann. Bolmo: Byteifying the Next Generation of Language Models, Dec. 2025. URLhttp://arxiv.org/abs/2512.15586. arXiv:2512.15586 [cs]. 13

  47. [47]

    S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y . Lin, J.-R. Wen, and C. Li. Large Language Diffusion Models. Oct. 2025. URLhttps://openreview.net/forum? id=KnqiC0znVF

  48. [48]

    Pagnoni, R

    A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. E. We- ston, L. Zettlemoyer, G. Ghosh, M. Lewis, A. Holtzman, and S. Iyer. Byte Latent Transformer: Patches Scale Better Than Tokens. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, edi- tors,Proceedings of the 63rd Annual Meeting of the Association for Computati...

  49. [49]

    Sakaguchi, R

    K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y . Choi. Winogrande: An adversarial winograd schema challenge at scale.Communications of the ACM, 64(9):99–106, 2021

  50. [50]

    Sennrich, B

    R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with sub- word units. InProceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), pages 1715–1725

  51. [51]

    C. E. Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

  52. [52]

    C. Shao, F. Meng, and J. Zhou. Beyond next token prediction: Patch-level training for large language models. InThe Thirteenth International Conference on Learning Representations,

  53. [53]

    URLhttps://openreview.net/forum?id=dDpB23VbVa

  54. [54]

    Sparck Jones

    K. Sparck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 28(1):11–21, 1972

  55. [55]

    Y . Tay, M. Dehghani, V . Q. Tran, X. Garcia, J. Wei, X. Wang, H. W. Chung, D. Bahri, T. Schus- ter, S. Zheng, D. Zhou, N. Houlsby, and D. Metzler. UL2: Unifying Language Learning Paradigms. Sept. 2022. URLhttps://openreview.net/forum?id=6ruVLB727MC

  56. [56]

    Touvron, T

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, and others. Llama: Open and efficient foundation language models

  57. [57]

    Videau, B

    M. Videau, B. Y . Idrissi, A. Leite, M. Schoenauer, O. Teytaud, and D. Lopez-Paz. From Bytes to Ideas: Language Modeling with Autoregressive U-Nets. Oct. 2025. URLhttps: //openreview.net/forum?id=FnFf7Ru2ur

  58. [58]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Mod- els. Oct. 2022. URLhttps://openreview.net/forum?id=_VjQlMeSB_J&trk=public_ post_comment-text

  59. [59]

    K. Wen, Z. Li, J. Wang, D. Hall, P. Liang, and T. Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective, 2024. URLhttps://arxiv.org/ abs/2410.05192

  60. [60]

    A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, and others. Qwen3 technical report

  61. [61]

    J. Yuan, H. Gao, D. Dai, J. Luo, L. Zhao, Z. Zhang, Z. Xie, Y . Wei, L. Wang, Z. Xiao, Y . Wang, C. Ruan, M. Zhang, W. Liang, and W. Zeng. Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention. In W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computatio...

  62. [62]

    Zaheer, G

    M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: transformers for longer sequences. InProceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, pages 17283–17297, Red Hook, NY , USA, 2020. Curran Associates Inc. ISBN 978-1-7138- ...

  63. [63]

    Zellers, A

    R. Zellers, A. Holtzman, Y . Bisk, A. Farhadi, and Y . Choi. Hellaswag: Can a machine re- ally finish your sentence? InProceedings of the 57th annual meeting of the association for computational linguistics, pages 4791–4800, 2019

  64. [64]

    Y . Zhao, A. Gu, R. Varma, L. Luo, C.-C. Huang, M. Xu, L. Wright, H. Shojanazeri, M. Ott, S. Shleifer, A. Desmaison, C. Balioglu, P. Damania, B. Nguyen, G. Chauhan, Y . Hao, A. Math- ews, and S. Li. PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.Proc. VLDB Endow., 16(12):3848–3860, 2023. ISSN 2150-8097. doi: 10.14778/3611540.3611569. URL...

  65. [65]

    Proxy Compression for Language Modeling

    L. Zheng, X. Li, Q. Liu, X. Feng, and L. Kong. Proxy Compression for Language Modeling, Feb. 2026. URLhttp://arxiv.org/abs/2602.04289. arXiv:2602.04289 [cs]

  66. [66]

    R.-J. Zhu, Z. Wang, K. Hua, T. Zhang, Z. Li, H. Que, B. Wei, Z. Wen, F. Yin, H. Xing, and others. Scaling latent reasoning via looped language models

  67. [67]

    Z. M. K. Zuhri, E. H. Fuadi, and A. F. Aji. Predicting the Order of Upcoming Tokens Improves Language Modeling, Feb. 2026. URLhttp://arxiv.org/abs/2508.19228. arXiv:2508.19228 [cs]. 15 A Code 1 2[...] # within train loop 3 4if s u p e r p o s i t i o n _ b a g _ s i z e is not None and s u p e r p o s i t i o n _ b a g _ s i z e > 1: 5bs , seq = inputs . ...