Efficient Pre-Training with Token Superposition

Bowen Peng; Jeffrey Quesnelle; Théo Gigant

arxiv: 2605.06546 · v2 · pith:HLJ5K5JNnew · submitted 2026-05-07 · 💻 cs.CL

Pith reviewed 2026-05-20 22:49 UTC · model grok-4.3

keywords: token superposition · pre-training efficiency · large language models · multi-hot cross-entropy · data throughput · mixture of experts

The pith

Token-Superposition Training trains on bags of tokens with a multi-hot loss in an initial phase, then reverts to standard training, raising data throughput per FLOP.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Token-Superposition Training as a two-phase procedure that first packs many contiguous tokens into a single bag and optimizes them jointly via multi-hot cross-entropy, then switches back to ordinary next-token prediction. This change is presented as a drop-in replacement that leaves the model architecture, optimizer, tokenizer, data mixture, and parallelism untouched. Experiments at 270M, 600M, 3B, and 10B scales show the method reaches lower loss and better downstream scores than a matched baseline while, when training is stopped at equal loss, delivering up to a 2.5-fold reduction in wall-clock pre-training time at the largest scale examined.
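As a rough illustration of the two-phase schedule described above, the following sketch groups a token sequence into bags of contiguous tokens and decides which phase a given step belongs to. The names `make_bags`, `phase_for_step`, `bag_size`, and `superposition_ratio` are hypothetical, and both dropping the trailing remainder and switching phases at a fixed fraction of steps are assumptions of this sketch, not details taken from the paper.

```python
from typing import List


def make_bags(tokens: List[int], bag_size: int) -> List[List[int]]:
    """Group a token sequence into bags of `bag_size` contiguous tokens.

    Assumption: a trailing remainder shorter than `bag_size` is dropped.
    """
    if bag_size < 1:
        raise ValueError("bag_size must be >= 1")
    usable = len(tokens) - len(tokens) % bag_size
    return [tokens[i:i + bag_size] for i in range(0, usable, bag_size)]


def phase_for_step(step: int, total_steps: int, superposition_ratio: float) -> str:
    """Return the phase of a 0-indexed training step.

    Assumption: the first `superposition_ratio` fraction of steps forms the
    superposition phase; every later step belongs to the recovery phase.
    """
    switch_step = int(total_steps * superposition_ratio)
    return "superposition" if step < switch_step else "recovery"


tokens = [11, 12, 13, 14, 15, 16, 17]
bags = make_bags(tokens, bag_size=3)
# Seven tokens become two bag positions; the last token is dropped here.
```

With a bag size of 3, each position the model processes stands for three source tokens, which is where the higher data throughput per step comes from in this simplified picture.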

Core claim

Token-Superposition Training consists of a superposition phase that replaces single-token cross-entropy with a multi-hot objective over bags of contiguous tokens, followed by a recovery phase that restores ordinary single-token training; the combined schedule increases tokens processed per FLOP and, at equal final loss, shortens total pre-training time by up to 2.5 times on a 10B mixture-of-experts model.

What carries the argument

The multi-hot cross-entropy objective applied to bags of contiguous tokens during the superposition phase, which allows one forward-backward pass to update the model on multiple tokens simultaneously.
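A minimal sketch of what a multi-hot cross-entropy over one bag could look like, written in plain Python for readability. It assumes the target distribution spreads probability mass uniformly over the tokens in the bag; the paper may weight tokens differently, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def multi_hot_cross_entropy(logits: Sequence[float], bag: Sequence[int]) -> float:
    """Cross-entropy against a target spread uniformly over the bag's tokens.

    Assumption: every token in the bag gets weight 1/len(bag), so a token
    that appears twice in the bag receives twice the weight.
    """
    if len(bag) == 0:
        raise ValueError("bag must contain at least one token id")
    log_probs = log_softmax(logits)
    return -sum(log_probs[tok] for tok in bag) / len(bag)
```

Under this assumption a bag of size one reduces to the ordinary single-token cross-entropy, which is consistent with the recovery phase being plain next-token training.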

Load-bearing premise

The recovery phase restores model quality after the superposition phase without needing enough extra steps to erase the throughput gains of the first phase.

What would settle it

Run a controlled comparison in which a TST schedule and a standard baseline are trained until both reach the same validation loss; if the TST run needs as much or more total wall-clock time (or compute) than the baseline, the claimed time saving disappears.
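The bookkeeping behind such an equal-loss comparison can be sketched as follows. This is only an accounting helper under stated assumptions: costs are expressed in units of one standard training step, `superposition_step_cost` is a hypothetical measured ratio, and the numbers in the usage lines are invented for illustration rather than taken from the paper.

```python
def time_to_loss_speedup(
    baseline_steps: int,
    superposition_steps: int,
    recovery_steps: int,
    superposition_step_cost: float = 1.0,
) -> float:
    """Ratio of baseline cost to TST cost when both runs reach the same loss.

    Assumption: one superposition step costs `superposition_step_cost`
    standard steps (1.0 means equal cost); recovery steps cost 1.0 each.
    """
    tst_cost = superposition_steps * superposition_step_cost + recovery_steps
    if tst_cost <= 0:
        raise ValueError("TST schedule must contain at least one step")
    return baseline_steps / tst_cost


def max_recovery_steps(
    baseline_steps: int,
    superposition_steps: int,
    superposition_step_cost: float = 1.0,
) -> float:
    """Largest recovery budget for which TST still breaks even with baseline."""
    return baseline_steps - superposition_steps * superposition_step_cost


# Invented numbers: a 1000-step baseline versus 200 + 300 TST steps.
speedup = time_to_loss_speedup(1000, 200, 300)
break_even = max_recovery_steps(1000, 200)
```

A result above 1.0 means TST reached the target loss more cheaply; a result at or below 1.0 means the recovery phase consumed the gain from the superposition phase.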

Figures

Figures reproduced from arXiv: 2605.06546 by Bowen Peng, Jeffrey Quesnelle, Théo Gigant.

**Figure 1.** Loss curves during the pre-training of two Qwen3-like MoE models (10B-A1B) with ...

**Figure 2.** Comparison between standard next token prediction, TST and a few methods that superfi...

**Figure 3.** Same constraints comparisons between baseline training and Token Superposition Train...

**Figure 4.** Superposition results with respect to loss at varying superposition bag sizes and superpo...

**Figure 5.** Downstream evals at varying superposition bag sizes and superposition step ratio

**Figure 6.** Input and Output Superposition ablations, only the recovery phase (ii) is represented.

**Figure 7.** Learning rate sweeps at varying model sizes, with the optimal learning rate being used for ...

**Figure 8.** Comparison between superposition using uniform output loss and power law output loss at ...

**Figure 9.** Training loss after resuming from weighted superposition

**Figure 10.** Mutual information between pairs of tokens in DCLM decays with distance following a ...

**Figure 11.** Rows from top to bottom: Hellaswag and ARC-Easy downstream evals at varying super...

Original abstract

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves the data throughput per FLOPs during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST is done in two phases: (i) A highly efficient superposition phase where we combine many contiguous tokens into one bag and train using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase where we revert back to standard training. We extensively evaluate TST on the scale of 270M and 600M parameters and validate on 3B and a 10B A1B mixture of experts model, demonstrating that it is highly robust in different settings. Ultimately, TST consistently outperforms baseline loss and downstream evaluations, and under equal-loss settings, TST yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TST's bag-based multi-hot phase plus recovery delivers reported speedups at scale, but net gains rest on unshown details about recovery step counts.

The main thing to know is that this paper introduces a two-phase method: first train on bags of contiguous tokens with a multi-hot cross-entropy loss for higher throughput, then switch to ordinary next-token prediction to recover quality. Experiments claim this beats baseline loss and downstream scores while cutting total pre-training time by up to 2.5x at the 10B MoE scale under equal-loss conditions, all without touching architecture, optimizer, or data pipeline.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Token-Superposition Training (TST), a two-phase drop-in method for LLM pre-training. Phase (i) combines contiguous tokens into bags and optimizes a multi-hot cross-entropy objective to raise data throughput per FLOP; phase (ii) reverts to standard next-token training for recovery. Experiments on 270M–10B-scale models (including a 10B A1B MoE) report that TST yields lower loss and better downstream metrics than baseline, and under equal-loss budgets achieves up to 2.5× reduction in wall-clock pre-training time at the 10B scale.

Significance. A verified, architecture- and infrastructure-agnostic efficiency technique would be a practical contribution to lowering the cost of pre-training. The method’s simplicity and reported robustness across scales are positive features. However, the central efficiency claim rests on an empirical comparison whose supporting measurements are not yet reported in sufficient detail to allow independent verification of net savings.

major comments (2)

[Abstract and §4] Abstract and §4 (Results): the 2.5× total pre-training time reduction at the 10B A1B scale under equal-loss settings is the central quantitative claim. The manuscript supplies neither the exact step or token budgets allocated to the superposition phase versus the recovery phase nor the loss curves that would demonstrate the sum of both phases is substantially smaller than the baseline step count to the same loss. Without these data it is impossible to rule out that slower recovery from the multi-hot objective erases the throughput advantage.
[§3 and Table 2] §3 (Experimental Setup) and Table 2: positive results are reported across model scales, yet the text gives no concrete baseline configurations, hyper-parameter values, number of runs, or statistical significance tests for the loss and downstream improvements. This leaves the “consistently outperforms” statement weakly supported.

minor comments (2)

[§2] The multi-hot cross-entropy loss is described in prose but would be clearer if written as an explicit equation (e.g., showing how the target distribution is formed from the bag of tokens).
[Figures] Figure captions and axis labels in the loss curves should explicitly state whether the x-axis is steps or tokens and whether the plotted loss is the standard or the multi-hot objective.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and commit to revisions that strengthen verifiability without altering the core claims or experimental outcomes.

Referee: [Abstract and §4] Abstract and §4 (Results): the 2.5× total pre-training time reduction at the 10B A1B scale under equal-loss settings is the central quantitative claim. The manuscript supplies neither the exact step or token budgets allocated to the superposition phase versus the recovery phase nor the loss curves that would demonstrate the sum of both phases is substantially smaller than the baseline step count to the same loss. Without these data it is impossible to rule out that slower recovery from the multi-hot objective erases the throughput advantage.

Authors: We agree that explicit phase budgets and loss curves would improve independent verification of the net savings. The reported 2.5× figure reflects measured wall-clock time to equal loss on the 10B A1B model, where the superposition phase delivers higher tokens-per-FLOP and the recovery phase restores standard next-token performance without fully offsetting the gain. In the revision we will add a table listing exact step/token allocations per phase for the 10B run and a figure with overlaid loss curves for TST versus baseline, confirming the combined budget is smaller.

Revision: yes

Referee: [§3 and Table 2] §3 (Experimental Setup) and Table 2: positive results are reported across model scales, yet the text gives no concrete baseline configurations, hyper-parameter values, number of runs, or statistical significance tests for the loss and downstream improvements. This leaves the “consistently outperforms” statement weakly supported.

Authors: Section 3 describes the overall setup and data, but we acknowledge that hyper-parameter tables, run counts, and significance metrics are not presented with sufficient granularity. The improvements were observed consistently across scales with the same optimizer and learning-rate schedule as the baseline. In revision we will expand §3 and Table 2 (or add an appendix table) with full hyper-parameter values, state that smaller-scale results used three independent runs with reported standard deviations, and include statistical significance where sample sizes permit.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with direct experimental validation

The paper introduces TST as a two-phase empirical training procedure (superposition with multi-hot cross-entropy followed by standard recovery) and supports its claims of loss improvement and up to 2.5x time reduction solely through direct measurements on 270M–10B models. No mathematical derivation, first-principles prediction, or fitted parameter is presented that reduces to its own inputs by construction. Results rest on reported loss curves, downstream evaluations, and wall-clock comparisons rather than any self-referential equation or self-citation chain. The central efficiency claim is therefore an observed outcome, not a tautological restatement of an internal fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract alone.

pith-pipeline@v0.9.0 · 5717 in / 1025 out tokens · 44544 ms · 2026-05-20T22:49:59.008782+00:00 · methodology
