Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

Akira Nakagawa; Kenichi Kobayashi; Koichi Shirahata; Thang Dang

arxiv: 2605.30080 · v1 · pith:5NQ53LM2new · submitted 2026-05-28 · 💻 cs.CL

Adaptive Targeted Dynamic Chunking for Tokenization-Free Hierarchical Model

Thang Dang , Akira Nakagawa , Kenichi Kobayashi , Koichi Shirahata This is my paper

Pith reviewed 2026-06-29 08:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords adaptive chunkinghierarchical modelsbyte-level processingcurriculum learningcompression ratiotokenization-freelanguage modelingbits-per-byte

0 comments

The pith

Adaptive Targeted Dynamic Chunking uses curriculum learning to progressively increase compression ratios for more stable training in byte-level hierarchical models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adaptive Targeted Dynamic Chunking as a way to manage compression ratios in tokenization-free hierarchical language models. By applying curriculum learning to raise the ratio from low to high over training, the method seeks to avoid instabilities that plague fixed high-compression setups. This leads to competitive bits-per-byte performance against both byte and token baselines on large datasets. It also delivers more stable training and stronger results on downstream tasks while keeping the advantages of byte-level input processing.

Core claim

The authors establish that ATDC, through progressive adjustment of the target compression ratio, produces hierarchical models with competitive Bits-Per-Byte scores, more stable dynamics during training, and better final performance on diverse tasks than models that hold the compression ratio fixed throughout.

What carries the argument

Adaptive Targeted Dynamic Chunking (ATDC), which employs curriculum learning to transition the compression ratio and uses the Bytes-Per-Innermost-Chunk metric to monitor chunk size changes.

Load-bearing premise

That progressively raising the target compression ratio via curriculum learning will reliably stabilize training and produce superior final performance without introducing new instabilities or requiring extensive hyperparameter retuning for each dataset.

What would settle it

Training a hierarchical model with the proposed curriculum on compression ratio and finding that it diverges or ends with higher bits-per-byte than a fixed-ratio control on the FineWeb-Edu dataset.

Figures

Figures reproduced from arXiv: 2605.30080 by Akira Nakagawa, Kenichi Kobayashi, Koichi Shirahata, Thang Dang.

**Figure 1.** Figure 1: Validation BPB. (48.5%). This suggests that direct byte-level modeling captures linguistic nuances and structural information that tokenizationbased preprocessing may obscure. Our implementation consistently achieves the lowest (best) BPB scores in every category, reaching a minimum of 0.760 at the 1.3B parameter scale. In terms of zero-shot reasoning: the ADTC outperforms the base H-Net model in nearly … view at source ↗

**Figure 2.** Figure 2: Validation BPIC curves to track the N values changing during training phase. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Chunk Boundary Visualization of H-Net 1.3B. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Tokenization-free hierarchical models are emerging as a promising alternative to traditional Large Language Models (LLMs), addressing inherent preprocessing issues such as vocabulary design complexity, out-of-vocabulary (OOV) errors, and language-specific constraints. However, a significant challenge in these byte-level methods is the optimization of the compression ratio, a critical factor that dictates model performance for processing bytes data via chunks. In this paper, we propose Adaptive Targeted Dynamic Chunking (ATDC), a novel byte-compression control mechanism designed to enhance the effectiveness of dynamic chunking within hierarchical architectures. Our approach utilizes curriculum learning to progressively adjust the compression ratio during training, transitioning from low to high compression to stabilize the learning process. We provide an analysis establishing the relationship between the target compression ratio and Bytes-Per-Innermost-Chunk (BPIC), allowing for tracking of chunk-size evolution throughout the training phase. Evaluations conducted on the FineWeb-Edu 100B dataset demonstrate that hierarchical models equipped with ATDC achieve competitive Bits-Per-Byte (BPB) performance compared to conventional baselines operating at both byte and token levels. Furthermore, the proposed method exhibits more stable training dynamics and superior final performance across diverse downstream tasks compared to models using fixed compression ratios, while maintaining the inherent robustness and flexibility of byte-level processing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ATDC uses curriculum learning to ramp compression ratios in byte-level hierarchical models, but the abstract gives no numbers or ablations so the stability claims are unverified.

read the letter

The main point is that this paper introduces Adaptive Targeted Dynamic Chunking (ATDC), a curriculum schedule that starts with low compression and gradually raises the target ratio during training of hierarchical byte models. It claims this produces more stable dynamics and stronger downstream results than fixed-ratio versions while staying competitive on bits-per-byte.

The new element is the specific use of curriculum learning to control the compression ratio in dynamic chunking, plus the BPIC analysis that lets them track how chunk sizes change over training. That framing is a reasonable way to think about the problem in tokenization-free setups.

The work does a clear job laying out why compression ratio matters for these models and why byte-level processing keeps certain advantages. The idea of starting easy and increasing difficulty is a standard curriculum trick applied here in a targeted way.

The soft spots are straightforward. The abstract supplies no actual BPB numbers, no error bars, no ablation tables against fixed high-compression controls, and no description of the exact schedule or how total compute was matched. Without those details it is impossible to know whether the reported stability and gains are real or just the result of extra tuning effort. The stress-test worry about dataset-specific retuning looks reasonable given the missing evidence.

This paper is for the small group of people already working on byte-level hierarchical architectures. A reader in that niche might pick up the curriculum idea, but the current write-up is too light on data to change practice.

It deserves a serious referee so the experiments can be examined directly; the core idea is coherent enough to warrant that step even if heavy revision is likely needed.

Referee Report

2 major / 2 minor

Summary. The paper proposes Adaptive Targeted Dynamic Chunking (ATDC), a curriculum-learning mechanism that progressively raises the target compression ratio during training of tokenization-free hierarchical byte-level models. It includes an analysis relating target compression ratio to Bytes-Per-Innermost-Chunk (BPIC) for tracking chunk-size evolution. On the FineWeb-Edu 100B dataset, the authors claim that ATDC-equipped models achieve competitive Bits-Per-Byte (BPB) relative to byte- and token-level baselines while exhibiting more stable training dynamics and better downstream performance than fixed-ratio controls.

Significance. If the empirical results are robust, ATDC would offer a practical route to stable training of hierarchical byte-level models, preserving their robustness and flexibility advantages over tokenization while addressing compression-ratio optimization. The BPIC analysis provides a useful diagnostic tool for monitoring dynamic chunking behavior.

major comments (2)

[Abstract] Abstract: the central empirical claims of competitive BPB, more stable training dynamics, and superior downstream performance are stated without any quantitative values, error bars, ablation details, dataset splits, or compute-matched controls. This directly affects verifiability of the main result.
[§3 (Method) and §4 (Experiments)] The curriculum-learning claim (progressively raising compression ratio stabilizes training and outperforms fixed ratios) is load-bearing for the contribution, yet the manuscript supplies no schedule design details, ablation against fixed high-compression baselines, or hyperparameter-sensitivity analysis. Without these, it remains possible that observed gains reflect unequal optimization effort rather than the adaptive mechanism.

minor comments (2)

[§3.1] The relationship between target compression ratio and BPIC is described as an analysis tool; formalizing it as an equation or derivation would improve clarity and reproducibility.
[Figures 2-4] Figure captions and axis labels for training curves should explicitly state the compression-ratio schedule and baseline configurations for direct visual comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the clarity and verifiability of our results. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central empirical claims of competitive BPB, more stable training dynamics, and superior downstream performance are stated without any quantitative values, error bars, ablation details, dataset splits, or compute-matched controls. This directly affects verifiability of the main result.

Authors: We agree that the abstract would benefit from greater specificity to aid verifiability. The current abstract summarizes qualitative outcomes, but the body of the manuscript contains the supporting quantitative results on FineWeb-Edu 100B. In revision we will update the abstract to report key BPB numbers relative to byte- and token-level baselines, note the use of curriculum progression for stability, reference downstream task improvements, and explicitly mention the dataset and fixed-ratio controls. Where multiple runs exist we will add error bars or stability indicators; otherwise we will qualify single-run results. revision: yes
Referee: [§3 (Method) and §4 (Experiments)] The curriculum-learning claim (progressively raising compression ratio stabilizes training and outperforms fixed ratios) is load-bearing for the contribution, yet the manuscript supplies no schedule design details, ablation against fixed high-compression baselines, or hyperparameter-sensitivity analysis. Without these, it remains possible that observed gains reflect unequal optimization effort rather than the adaptive mechanism.

Authors: This point is well taken; additional methodological transparency is required. While §3 describes the progressive adjustment of the target compression ratio via curriculum learning, we will expand it to include the exact schedule (e.g., the functional form and step-wise increments of the target ratio), the BPIC tracking equations, and all relevant hyperparameters. In §4 we will add explicit ablations that compare ATDC against fixed high-compression-ratio baselines under matched compute budgets, and we will report any hyperparameter sensitivity results already obtained or note their absence. These changes will directly address the possibility of unequal optimization effort. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on external dataset evaluations

full rationale

The paper reports experimental results on FineWeb-Edu 100B comparing ATDC-equipped hierarchical models against byte- and token-level baselines, claiming competitive BPB, more stable dynamics, and superior downstream performance. No equations, derivations, or self-citations are presented that reduce these outcomes to fitted inputs or definitional loops by construction. The mentioned analysis of target compression ratio versus BPIC is framed as a tracking tool rather than a load-bearing premise that forces the results. The curriculum-learning schedule and performance claims are therefore self-contained against the reported benchmarks and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the method is described at the level of a training schedule without mathematical derivation or new postulated objects.

pith-pipeline@v0.9.1-grok · 5762 in / 1210 out tokens · 21984 ms · 2026-06-29T08:01:12.467696+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages · 5 internal anchors

[1]

Regularizing data for improving execution time of nlp model,

T. Dang, Y . Sakai, T. Tabaru, and A. Kasagi, “Regularizing data for improving execution time of nlp model,” inThe International FLAIRS Conference Proceedings, vol. 35, 2022

2022
[2]

Efficient and large scale pre-training techniques for japanese natural language processing,

A. Kasagi, M. Asaoka, A. Tabuchi, Y . Oyama, T. Honda, Y . Sakai, T. Dang, and T. Tabaru, “Efficient and large scale pre-training techniques for japanese natural language processing,” in2021 Ninth International Symposium on Computing and Networking (CANDAR). IEEE, 2021, pp. 108–113

2021
[3]

Sentencepiece: A simple and language inde- pendent subword tokenizer and detokenizer for neural text processing,

T. Kudo and J. Richardson, “Sentencepiece: A simple and language inde- pendent subword tokenizer and detokenizer for neural text processing,” inProceedings of the 2018 Conference on EMNLP, 2018, pp. 66–71

2018
[4]

Boundless byte pair encoding: Breaking the pre-tokenization barrier,

C. W. Schmidt, V . Reddy, C. Tanner, and Y . Pinter, “Boundless byte pair encoding: Breaking the pre-tokenization barrier,” inSecond CLM, 2025

2025
[5]

Fast wordpiece tokenization,

X. Song, A. Salcianu, Y . Song, D. Dopson, and D. Zhou, “Fast wordpiece tokenization,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 2089–2103

2021
[6]

Hierarchical autoregressive transformers: Combining byte- and word-level processing for robust, adaptable language models,

P. Neitemeier, B. Deiseroth, C. Eichenberg, and L. Balles, “Hierarchical autoregressive transformers: Combining byte- and word-level processing for robust, adaptable language models,” inThe 13th ICLR, 2025

2025
[7]

Do all languages cost the same? tokenization in the era of commercial language models,

O. Ahia, S. Kumar, H. Gonen, J. Kasai, D. R. Mortensen, N. A. Smith, and Y . Tsvetkov, “Do all languages cost the same? tokenization in the era of commercial language models,” inThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023
[8]

Language model tokenizers introduce unfairness between languages,

A. Petrov, E. La Malfa, P. Torr, and A. Bibi, “Language model tokenizers introduce unfairness between languages,”Advances in NeurIPS, vol. 36, pp. 36 963–36 990, 2023

2023
[9]

Byte latent transformer: Patches scale better than tokens,

A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. E. Weston, L. Zettlemoyeret al., “Byte latent transformer: Patches scale better than tokens,” inProceedings of the 63rd the ALC (V olume 1: Long Papers), 2025, pp. 9238–9258

2025
[10]

Hierarchical transformers are more efficient language models,

P. Nawrot, S. Tworkowski, M. Tyrolski, Ł. Kaiser, Y . Wu, C. Szegedy, and H. Michalewski, “Hierarchical transformers are more efficient language models,” inFindings of the ACL: NAACL 2022, 2022, pp. 1559–1571

2022
[11]

Dynamic chunking for end-to-end hierarchical sequence modeling,

S. Hwang, B. Wang, and A. Gu, “Dynamic chunking for end-to-end hierarchical sequence modeling,” inThe 14th ICLR, 2026

2026
[12]

The fineweb datasets: Decanting the web for the finest text data at scale,

G. Penedo, H. Kydl ´ıˇcek, A. Lozhkov, M. Mitchell, C. A. Raffel, L. V on Werra, T. Wolfet al., “The fineweb datasets: Decanting the web for the finest text data at scale,”Advances in NeurIPS, vol. 37, pp. 30 811–30 849, 2024

2024
[13]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Byt5: Towards a token-free future with pre-trained byte-to-byte models,

L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, and C. Raffel, “Byt5: Towards a token-free future with pre-trained byte-to-byte models,”Transactions of the ACL, vol. 10, pp. 291–306, 2022

2022
[15]

MEGABYTE: Predicting million-byte sequences with mul- tiscale transformers,

L. YU, D. Simig, C. Flaherty, A. Aghajanyan, L. Zettlemoyer, and M. Lewis, “MEGABYTE: Predicting million-byte sequences with mul- tiscale transformers,” inThirty-seventh Conference on NeurIPS, 2023

2023
[16]

Mambabyte: Token-free selective state space model,

J. Wang, T. Gangavarapu, J. N. Yan, and A. M. Rush, “Mambabyte: Token-free selective state space model,” inFirst Conference on Lan- guage Modeling, 2024

2024
[17]

Efficiently modeling long sequences with structured state spaces,

A. Gu, K. Goel, and C. Re, “Efficiently modeling long sequences with structured state spaces,” inInternational Conference on Learning Representations, 2022

2022
[18]

H-net++: Hierarchical dynamic chunking for tokenizer-free language modelling in morphologically-rich languages,

M. Zakershahrak and S. Ghodratnama, “H-net++: Hierarchical dynamic chunking for tokenizer-free language modelling in morphologically-rich languages,”arXiv preprint arXiv:2508.05628, 2025

work page arXiv 2025
[19]

Transformers are SSMs: Generalized models and ef- ficient algorithms through structured state space duality,

T. Dao and A. Gu, “Transformers are SSMs: Generalized models and ef- ficient algorithms through structured state space duality,” inProceedings of the 41st ICML, vol. 235, 2024, pp. 10 041–10 071

2024
[20]

GLU Variants Improve Transformer

N. Shazeer, “Glu variants improve transformer,”arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002
[21]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Y . Bengio, N. L ´eonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,”arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[22]

Exponential moving average of weights in deep learning: Dynamics and benefits,

D. Morales-Brotons, T. V ogels, and H. Hendrikx, “Exponential moving average of weights in deep learning: Dynamics and benefits,”Transac- tions on Machine Learning Research, 2024

2024
[23]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

S. Hu, Y . Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y . Fang, Y . Huang, W. Zhaoet al., “Minicpm: Unveiling the potential of small language models with scalable training strategies,”arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Simple and scalable strategies to continually pre-train large language models,

A. Ibrahim, B. Th ´erien, K. Gupta, M. L. Richter, Q. G. Anthony, E. Belilovsky, T. Lesort, and I. Rish, “Simple and scalable strategies to continually pre-train large language models,”TMLR, 2024

2024

[1] [1]

Regularizing data for improving execution time of nlp model,

T. Dang, Y . Sakai, T. Tabaru, and A. Kasagi, “Regularizing data for improving execution time of nlp model,” inThe International FLAIRS Conference Proceedings, vol. 35, 2022

2022

[2] [2]

Efficient and large scale pre-training techniques for japanese natural language processing,

A. Kasagi, M. Asaoka, A. Tabuchi, Y . Oyama, T. Honda, Y . Sakai, T. Dang, and T. Tabaru, “Efficient and large scale pre-training techniques for japanese natural language processing,” in2021 Ninth International Symposium on Computing and Networking (CANDAR). IEEE, 2021, pp. 108–113

2021

[3] [3]

Sentencepiece: A simple and language inde- pendent subword tokenizer and detokenizer for neural text processing,

T. Kudo and J. Richardson, “Sentencepiece: A simple and language inde- pendent subword tokenizer and detokenizer for neural text processing,” inProceedings of the 2018 Conference on EMNLP, 2018, pp. 66–71

2018

[4] [4]

Boundless byte pair encoding: Breaking the pre-tokenization barrier,

C. W. Schmidt, V . Reddy, C. Tanner, and Y . Pinter, “Boundless byte pair encoding: Breaking the pre-tokenization barrier,” inSecond CLM, 2025

2025

[5] [5]

Fast wordpiece tokenization,

X. Song, A. Salcianu, Y . Song, D. Dopson, and D. Zhou, “Fast wordpiece tokenization,” inProceedings of the 2021 conference on empirical methods in natural language processing, 2021, pp. 2089–2103

2021

[6] [6]

Hierarchical autoregressive transformers: Combining byte- and word-level processing for robust, adaptable language models,

P. Neitemeier, B. Deiseroth, C. Eichenberg, and L. Balles, “Hierarchical autoregressive transformers: Combining byte- and word-level processing for robust, adaptable language models,” inThe 13th ICLR, 2025

2025

[7] [7]

Do all languages cost the same? tokenization in the era of commercial language models,

O. Ahia, S. Kumar, H. Gonen, J. Kasai, D. R. Mortensen, N. A. Smith, and Y . Tsvetkov, “Do all languages cost the same? tokenization in the era of commercial language models,” inThe 2023 Conference on Empirical Methods in Natural Language Processing, 2023

2023

[8] [8]

Language model tokenizers introduce unfairness between languages,

A. Petrov, E. La Malfa, P. Torr, and A. Bibi, “Language model tokenizers introduce unfairness between languages,”Advances in NeurIPS, vol. 36, pp. 36 963–36 990, 2023

2023

[9] [9]

Byte latent transformer: Patches scale better than tokens,

A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. E. Weston, L. Zettlemoyeret al., “Byte latent transformer: Patches scale better than tokens,” inProceedings of the 63rd the ALC (V olume 1: Long Papers), 2025, pp. 9238–9258

2025

[10] [10]

Hierarchical transformers are more efficient language models,

P. Nawrot, S. Tworkowski, M. Tyrolski, Ł. Kaiser, Y . Wu, C. Szegedy, and H. Michalewski, “Hierarchical transformers are more efficient language models,” inFindings of the ACL: NAACL 2022, 2022, pp. 1559–1571

2022

[11] [11]

Dynamic chunking for end-to-end hierarchical sequence modeling,

S. Hwang, B. Wang, and A. Gu, “Dynamic chunking for end-to-end hierarchical sequence modeling,” inThe 14th ICLR, 2026

2026

[12] [12]

The fineweb datasets: Decanting the web for the finest text data at scale,

G. Penedo, H. Kydl ´ıˇcek, A. Lozhkov, M. Mitchell, C. A. Raffel, L. V on Werra, T. Wolfet al., “The fineweb datasets: Decanting the web for the finest text data at scale,”Advances in NeurIPS, vol. 37, pp. 30 811–30 849, 2024

2024

[13] [13]

The Llama 3 Herd of Models

A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Byt5: Towards a token-free future with pre-trained byte-to-byte models,

L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, and C. Raffel, “Byt5: Towards a token-free future with pre-trained byte-to-byte models,”Transactions of the ACL, vol. 10, pp. 291–306, 2022

2022

[15] [15]

MEGABYTE: Predicting million-byte sequences with mul- tiscale transformers,

L. YU, D. Simig, C. Flaherty, A. Aghajanyan, L. Zettlemoyer, and M. Lewis, “MEGABYTE: Predicting million-byte sequences with mul- tiscale transformers,” inThirty-seventh Conference on NeurIPS, 2023

2023

[16] [16]

Mambabyte: Token-free selective state space model,

J. Wang, T. Gangavarapu, J. N. Yan, and A. M. Rush, “Mambabyte: Token-free selective state space model,” inFirst Conference on Lan- guage Modeling, 2024

2024

[17] [17]

Efficiently modeling long sequences with structured state spaces,

A. Gu, K. Goel, and C. Re, “Efficiently modeling long sequences with structured state spaces,” inInternational Conference on Learning Representations, 2022

2022

[18] [18]

H-net++: Hierarchical dynamic chunking for tokenizer-free language modelling in morphologically-rich languages,

M. Zakershahrak and S. Ghodratnama, “H-net++: Hierarchical dynamic chunking for tokenizer-free language modelling in morphologically-rich languages,”arXiv preprint arXiv:2508.05628, 2025

work page arXiv 2025

[19] [19]

Transformers are SSMs: Generalized models and ef- ficient algorithms through structured state space duality,

T. Dao and A. Gu, “Transformers are SSMs: Generalized models and ef- ficient algorithms through structured state space duality,” inProceedings of the 41st ICML, vol. 235, 2024, pp. 10 041–10 071

2024

[20] [20]

GLU Variants Improve Transformer

N. Shazeer, “Glu variants improve transformer,”arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2002

[21] [21]

Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

Y . Bengio, N. L ´eonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,”arXiv preprint arXiv:1308.3432, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[22] [22]

Exponential moving average of weights in deep learning: Dynamics and benefits,

D. Morales-Brotons, T. V ogels, and H. Hendrikx, “Exponential moving average of weights in deep learning: Dynamics and benefits,”Transac- tions on Machine Learning Research, 2024

2024

[23] [23]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[24] [24]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

S. Hu, Y . Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y . Fang, Y . Huang, W. Zhaoet al., “Minicpm: Unveiling the potential of small language models with scalable training strategies,”arXiv preprint arXiv:2404.06395, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[25] [25]

Simple and scalable strategies to continually pre-train large language models,

A. Ibrahim, B. Th ´erien, K. Gupta, M. L. Richter, Q. G. Anthony, E. Belilovsky, T. Lesort, and I. Rish, “Simple and scalable strategies to continually pre-train large language models,”TMLR, 2024

2024