Proxy Compression for Language Modeling

Lingpeng Kong; Lin Zheng; Qian Liu; Xiachong Feng; Xinyu Li

arxiv: 2602.04289 · v2 · pith:7KV2TX4Snew · submitted 2026-02-04 · 💻 cs.CL · cs.LG

Proxy Compression for Language Modeling

Lin Zheng , Xinyu Li , Qian Liu , Xiachong Feng , Lingpeng Kong This is my paper

Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords proxy compressionbyte-level modelingraw byteslanguage modelingtraining efficiencytokenizer alternativescode language modeling

0 comments

The pith

Proxy compression trains language models jointly on raw bytes and compressed sequences so they can use efficient inputs during training yet run purely on raw bytes at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern language models depend on external tokenizers that compress UTF-8 bytes into fixed token sequences, locking the model to that particular compressor. Proxy compression instead trains one model on both raw byte sequences and compressed views produced by external compressors at the same time. Through this joint training the model learns an internal alignment between the two formats. The alignment supports strong transfer, so the model can be trained mostly on compressed inputs that improve efficiency yet still perform well when the compressor is removed and only raw bytes remain at inference. Experiments on code language modeling show these efficiency gains grow with model scale until proxy-trained models match or exceed tokenizer approaches while keeping byte-level robustness.

Core claim

A single language model jointly trained on raw byte sequences and compressed views from external compressors learns an internal alignment between the formats; this alignment enables effective transfer so that training can use predominantly compressed inputs for efficiency while inference runs solely on raw bytes without performance loss or continued need for the compressor.

What carries the argument

Joint training on paired raw-byte sequences and their compressed counterparts to induce internal format alignment that transfers at inference.

If this is right

Training efficiency improves substantially over pure byte-level models under fixed compute budgets.
Performance gains become more pronounced as model scale increases.
Proxy-trained models eventually match or surpass traditional tokenizer-based approaches.
Models retain byte-level robustness while operating solely on raw bytes at inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce the need to choose and maintain a single tokenizer for deployment across different domains.
Similar joint-training alignments might extend to sequence tasks outside code, such as multilingual text or structured data.
The observed scaling pattern suggests hybrid training regimes could allocate compute differently between compressed and raw views in future models.

Load-bearing premise

The internal alignment learned from joint training transfers effectively to pure raw-byte inference without needing the compressed inputs at test time.

What would settle it

Large-scale code modeling runs in which proxy-trained models underperform pure byte-level baselines on raw-byte benchmarks would show the alignment does not transfer as claimed.

Figures

Figures reproduced from arXiv: 2602.04289 by Lingpeng Kong, Lin Zheng, Qian Liu, Xiachong Feng, Xinyu Li.

**Figure 1.** Figure 1: Overview of proxy compression for language modeling. During training, we prepare mixed-representation inputs by combining compressed sequences with raw UTF-8 bytes, which are packed together to train a single language model with next-symbol prediction over both representations. Different representations are associated with special sentinels, such as ⟨comp⟩, ⟨/comp⟩ for compressed data and ⟨raw⟩ and ⟨/raw⟩ … view at source ↗

**Figure 2.** Figure 2: Model performance (Pass@1) on MBPP-Plus across model scales. Bars show absolute performance (left axis); lines show the performance gap (∆) relative to the tokenizer baseline (right axis). While byte-level models exhibit a persistent or widening gap, proxy-based models progressively close the gap as the model scale increases. based compression fails to transfer effectively. 2. Proxy Compression We introdu… view at source ↗

**Figure 3.** Figure 3: Pass@1 performance on HumanEval-Plus for 14B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right). competitively, with neural compression achieving slightly lower compression (2.6×) than the tokenizer-based compressor (2.9×), yet still matching its performance at scale. Data versus Compute Efficiency. Byte-level and tokenizer-ba… view at source ↗

**Figure 4.** Figure 4: Compressor stability analysis under input perturbation: we apply random 10% character deletion to 80K samples and measure normalized Levenshtein distance between compressed outputs before and after perturbation. for Warmup-only, likely due to better effective compression when samples are not duplicated as pairs (Appendix A). This suggests transfer operates at a deeper semantic level beyond literal sequence… view at source ↗

**Figure 6.** Figure 6: Collision statistics for the neural compressor. The xaxis is the number of distinct byte chunks that collide on the same compressed segment, and the y-axis is the number of such compressed segments in log scale. 3.5. Robustness Evaluation Byte-level models have been shown to be more robust to input perturbations than tokenizer-based models (Pagnoni et al., 2024; Hwang et al., 2025). In this section, we in… view at source ↗

**Figure 7.** Figure 7: Robust pass (left, higher is better) and robust drop (right, lower is better) performance on ReCode per perturbation family with different models [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Inference-time interface comparison on HumanEval-Plus (left) and MBPP-Plus (right) for 14B token-proxy models. 0.0 0.2 0.4 0.6 0.8 1.0 Proportion of Raw Bytes 0.08 0.10 0.12 0.14 0.16 0.18 HumanEval-Plus pass@1 Inference on Bytes Inference on Tokens Doc Count (Normalized) 0.5 1.0 1.5 2.0 2.5 3.0 3.5 Doc Count (Normalized) [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 10.** Figure 10: Ablation on model architecture on validation BPB (left) and HumanEval pass@1 (right) [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Oracle translation pass@1 on neural proxy compression. Solid lines represent continual translation pairs (ALWAYS-ON), while dashed lines indicate abrupt removal after 10k steps (WARMUP-ONLY). 0.0 0.2 0.4 0.6 0.8 1.0 LCP ratio 0 5000 10000 15000 20000 25000 Compressed symbols [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Distribution of the LCP ratio across all compressed symbols that exhibit collisions. The distribution is heavily skewed towards high similarity (LCP ratio > 0.8), showing that collisions typically differ only in short suffixes. (>94%) throughout training, confirming that a small fraction of explicit pairs is both necessary and sufficient for reliable in-context transfer. D.3. Additional Results of Neural … view at source ↗

**Figure 13.** Figure 13: Format perturbations. et al., 2023), we consider four perturbation families: function name rewrites (Function), formatting changes (Format), syntactic rewrites (Syntax), and docstring paraphrases (Docstrings). We report the nominal pass rate where no perturbations are applied. We measure robustness based on three metrics: Robust Pass RPs@k, measuring worst-case pass@k under s variants of the same problem.… view at source ↗

**Figure 14.** Figure 14: Function perturbations. Tokenizer Byte-level Proxy (Tokenizer) Proxy (Tokenizer-tokens) Proxy (Neural) 0,1 0,12 0,14 0,12 0,11 0,14 0,13 0,13 RP (a) RP (↑) Tokenizer Byte-level Proxy (Tokenizer) Proxy (Tokenizer-tokens) Proxy (Neural) 0,5 0,55 0,6 0,65 0,63 0,58 0,58 0,59 0,53 RD (b) RD (↓) Tokenizer Byte-level Proxy (Tokenizer) Proxy (Tokenizer-tokens) Proxy (Neural) 0,2 0,3 0,4 0,5 0,6 0,22 0,5 0,5 0,51… view at source ↗

**Figure 15.** Figure 15: Syntax perturbations. fine-grained representations waste compute greatly and lack sufficient abstraction for effective learning, even when granted additional training steps or data. We observe consistent trends in longer training runs (Appendix D.7). These findings motivate representations that balance abstraction with granularity, rather than pursuing either extreme. D.6. On Document-boundary Attention M… view at source ↗

**Figure 16.** Figure 16: Docstring perturbations. 0.5 1.0 1.5 2.0 Total Training FLOPs 1e21 0.20 0.25 0.30 0.35 0.40 0.45 Validation Bits-per-Byte (BPB) Tokenizer (BPE) Double-byte (16-bit) Byte (8-bit) Halfbyte (4-bit) Double-bit (2-bit) Bit (1-bit) 0 50 100 150 Training Bytes (B) 0.20 0.25 0.30 0.35 0.40 0.45 Validation Bits-per-Byte (BPB) Tokenizer (BPE) Double-byte (16-bit) Byte (8-bit) Halfbyte (4-bit) Double-bit (2-bit) Bit… view at source ↗

**Figure 17.** Figure 17: Validation BPB performance of different data representations spanning tokens to bits, under FLOPs-matched (left) and data-matched (right) comparison. Scaling Behavior. Figures 19 to 22 visualize performance under matched FLOPs (left) and matched data (right) for models at 0.5B, 1.5B, 4B, and 7B parameters. At smaller scales (0.5B), proxy-trained models underperform baselines in both regimes; however, as m… view at source ↗

**Figure 18.** Figure 18: Ablation on document-boundary attention masking. 1 2 3 4 5 6 Total Training FLOPs 1e20 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 Pass@1 Byte-level Proxy Compressor (Neural) Proxy Compressor (Tokenizer) Tokenizer-based 50 100 150 200 250 300 350 Training Bytes (B) 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18 Pass@1 Byte-level Proxy Compressor (Neural) Proxy Compressor (Tokenizer) Tokenizer-based [PITH_… view at source ↗

**Figure 19.** Figure 19: Pass@1 performance on HumanEval-Plus for 0.5B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right). compression yields more optimization steps and thus more compute per byte, allowing better fitting of the data distribution. D.8. Pairing Strategy Ablation As described in §2.1, we optionally pair compressed and raw views of the sa… view at source ↗

**Figure 20.** Figure 20: Pass@1 performance on HumanEval-Plus for 1.5B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right). 0.5 1.0 1.5 2.0 2.5 Total Training FLOPs 1e21 0.10 0.15 0.20 0.25 Pass@1 Byte-level Proxy Compressor (Neural) Proxy Compressor (Tokenizer) Tokenizer-based 50 100 150 200 250 300 350 Training Bytes (B) 0.10 0.15 0.20 0.25 Pass@1 Byt… view at source ↗

**Figure 21.** Figure 21: Pass@1 performance on HumanEval-Plus for 4B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right). D.9. Transfer Strength under Controlled Raw-Byte Exposure To isolate the effect of transfer from the effect of increased data volume, we compare two training configurations with matched raw-byte exposure but different total training … view at source ↗

**Figure 22.** Figure 22: Pass@1 performance on HumanEval-Plus for 7B models under different input representations, compared as a function of training FLOPs (left) and amount of training data (right) [PITH_FULL_IMAGE:figures/full_fig_p031_22.png] view at source ↗

**Figure 23.** Figure 23: HumanEval-Plus pass@10 for 1.5B models under extended training versus training FLOPs (left) and data consumed (right). We report pass@10 for stability across checkpoints during pretraining. 0.5 1.0 1.5 2.0 Total Training FLOPs 1e22 0.10 0.15 0.20 0.25 0.30 0.35 HumanEval-Plus Pass@10 Byte-level Proxy (Neural) Proxy (Tokenizer) Tokenizer-based 0 200 400 600 800 1000 Training Bytes (B) 0.10 0.15 0.20 0.25 0… view at source ↗

**Figure 24.** Figure 24: HumanEval-Plus pass@10 for 7B models under extended training versus training FLOPs (left) and data consumed (right). We report pass@10 for stability across checkpoints during pretraining [PITH_FULL_IMAGE:figures/full_fig_p032_24.png] view at source ↗

**Figure 25.** Figure 25: Ablation on format sentinels with gzip proxy compression at 1.5B (left; Python corpus) and 7B (right; full corpus) scales. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_25.png] view at source ↗

read the original abstract

Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Proxy compression trains on mixed raw-byte and compressed views to learn an internal alignment, delivering efficiency gains that scale up to match tokenizers on code modeling while keeping a pure byte interface at inference.

read the letter

Proxy compression trains a single model jointly on raw byte sequences and compressed views from external compressors so the model learns to align the two internally. At inference the compressed inputs are dropped and the model runs on raw bytes alone. On code language modeling this beats pure byte baselines under fixed compute and the gap widens with scale until the proxy models match or exceed standard tokenizer performance while retaining byte-level robustness. The released code makes the recipe easy to inspect and reproduce. That combination of a new joint-training procedure and reported scaling behavior is the concrete advance here. The results are useful for anyone trying to reduce tokenizer dependence in code or multilingual settings. The main soft spot is the transfer claim itself. The abstract does not spell out how much gradient signal reaches the raw-byte pathway during training or whether the alignment holds across different compressors once the compressed views are removed. If the raw path is under-trained, the efficiency story could weaken at inference. More ablations on that point and clearer baseline details would tighten the evidence. This is the kind of practical training modification that deserves a serious referee. The idea is straightforward, the empirical direction is testable, and the code is available, so an editor should send it out even if revisions are needed on the transfer experiments.

Referee Report

3 major / 2 minor

Summary. The paper introduces proxy compression, a training scheme in which a single language model is jointly optimized on raw UTF-8 byte sequences and compressed views produced by external compressors. The model is claimed to learn an internal alignment between the two formats, allowing the compressed inputs to be discarded at inference while preserving efficiency gains. Experiments on code language modeling report that proxy-trained models substantially outperform pure byte-level baselines under fixed compute budgets, with the advantage increasing with scale until the models match or surpass tokenizer-based approaches, all while retaining an end-to-end raw-byte interface and byte-level robustness.

Significance. If the transfer result holds, the work would be significant for byte-level language modeling: it decouples training efficiency from the inference interface, potentially allowing models to enjoy compression benefits without locking into a fixed tokenizer. The reported scaling trend and robustness claims are noteworthy, as is the open-sourced code for reproducibility.

major comments (3)

[Experiments] The central claim requires that joint training produces an alignment supporting full transfer to pure raw-byte inference without degradation. The manuscript does not report an ablation training exclusively on compressed inputs followed by raw-byte evaluation, nor does it quantify the fraction of raw-byte examples or provide gradient-norm analysis on the raw pathway (Experiments section).
[Experiments] Table 2 and Figure 4 report efficiency gains and scaling crossovers, but lack exact baseline compute budgets, data exclusion rules, number of random seeds, and statistical significance tests, leaving the strength of the 'significantly outperforms' and 'eventually match or surpass' claims difficult to assess.
[Method] The method description does not specify how the loss is balanced between raw and compressed views or whether the raw-byte pathway receives sufficient gradient signal when compressed examples dominate training.

minor comments (2)

[Introduction] Clarify the precise definition of 'proxy compression' versus standard multi-view training in the introduction and method sections to avoid potential confusion with prior multi-task or auxiliary-input approaches.
[Abstract] The abstract states 'our code is available'; ensure the repository contains the exact training scripts, hyper-parameters, and evaluation code used for the reported results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns regarding experimental validation, reproducibility details, and method clarity. Below we respond point by point.

read point-by-point responses

Referee: [Experiments] The central claim requires that joint training produces an alignment supporting full transfer to pure raw-byte inference without degradation. The manuscript does not report an ablation training exclusively on compressed inputs followed by raw-byte evaluation, nor does it quantify the fraction of raw-byte examples or provide gradient-norm analysis on the raw pathway (Experiments section).

Authors: We agree that an explicit ablation on exclusive compressed training would further isolate the benefit of joint optimization. In the revised manuscript we add this experiment (new Table 3), which shows clear degradation on raw-byte evaluation for compressed-only models relative to proxy compression. We also report the raw-byte fraction used during training (20% of examples) and include gradient-norm plots in the appendix confirming non-vanishing signal on the raw pathway throughout training. revision: yes
Referee: [Experiments] Table 2 and Figure 4 report efficiency gains and scaling crossovers, but lack exact baseline compute budgets, data exclusion rules, number of random seeds, and statistical significance tests, leaving the strength of the 'significantly outperforms' and 'eventually match or surpass' claims difficult to assess.

Authors: We appreciate the call for greater reproducibility. The revised version now specifies exact compute budgets in FLOPs for every baseline, clarifies that data exclusion followed only standard deduplication with no additional filtering, reports all main results over three random seeds with standard deviations, and adds paired t-test p-values confirming statistical significance of the reported gains. revision: yes
Referee: [Method] The method description does not specify how the loss is balanced between raw and compressed views or whether the raw-byte pathway receives sufficient gradient signal when compressed examples dominate training.

Authors: We have expanded Section 3 to describe the loss as a weighted sum with fixed coefficients 0.2 (raw) and 0.8 (compressed). We further include gradient-norm analysis demonstrating that the raw pathway maintains stable gradient magnitudes even when compressed examples constitute 80% of each batch, owing to parameter sharing across the two input formats. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on experiments

full rationale

The paper presents proxy compression as a joint-training procedure on raw-byte and externally compressed sequences, with the alignment and transfer claims validated through scaling experiments on code modeling. No equations, fitted parameters, or self-citations are shown that reduce reported efficiency gains or raw-byte performance to quantities defined by construction within the paper. The method uses standard optimizers and external compressors; outcomes are measured against fixed-compute baselines rather than derived tautologically from internal definitions. This is self-contained empirical work with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that joint training produces transferable alignment between compressed and raw sequences; no free parameters or invented entities are introduced beyond standard language modeling hyperparameters.

axioms (1)

domain assumption Joint training on multiple input views produces internal alignment that transfers to single-view inference
Invoked to justify discarding compressed inputs at inference time while retaining efficiency gains.

pith-pipeline@v0.9.0 · 5484 in / 1114 out tokens · 42752 ms · 2026-05-16T08:00:32.518982+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes.
IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
cs.CL 2026-05 conditional novelty 7.0

Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
Efficient Pre-Training with Token Superposition
cs.CL 2026-05 unverdicted novelty 6.0

Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
cs.CL 2026-04 unverdicted novelty 6.0

Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
Efficient Pre-Training with Token Superposition
cs.CL 2026-05 unverdicted novelty 5.0

Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.
Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
cs.CL 2026-04 unverdicted novelty 5.0

Subword tokenization's main benefits arise from higher sample throughput and the use of subword boundaries as explicit priors or inductive biases, isolated via controlled byte-level simulations.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

emnlp-main.614/

URL https://aclanthology.org/2023. emnlp-main.614/. Ahia, O., Kumar, S., Gonen, H., Hoffman, V ., Limisiewicz, T., Tsvetkov, Y ., and Smith, N. A. Magnet: Improv- ing the multilingual fairness of language models with adaptive gradient-based tokenization.arXiv preprint arXiv:2407.08818, 2024. Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Char...

work page arXiv 2023
[2]

URL https://openreview.net/forum? id=PEpbUobfJv. Cao, K. and Rimell, L. You should evaluate your language model on marginal likelihood over tokeni- sations. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,

work page 2021
[3]

Evaluating Large Language Models Trained on Code

URL https://aclanthology.org/2021. emnlp-main.161/. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. Cheng, J., Liu, Y ., Zhang, X., Fei, Y ., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X.,...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

The Llama 3 Herd of Models

URL https://openreview.net/forum? id=jznbgiynus. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long and Sh...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 2019
[5]

emnlp-industry.58/

URL https://aclanthology.org/2023. emnlp-industry.58/. Geh, R., Zhang, H., Ahmed, K., Wang, B., and Van Den Broeck, G. Where is the signal in tokenization space? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,

work page 2023
[6]

Better & Faster Large Language Models via Multi-token Prediction

URL https://aclanthology.org/2024. emnlp-main.230/. Geh, R., Shao, Z., and Van Den Broeck, G. Adversarial tok- enization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. URLhttps://aclanthology. org/2025.acl-long.1012/. Geng, S., Ranchin, N., Yao, Y ., Peyrard, M., Wendler, C., Gastp...

work page internal anchor Pith review arXiv 2024
[7]

Hayase, J., Liu, A., Choi, Y ., Oh, S., and Smith, N

URL https://proceedings.mlr.press/ v162/hawthorne22a.html. Hayase, J., Liu, A., Choi, Y ., Oh, S., and Smith, N. A. Data mixture inference: What do bpe tokenizers reveal about their training data?arXiv preprint arXiv:2407.16607, 2024. Hayase, J., Liu, A., Smith, N. A., and Oh, S. Sampling from your language model one byte at a time.arXiv preprint arXiv:25...

work page arXiv 2024
[8]

low-resource

URL https://openreview.net/forum? id=rygGQyrFvH. Horton, M., Mehta, S., Farhadi, A., and Rastegari, M. Bytes are all you need: Transformers operating di- rectly on file bytes.Transactions on Machine Learn- ing Research, 2024. ISSN 2835-8856. URL https: //openreview.net/forum?id=RkaqxxAOfN. Huang, H., Zhu, D., Wu, B., Zeng, Y ., Wang, Y ., Min, Q., and Xun...

work page doi:10.20944/preprints202503 2024
[9]

Lester, B., Lee, J., Alemi, A., Pennington, J., Roberts, A., Sohl-Dickstein, J., and Constant, N

URL https://proceedings.mlr.press/ v202/lee23g.html. Lester, B., Lee, J., Alemi, A., Pennington, J., Roberts, A., Sohl-Dickstein, J., and Constant, N. Training llms over neurally compressed text.arXiv preprint arXiv:2404.03626, 2024. Li, J., Zhao, W. X., Nie, J.-Y ., and Wen, J.-R. Glyphdiffu- sion: Text generation as image generation.arXiv preprint arXiv...

work page arXiv 2024
[10]

Minixhofer, T

URL https://openreview.net/forum? id=lcDRvffeNP. Liu, J., Xia, C. S., Wang, Y ., and ZHANG, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. InThirty- seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/ forum?id=1qvx610Cu7. Loshchilov, I. and H...

work page arXiv 2023
[11]

Xenia Schmalz, Eva Marinus, Max Coltheart, and Anne Castles

URL https://aclanthology.org/2023. emnlp-main.854. Schmidt, C. W., Reddy, V ., Zhang, H., Alameddine, A., Uzan, O., Pinter, Y ., and Tanner, C. Tokenization is more than compression.arXiv preprint arXiv:2402.18376, 2024. Schmidt, C. W., Reddy, V ., Tanner, C., and Pinter, Y . Bound- less byte pair encoding: Breaking the pre-tokenization barrier. InSecond ...

work page arXiv 2023
[12]

13 Proxy Compression for Language Modeling Schuster, M

URL https://openreview.net/forum? id=oPAjXGV8qQ. 13 Proxy Compression for Language Modeling Schuster, M. and Nakajima, K. Japanese and korean voice search. In2012 IEEE international conference on acous- tics, speech and signal processing (ICASSP), 2012. Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In...

work page arXiv 2012
[13]

Videau, M., Idrissi, B

URL https://blog.vllm.ai/2025/10/ 22/agent-lightning.html. Videau, M., Idrissi, B. Y ., Leite, A., Schoenauer, M., Teytaud, O., and Lopez-Paz, D. From bytes to ideas: Language modeling with autoregressive u-nets.arXiv preprint arXiv:2506.14761, 2025. Vieira, T., LeBrun, B., Giulianelli, M., Gastaldi, J. L., DuSell, B., Terilla, J., O’Donnell, T. J., and C...

work page arXiv 2025
[14]

DeepSeek-OCR: Contexts Optical Compression

URL https://aclanthology.org/2023. acl-long.773. Wei, H., Sun, Y ., and Li, Y . Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., et al. Small-scale proxies for large-scale transformer training instabilitie...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M

URL https://openreview.net/forum? id=gH4BRa4ZP3. Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M. MEGABYTE: Predicting million-byte sequences with multiscale transformers. InThirty-seventh Conference on Neural Information Processing Systems,

work page
[16]

Zheng, B

URL https://openreview.net/forum? id=JTmO2V9Xpz. Zheng, B. S., Liu, A., Ahia, O., Hayase, J., Choi, Y ., and Smith, N. A. Broken tokens? your language model can secretly handle non-canonical tokenizations.arXiv preprint arXiv:2506.19004, 2025a. Zheng, L., Yuan, J., Wang, C., and Kong, L. Effi- cient attention via control variates. InThe Eleventh Internati...

work page arXiv
[17]

superword

URL https://openreview.net/forum? id=G-uNfHKrj46. Zheng, L., Zhao, X., Wang, G., Wu, C., Dong, D., Wang, A., Wang, M., Du, Y ., Bo, H., Sharma, A., Li, B., Zhang, K., Hu, C., Thakker, U., and Kong, L. Evabyte: Efficient byte- level language models at scale, 2025b. URL https: //hkunlp.github.io/blog/2025/evabyte. Zhu, T., Liu, Q., Wang, H., Chen, S., Gu, X...

work page arXiv 2025
[18]

Entropy jumps: positions where the finite difference ∆ht =h t −h t−1 exceeds a monotonicity threshold, indicating sudden changes in predictability. Similar entropy-based criteria also appear in BLTs (Pagnoni et al., 2024) for dynamic byte patchification within the model architecture; here we use them only for segmenting inputs for parallel arithmetic codi...

work page 2024
[19]

Run a forward pass of the compressor model to obtain next-byte distributions (on GPU)

work page
[20]

If the bitstream for the current window exceeds τ bits, emit the first τ bits, discard the consumed byte context, and return to step 1 with the truncated context

Perform arithmetic coding and count output bits in the resulting compressed bitstream (on CPU). If the bitstream for the current window exceeds τ bits, emit the first τ bits, discard the consumed byte context, and return to step 1 with the truncated context. We design a pipelined implementation to overlap GPU forward passes with CPU encoding across iterat...

work page
[21]

Reads a shard of the corpus

work page
[22]

Applies entropy-based segmentation (on GPUs)

work page
[23]

Compresses segments with arithmetic coding, equal-info windows (Lester et al., 2024), and cache lookup (GPU/CPU pipelining)

work page 2024
[24]

Packs the resulting compressed bitstream into fixed-bit symbols

work page
[25]

fuzziness

Writes segmentation metadata and compressed sequences. At training time, the proxy compressor simply reads the pre-computed compressed data and presents them to the mixed- representation training pipeline (§2.1). Our pipeline design improves efficiency significantly: we process ∼3.3TB of pretraining data at 0.57 GB/hour per process, compared to 0.005 GB/h...

work page 2024
[26]

Maximum scale 61 0.49 59.00 17.60

work page
[27]

Minimum scale 2 0.56 64.00 28.00

work page
[28]

Highest LCP 2 0.87 7.50 0.50

work page
[29]

base_uri

Lowest LCP 4 0.00 10.00 8.15 and function calls with formatting differences. These examples demonstrate that neural compression merges semantically equivalent content while abstracting away superficial formatting noise. Listing 1.Collision examples from the neural compressor. Case 1: Maximum scale (61 variants , 4 shown) [,\n ] [,\n ] [,\n ] [,\n ] Case 2...

work page 2023

[1] [1]

emnlp-main.614/

URL https://aclanthology.org/2023. emnlp-main.614/. Ahia, O., Kumar, S., Gonen, H., Hoffman, V ., Limisiewicz, T., Tsvetkov, Y ., and Smith, N. A. Magnet: Improv- ing the multilingual fairness of language models with adaptive gradient-based tokenization.arXiv preprint arXiv:2407.08818, 2024. Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Char...

work page arXiv 2023

[2] [2]

URL https://openreview.net/forum? id=PEpbUobfJv. Cao, K. and Rimell, L. You should evaluate your language model on marginal likelihood over tokeni- sations. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,

work page 2021

[3] [3]

Evaluating Large Language Models Trained on Code

URL https://aclanthology.org/2021. emnlp-main.161/. Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021. Cheng, J., Liu, Y ., Zhang, X., Fei, Y ., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X.,...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

The Llama 3 Herd of Models

URL https://openreview.net/forum? id=jznbgiynus. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for lan- guage understanding. InProceedings of the 2019 Confer- ence of the North American Chapter of the Association for Computational Linguistics: Human Language Technolo- gies, Volume 1 (Long and Sh...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/n19-1423 2019

[5] [5]

emnlp-industry.58/

URL https://aclanthology.org/2023. emnlp-industry.58/. Geh, R., Zhang, H., Ahmed, K., Wang, B., and Van Den Broeck, G. Where is the signal in tokenization space? InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,

work page 2023

[6] [6]

Better & Faster Large Language Models via Multi-token Prediction

URL https://aclanthology.org/2024. emnlp-main.230/. Geh, R., Shao, Z., and Van Den Broeck, G. Adversarial tok- enization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. URLhttps://aclanthology. org/2025.acl-long.1012/. Geng, S., Ranchin, N., Yao, Y ., Peyrard, M., Wendler, C., Gastp...

work page internal anchor Pith review arXiv 2024

[7] [7]

Hayase, J., Liu, A., Choi, Y ., Oh, S., and Smith, N

URL https://proceedings.mlr.press/ v162/hawthorne22a.html. Hayase, J., Liu, A., Choi, Y ., Oh, S., and Smith, N. A. Data mixture inference: What do bpe tokenizers reveal about their training data?arXiv preprint arXiv:2407.16607, 2024. Hayase, J., Liu, A., Smith, N. A., and Oh, S. Sampling from your language model one byte at a time.arXiv preprint arXiv:25...

work page arXiv 2024

[8] [8]

low-resource

URL https://openreview.net/forum? id=rygGQyrFvH. Horton, M., Mehta, S., Farhadi, A., and Rastegari, M. Bytes are all you need: Transformers operating di- rectly on file bytes.Transactions on Machine Learn- ing Research, 2024. ISSN 2835-8856. URL https: //openreview.net/forum?id=RkaqxxAOfN. Huang, H., Zhu, D., Wu, B., Zeng, Y ., Wang, Y ., Min, Q., and Xun...

work page doi:10.20944/preprints202503 2024

[9] [9]

Lester, B., Lee, J., Alemi, A., Pennington, J., Roberts, A., Sohl-Dickstein, J., and Constant, N

URL https://proceedings.mlr.press/ v202/lee23g.html. Lester, B., Lee, J., Alemi, A., Pennington, J., Roberts, A., Sohl-Dickstein, J., and Constant, N. Training llms over neurally compressed text.arXiv preprint arXiv:2404.03626, 2024. Li, J., Zhao, W. X., Nie, J.-Y ., and Wen, J.-R. Glyphdiffu- sion: Text generation as image generation.arXiv preprint arXiv...

work page arXiv 2024

[10] [10]

Minixhofer, T

URL https://openreview.net/forum? id=lcDRvffeNP. Liu, J., Xia, C. S., Wang, Y ., and ZHANG, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. InThirty- seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/ forum?id=1qvx610Cu7. Loshchilov, I. and H...

work page arXiv 2023

[11] [11]

Xenia Schmalz, Eva Marinus, Max Coltheart, and Anne Castles

URL https://aclanthology.org/2023. emnlp-main.854. Schmidt, C. W., Reddy, V ., Zhang, H., Alameddine, A., Uzan, O., Pinter, Y ., and Tanner, C. Tokenization is more than compression.arXiv preprint arXiv:2402.18376, 2024. Schmidt, C. W., Reddy, V ., Tanner, C., and Pinter, Y . Bound- less byte pair encoding: Breaking the pre-tokenization barrier. InSecond ...

work page arXiv 2023

[12] [12]

13 Proxy Compression for Language Modeling Schuster, M

URL https://openreview.net/forum? id=oPAjXGV8qQ. 13 Proxy Compression for Language Modeling Schuster, M. and Nakajima, K. Japanese and korean voice search. In2012 IEEE international conference on acous- tics, speech and signal processing (ICASSP), 2012. Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In...

work page arXiv 2012

[13] [13]

Videau, M., Idrissi, B

URL https://blog.vllm.ai/2025/10/ 22/agent-lightning.html. Videau, M., Idrissi, B. Y ., Leite, A., Schoenauer, M., Teytaud, O., and Lopez-Paz, D. From bytes to ideas: Language modeling with autoregressive u-nets.arXiv preprint arXiv:2506.14761, 2025. Vieira, T., LeBrun, B., Giulianelli, M., Gastaldi, J. L., DuSell, B., Terilla, J., O’Donnell, T. J., and C...

work page arXiv 2025

[14] [14]

DeepSeek-OCR: Contexts Optical Compression

URL https://aclanthology.org/2023. acl-long.773. Wei, H., Sun, Y ., and Li, Y . Deepseek-ocr: Contexts optical compression.arXiv preprint arXiv:2510.18234, 2025. Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., et al. Small-scale proxies for large-scale transformer training instabilitie...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M

URL https://openreview.net/forum? id=gH4BRa4ZP3. Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M. MEGABYTE: Predicting million-byte sequences with multiscale transformers. InThirty-seventh Conference on Neural Information Processing Systems,

work page

[16] [16]

Zheng, B

URL https://openreview.net/forum? id=JTmO2V9Xpz. Zheng, B. S., Liu, A., Ahia, O., Hayase, J., Choi, Y ., and Smith, N. A. Broken tokens? your language model can secretly handle non-canonical tokenizations.arXiv preprint arXiv:2506.19004, 2025a. Zheng, L., Yuan, J., Wang, C., and Kong, L. Effi- cient attention via control variates. InThe Eleventh Internati...

work page arXiv

[17] [17]

superword

URL https://openreview.net/forum? id=G-uNfHKrj46. Zheng, L., Zhao, X., Wang, G., Wu, C., Dong, D., Wang, A., Wang, M., Du, Y ., Bo, H., Sharma, A., Li, B., Zhang, K., Hu, C., Thakker, U., and Kong, L. Evabyte: Efficient byte- level language models at scale, 2025b. URL https: //hkunlp.github.io/blog/2025/evabyte. Zhu, T., Liu, Q., Wang, H., Chen, S., Gu, X...

work page arXiv 2025

[18] [18]

Entropy jumps: positions where the finite difference ∆ht =h t −h t−1 exceeds a monotonicity threshold, indicating sudden changes in predictability. Similar entropy-based criteria also appear in BLTs (Pagnoni et al., 2024) for dynamic byte patchification within the model architecture; here we use them only for segmenting inputs for parallel arithmetic codi...

work page 2024

[19] [19]

Run a forward pass of the compressor model to obtain next-byte distributions (on GPU)

work page

[20] [20]

If the bitstream for the current window exceeds τ bits, emit the first τ bits, discard the consumed byte context, and return to step 1 with the truncated context

Perform arithmetic coding and count output bits in the resulting compressed bitstream (on CPU). If the bitstream for the current window exceeds τ bits, emit the first τ bits, discard the consumed byte context, and return to step 1 with the truncated context. We design a pipelined implementation to overlap GPU forward passes with CPU encoding across iterat...

work page

[21] [21]

Reads a shard of the corpus

work page

[22] [22]

Applies entropy-based segmentation (on GPUs)

work page

[23] [23]

Compresses segments with arithmetic coding, equal-info windows (Lester et al., 2024), and cache lookup (GPU/CPU pipelining)

work page 2024

[24] [24]

Packs the resulting compressed bitstream into fixed-bit symbols

work page

[25] [25]

fuzziness

Writes segmentation metadata and compressed sequences. At training time, the proxy compressor simply reads the pre-computed compressed data and presents them to the mixed- representation training pipeline (§2.1). Our pipeline design improves efficiency significantly: we process ∼3.3TB of pretraining data at 0.57 GB/hour per process, compared to 0.005 GB/h...

work page 2024

[26] [26]

Maximum scale 61 0.49 59.00 17.60

work page

[27] [27]

Minimum scale 2 0.56 64.00 28.00

work page

[28] [28]

Highest LCP 2 0.87 7.50 0.50

work page

[29] [29]

base_uri

Lowest LCP 4 0.00 10.00 8.15 and function calls with formatting differences. These examples demonstrate that neural compression merges semantically equivalent content while abstracting away superficial formatting noise. Listing 1.Collision examples from the neural compressor. Case 1: Maximum scale (61 variants , 4 shown) [,\n ] [,\n ] [,\n ] [,\n ] Case 2...

work page 2023