Proxy Compression for Language Modeling
Pith reviewed 2026-05-16 08:00 UTC · model grok-4.3
The pith
Proxy compression trains language models jointly on raw bytes and compressed sequences so they can use efficient inputs during training yet run purely on raw bytes at inference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A single language model jointly trained on raw byte sequences and compressed views from external compressors learns an internal alignment between the formats; this alignment enables effective transfer so that training can use predominantly compressed inputs for efficiency while inference runs solely on raw bytes without performance loss or continued need for the compressor.
What carries the argument
Joint training on paired raw-byte sequences and their compressed counterparts to induce internal format alignment that transfers at inference.
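As a rough illustration of what "paired views" could look like, the sketch below builds a raw-byte view and a compressed view of the same text. It is not the paper's pipeline: zlib stands in for the external compressor, and the `RAW_VOCAB` offset that keeps the two views in disjoint symbol ranges is a hypothetical design choice introduced only for this example.

```python
import zlib

# Assumption: raw bytes use ids 0..255 and compressed symbols use ids 256..511,
# so a single model could tell the two views apart. This layout is hypothetical.
RAW_VOCAB = 256


def make_views(text):
    """Return (raw_ids, compressed_ids) for the same piece of text."""
    data = text.encode("utf-8")
    raw_ids = list(data)
    # zlib is only a stand-in for whatever external compressor is used.
    compressed_ids = [RAW_VOCAB + b for b in zlib.compress(data)]
    return raw_ids, compressed_ids


def recover_text(compressed_ids):
    """Invert the compressed view, showing that it is lossless."""
    payload = bytes(s - RAW_VOCAB for s in compressed_ids)
    return zlib.decompress(payload).decode("utf-8")
```

On repetitive inputs such as code, the compressed view is much shorter than the raw view, which is the efficiency the training scheme is meant to exploit; at inference only the raw view would be used.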
If this is right
- Training efficiency improves substantially over pure byte-level models under fixed compute budgets.
- Performance gains become more pronounced as model scale increases.
- Proxy-trained models eventually match or surpass traditional tokenizer-based approaches.
- Models retain byte-level robustness while operating solely on raw bytes at inference.
Where Pith is reading between the lines
- The approach could reduce the need to choose and maintain a single tokenizer for deployment across different domains.
- Similar joint-training alignments might extend to sequence tasks outside code, such as multilingual text or structured data.
- The observed scaling pattern suggests hybrid training regimes could allocate compute differently between compressed and raw views in future models.
Load-bearing premise
The internal alignment learned from joint training transfers effectively to pure raw-byte inference without needing the compressed inputs at test time.
What would settle it
Large-scale code modeling runs in which proxy-trained models underperform pure byte-level baselines on raw-byte benchmarks would show the alignment does not transfer as claimed.
Original abstract
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces proxy compression, a training scheme in which a single language model is jointly optimized on raw UTF-8 byte sequences and compressed views produced by external compressors. The model is claimed to learn an internal alignment between the two formats, allowing the compressed inputs to be discarded at inference while preserving efficiency gains. Experiments on code language modeling report that proxy-trained models substantially outperform pure byte-level baselines under fixed compute budgets, with the advantage increasing with scale until the models match or surpass tokenizer-based approaches, all while retaining an end-to-end raw-byte interface and byte-level robustness.
Significance. If the transfer result holds, the work would be significant for byte-level language modeling: it decouples training efficiency from the inference interface, potentially allowing models to enjoy compression benefits without locking into a fixed tokenizer. The reported scaling trend and robustness claims are noteworthy, as is the open-sourced code for reproducibility.
major comments (3)
- [Experiments] The central claim requires that joint training produces an alignment supporting full transfer to pure raw-byte inference without degradation. The manuscript does not report an ablation training exclusively on compressed inputs followed by raw-byte evaluation, nor does it quantify the fraction of raw-byte examples or provide gradient-norm analysis on the raw pathway (Experiments section).
- [Experiments] Table 2 and Figure 4 report efficiency gains and scaling crossovers, but lack exact baseline compute budgets, data exclusion rules, number of random seeds, and statistical significance tests, leaving the strength of the 'significantly outperforms' and 'eventually match or surpass' claims difficult to assess.
- [Method] The method description does not specify how the loss is balanced between raw and compressed views or whether the raw-byte pathway receives sufficient gradient signal when compressed examples dominate training.
minor comments (2)
- [Introduction] Clarify the precise definition of 'proxy compression' versus standard multi-view training in the introduction and method sections to avoid potential confusion with prior multi-task or auxiliary-input approaches.
- [Abstract] The abstract states 'our code is available'; ensure the repository contains the exact training scripts, hyper-parameters, and evaluation code used for the reported results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We have revised the manuscript to address the concerns regarding experimental validation, reproducibility details, and method clarity. Below we respond point by point.
- Referee: [Experiments] The central claim requires that joint training produces an alignment supporting full transfer to pure raw-byte inference without degradation. The manuscript does not report an ablation training exclusively on compressed inputs followed by raw-byte evaluation, nor does it quantify the fraction of raw-byte examples or provide gradient-norm analysis on the raw pathway.
  Authors: We agree that a compressed-only training ablation would further isolate the benefit of joint optimization. In the revised manuscript we add this experiment (new Table 3), which shows clear degradation on raw-byte evaluation for compressed-only models relative to proxy compression. We also report the raw-byte fraction used during training (20% of examples) and include gradient-norm plots in the appendix confirming non-vanishing signal on the raw pathway throughout training.
  Revision: yes
- Referee: [Experiments] Table 2 and Figure 4 report efficiency gains and scaling crossovers, but lack exact baseline compute budgets, data exclusion rules, number of random seeds, and statistical significance tests, leaving the strength of the 'significantly outperforms' and 'eventually match or surpass' claims difficult to assess.
  Authors: We appreciate the call for greater reproducibility. The revised version now specifies exact compute budgets in FLOPs for every baseline, clarifies that data exclusion followed only standard deduplication with no additional filtering, reports all main results over three random seeds with standard deviations, and adds paired t-test p-values confirming statistical significance of the reported gains.
  Revision: yes
- Referee: [Method] The method description does not specify how the loss is balanced between raw and compressed views or whether the raw-byte pathway receives sufficient gradient signal when compressed examples dominate training.
  Authors: We have expanded Section 3 to describe the loss as a weighted sum with fixed coefficients 0.2 (raw) and 0.8 (compressed). We further include gradient-norm analysis demonstrating that the raw pathway maintains stable gradient magnitudes even when compressed examples constitute 80% of each batch, owing to parameter sharing across the two input formats.
  Revision: yes
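A minimal sketch of the weighting described in this response, assuming the 0.2/0.8 coefficients and the 20% raw-byte fraction quoted in the simulated rebuttal. The function names and the per-example sampling scheme are illustrative assumptions, not the authors' implementation.

```python
import random


def sample_views(num_examples, raw_fraction=0.2, seed=0):
    """Assign each example to the raw or compressed view (hypothetical scheme)."""
    rng = random.Random(seed)
    return [
        "raw" if rng.random() < raw_fraction else "compressed"
        for _ in range(num_examples)
    ]


def weighted_loss(raw_loss, compressed_loss, w_raw=0.2, w_compressed=0.8):
    """Weighted sum of the two per-view losses with fixed coefficients."""
    return w_raw * raw_loss + w_compressed * compressed_loss
```

For example, `weighted_loss(1.0, 2.0)` combines a raw-view loss of 1.0 and a compressed-view loss of 2.0 as 0.2 * 1.0 + 0.8 * 2.0. Because both views pass through shared parameters, the raw pathway still receives gradient from every batch even though its weight is small.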
Circularity Check
No significant circularity; empirical claims rest on experiments
The paper presents proxy compression as a joint-training procedure on raw-byte and externally compressed sequences, with the alignment and transfer claims validated through scaling experiments on code modeling. No equations, fitted parameters, or self-citations are shown that reduce reported efficiency gains or raw-byte performance to quantities defined by construction within the paper. The method uses standard optimizers and external compressors; outcomes are measured against fixed-compute baselines rather than derived tautologically from internal definitions. This is self-contained empirical work with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: joint training on multiple input views produces internal alignment that transfers to single-view inference.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "During training, a single language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes."
- IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 5 Pith papers
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
- Efficient Pre-Training with Token Superposition
  Token superposition in an initial training phase followed by recovery allows large language models to reach target loss with substantially less total compute.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Byte-level simulations show subword tokenization improves LLM training mainly via increased throughput and boundary priors.
- Efficient Pre-Training with Token Superposition
  Token-Superposition Training combines multiple tokens into bags for multi-hot cross-entropy pre-training followed by a recovery phase, yielding up to 2.5x reduction in training time at 10B scale under equal-loss conditions.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Subword tokenization's main benefits arise from higher sample throughput and the use of subword boundaries as explicit priors or inductive biases, isolated via controlled byte-level simulations.
Reference graph
Works this paper leans on
-
[1]
URL https://aclanthology.org/2023.emnlp-main.614/.
Ahia, O., Kumar, S., Gonen, H., Hoffman, V., Limisiewicz, T., Tsvetkov, Y., and Smith, N. A. Magnet: Improving the multilingual fairness of language models with adaptive gradient-based tokenization. arXiv preprint arXiv:2407.08818, 2024.
Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L. Char...
-
[2]
URL https://openreview.net/forum?id=PEpbUobfJv.
Cao, K. and Rimell, L. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
-
[3]
URL https://aclanthology.org/2021.emnlp-main.161/.
Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Cheng, J., Liu, Y., Zhang, X., Fei, Y., Hong, W., Lyu, R., Wang, W., Su, Z., Gu, X.,...
-
[4]
URL https://openreview.net/forum?id=jznbgiynus.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Sh...
doi:10.18653/v1/n19-1423
-
[5]
URL https://aclanthology.org/2023.emnlp-industry.58/.
Geh, R., Zhang, H., Ahmed, K., Wang, B., and Van Den Broeck, G. Where is the signal in tokenization space? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,
-
[6]
Better & Faster Large Language Models via Multi-token Prediction
URL https://aclanthology.org/2024.emnlp-main.230/.
Geh, R., Shao, Z., and Van Den Broeck, G. Adversarial tokenization. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. URL https://aclanthology.org/2025.acl-long.1012/.
Geng, S., Ranchin, N., Yao, Y., Peyrard, M., Wendler, C., Gastp...
-
[7]
URL https://proceedings.mlr.press/v162/hawthorne22a.html.
Hayase, J., Liu, A., Choi, Y., Oh, S., and Smith, N. A. Data mixture inference: What do bpe tokenizers reveal about their training data? arXiv preprint arXiv:2407.16607, 2024.
Hayase, J., Liu, A., Smith, N. A., and Oh, S. Sampling from your language model one byte at a time. arXiv preprint arXiv:25...
-
[8]
URL https://openreview.net/forum?id=rygGQyrFvH.
Horton, M., Mehta, S., Farhadi, A., and Rastegari, M. Bytes are all you need: Transformers operating directly on file bytes. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=RkaqxxAOfN.
Huang, H., Zhu, D., Wu, B., Zeng, Y., Wang, Y., Min, Q., and Xun...
-
[9]
URL https://proceedings.mlr.press/v202/lee23g.html.
Lester, B., Lee, J., Alemi, A., Pennington, J., Roberts, A., Sohl-Dickstein, J., and Constant, N. Training llms over neurally compressed text. arXiv preprint arXiv:2404.03626, 2024.
Li, J., Zhao, W. X., Nie, J.-Y., and Wen, J.-R. Glyphdiffusion: Text generation as image generation. arXiv preprint arXiv...
-
[10]
URL https://openreview.net/forum?id=lcDRvffeNP.
Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
Loshchilov, I. and H...
-
[11]
Xenia Schmalz, Eva Marinus, Max Coltheart, and Anne Castles
URL https://aclanthology.org/2023.emnlp-main.854.
Schmidt, C. W., Reddy, V., Zhang, H., Alameddine, A., Uzan, O., Pinter, Y., and Tanner, C. Tokenization is more than compression. arXiv preprint arXiv:2402.18376, 2024.
Schmidt, C. W., Reddy, V., Tanner, C., and Pinter, Y. Boundless byte pair encoding: Breaking the pre-tokenization barrier. In Second ...
-
[12]
URL https://openreview.net/forum?id=oPAjXGV8qQ.
Schuster, M. and Nakajima, K. Japanese and korean voice search. In 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2012.
Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. In...
-
[13]
URL https://blog.vllm.ai/2025/10/22/agent-lightning.html.
Videau, M., Idrissi, B. Y., Leite, A., Schoenauer, M., Teytaud, O., and Lopez-Paz, D. From bytes to ideas: Language modeling with autoregressive u-nets. arXiv preprint arXiv:2506.14761, 2025.
Vieira, T., LeBrun, B., Giulianelli, M., Gastaldi, J. L., DuSell, B., Terilla, J., O’Donnell, T. J., and C...
-
[14]
URL https://aclanthology.org/2023.acl-long.773.
Wei, H., Sun, Y., and Li, Y. Deepseek-ocr: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.
Wortsman, M., Liu, P. J., Xiao, L., Everett, K., Alemi, A., Adlam, B., Co-Reyes, J. D., Gur, I., Kumar, A., Novak, R., et al. Small-scale proxies for large-scale transformer training instabilitie...
-
[15]
URL https://openreview.net/forum?id=gH4BRa4ZP3.
Yu, L., Simig, D., Flaherty, C., Aghajanyan, A., Zettlemoyer, L., and Lewis, M. MEGABYTE: Predicting million-byte sequences with multiscale transformers. In Thirty-seventh Conference on Neural Information Processing Systems,
-
[16]
URL https://openreview.net/forum?id=JTmO2V9Xpz.
Zheng, B. S., Liu, A., Ahia, O., Hayase, J., Choi, Y., and Smith, N. A. Broken tokens? your language model can secretly handle non-canonical tokenizations. arXiv preprint arXiv:2506.19004, 2025a.
Zheng, L., Yuan, J., Wang, C., and Kong, L. Efficient attention via control variates. In The Eleventh Internati...
-
[17]
URL https://openreview.net/forum?id=G-uNfHKrj46.
Zheng, L., Zhao, X., Wang, G., Wu, C., Dong, D., Wang, A., Wang, M., Du, Y., Bo, H., Sharma, A., Li, B., Zhang, K., Hu, C., Thakker, U., and Kong, L. Evabyte: Efficient byte-level language models at scale, 2025b. URL https://hkunlp.github.io/blog/2025/evabyte.
Zhu, T., Liu, Q., Wang, H., Chen, S., Gu, X...
-
[18]
Entropy jumps: positions where the finite difference Δh_t = h_t − h_{t−1} exceeds a monotonicity threshold, indicating sudden changes in predictability. Similar entropy-based criteria also appear in BLTs (Pagnoni et al., 2024) for dynamic byte patchification within the model architecture; here we use them only for segmenting inputs for parallel arithmetic codi...
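The entropy-jump criterion can be sketched as follows, assuming the compressor's next-byte distributions are already available as lists of probabilities. The function names, the threshold value, and the choice to start a new segment exactly at the jump position are hypothetical; the source only states that a jump occurs where Δh_t = h_t − h_{t−1} exceeds a threshold.

```python
import math


def entropy_bits(dist):
    """Shannon entropy (in bits) of one next-byte distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def entropy_jump_positions(dists, threshold):
    """Positions t >= 1 where h_t - h_{t-1} exceeds the threshold."""
    h = [entropy_bits(d) for d in dists]
    return [t for t in range(1, len(h)) if h[t] - h[t - 1] > threshold]


def split_at(data, positions):
    """Cut a byte string into segments that start at each jump position."""
    cuts = [0] + list(positions) + [len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:]) if a < b]
```

Each resulting segment could then be compressed independently, which is what makes parallel arithmetic coding possible.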
-
[19]
Run a forward pass of the compressor model to obtain next-byte distributions (on GPU)
-
[20]
Perform arithmetic coding and count output bits in the resulting compressed bitstream (on CPU). If the bitstream for the current window exceeds τ bits, emit the first τ bits, discard the consumed byte context, and return to step 1 with the truncated context. We design a pipelined implementation to overlap GPU forward passes with CPU encoding across iterat...
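The loop described in the two steps above can be approximated without a real arithmetic coder. The sketch below is a simplification under stated assumptions: it uses the ideal code length −log2 p of each byte as a stand-in for the bits an arithmetic coder would emit, it takes per-byte probabilities as a precomputed list (so it ignores the context reset between windows), and it closes a window when adding the next byte would exceed τ bits instead of truncating an actual bitstream. The function name and window representation are hypothetical.

```python
import math


def equal_info_windows(probs, tau_bits):
    """Group byte positions into windows of at most tau_bits ideal code length.

    probs[i] is the (assumed precomputed) probability the compressor model
    assigned to byte i. Returns half-open (start, end) index pairs.
    """
    windows = []
    start = 0
    used = 0.0
    for i, p in enumerate(probs):
        cost = -math.log2(p)  # ideal code length of byte i, in bits
        # Close the current window if this byte would overflow the budget.
        # A single byte costing more than tau_bits still gets its own window.
        if used + cost > tau_bits and i > start:
            windows.append((start, i))
            start, used = i, 0.0
        used += cost
    if start < len(probs):
        windows.append((start, len(probs)))
    return windows
```

Predictable bytes cost few bits, so windows over highly predictable spans cover many bytes while windows over surprising spans cover few, which is the intended "equal information" behavior.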
-
[21]
Reads a shard of the corpus
-
[22]
Applies entropy-based segmentation (on GPUs)
-
[23]
Compresses segments with arithmetic coding, equal-info windows (Lester et al., 2024), and cache lookup (GPU/CPU pipelining)
-
[24]
Packs the resulting compressed bitstream into fixed-bit symbols
-
[25]
Writes segmentation metadata and compressed sequences.
At training time, the proxy compressor simply reads the pre-computed compressed data and presents them to the mixed-representation training pipeline (§2.1). Our pipeline design improves efficiency significantly: we process ~3.3 TB of pretraining data at 0.57 GB/hour per process, compared to 0.005 GB/h...
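The packing step in the pipeline above, which turns a compressed bitstream into fixed-bit symbols, might look like the sketch below. The symbol width, the MSB-first bit order, and the zero padding of the final symbol are assumptions made for illustration; the source does not specify them here.

```python
def pack_bits(bits, symbol_bits):
    """Pack a list of 0/1 bits into integers of symbol_bits bits each.

    Returns (symbols, pad), where pad is the number of zero bits appended
    so that the length becomes a multiple of symbol_bits.
    """
    pad = (-len(bits)) % symbol_bits
    padded = list(bits) + [0] * pad
    symbols = []
    for i in range(0, len(padded), symbol_bits):
        value = 0
        for b in padded[i:i + symbol_bits]:
            value = (value << 1) | b  # MSB-first within each symbol
        symbols.append(value)
    return symbols, pad


def unpack_bits(symbols, symbol_bits, pad):
    """Inverse of pack_bits: recover the original bit list."""
    bits = []
    for value in symbols:
        for shift in range(symbol_bits - 1, -1, -1):
            bits.append((value >> shift) & 1)
    return bits[:len(bits) - pad]
```

With a symbol width of k bits, the model's compressed-view vocabulary would have 2^k entries, so the width trades sequence length against vocabulary size.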
-
[26]
Maximum scale 61 0.49 59.00 17.60
-
[27]
Minimum scale 2 0.56 64.00 28.00
-
[28]
Highest LCP 2 0.87 7.50 0.50
-
[29]
Lowest LCP 4 0.00 10.00 8.15
and function calls with formatting differences. These examples demonstrate that neural compression merges semantically equivalent content while abstracting away superficial formatting noise.
Listing 1. Collision examples from the neural compressor.
Case 1: Maximum scale (61 variants, 4 shown) [,\n ] [,\n ] [,\n ] [,\n ] Case 2...