Proxy Compression for Language Modeling
Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, in effect an external lossless compressor typically applied to UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, a single language model is jointly trained on raw byte sequences and on compressed views generated by external compressors; through this process, the model learns to internally align compressed sequences with raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs that are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines under fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or surpass tokenizer-based approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling. Our code is available at https://github.com/LZhengisme/proxy-compression.
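The abstract describes joint training on raw-byte sequences and compressed views of the same data. The sketch below illustrates how such mixed training batches might be assembled; the choice of zlib as the external compressor, the view-tag ids, the helper names (`views_of`, `make_training_batch`), and the 90/10 mixing ratio are all assumptions made for illustration, not details taken from the paper.

```python
# Minimal, illustrative sketch of proxy-compression batch construction.
# Assumptions not taken from the paper: zlib as the external lossless
# compressor, the view-tag ids, the helper names, and the mixing ratio.
import random
import zlib

RAW, COMPRESSED = 256, 257  # hypothetical view-tag ids outside the 0-255 byte range

def views_of(text: str) -> dict[int, bytes]:
    """Return the raw-byte view and a compressed view of the same text."""
    raw = text.encode("utf-8")
    return {RAW: raw, COMPRESSED: zlib.compress(raw)}

def make_training_batch(texts: list[str], p_compressed: float = 0.9) -> list[list[int]]:
    """Sample one view per example: training is dominated by compressed views,
    with raw-byte views mixed in so the model can align the two formats."""
    batch = []
    for text in texts:
        tag = COMPRESSED if random.random() < p_compressed else RAW
        view = views_of(text)[tag]
        # Each byte becomes one input id; the leading tag id marks the view.
        batch.append([tag] + list(view))
    return batch

# At inference time only raw bytes are fed, so no external compressor is needed:
# ids = [RAW] + list("def f(x): return x".encode("utf-8"))
```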
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
- Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
  Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads that update patch context dynamically.
- Efficient Pre-Training with Token Superposition
  Token superposition in an initial training phase, followed by a recovery phase, allows large language models to reach a target loss with substantially less total compute.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Byte-level simulations show that subword tokenization improves LLM training mainly via increased throughput and boundary priors.
- Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation
  Subword tokenization's main benefits arise from higher sample throughput and the use of subword boundaries as explicit priors or inductive biases, as isolated via controlled byte-level simulations.
Discussion (0)