Language Modeling Is Compression
22 Pith papers cite this work.
abstract
It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.
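The last claim of the abstract, that a general-purpose compressor can act as a conditional generative model, can be illustrated with a minimal sketch. It assumes that a candidate next byte is scored by how many extra bits the compressor needs once that byte is appended to the context, and that the scores are then normalized into a distribution. The function names `compressed_bits` and `next_byte_distribution` are hypothetical, and `zlib` (DEFLATE, the algorithm behind gzip) stands in for the compressor; this is not the paper's implementation.

```python
import zlib
from typing import Dict


def compressed_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed representation of `data`."""
    return 8 * len(zlib.compress(data, 9))


def next_byte_distribution(context: bytes, alphabet: bytes) -> Dict[int, float]:
    """Turn a compressor into a conditional distribution over the next byte.

    Each candidate byte b gets the code length L(context + b). A shorter
    code length means the compressor finds b more predictable, so the
    unnormalized score is 2 ** -(L - L_min), which is then normalized.
    """
    lengths = {b: compressed_bits(context + bytes([b])) for b in alphabet}
    shortest = min(lengths.values())  # subtract the minimum for numerical stability
    scores = {b: 2.0 ** -(length - shortest) for b, length in lengths.items()}
    total = sum(scores.values())
    return {b: score / total for b, score in scores.items()}


if __name__ == "__main__":
    context = b"the cat sat on the mat. the cat sat on the ma"
    dist = next_byte_distribution(context, b"abcdefghijklmnopqrstuvwxyz .")
    best = max(dist, key=dist.get)
    print(chr(best), round(dist[best], 3))
```

Because zlib output lengths change in whole bytes, many candidates tie, and the resulting distribution is coarse; a compressor with finer-grained code lengths would give sharper predictions.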
citing papers explorer
- Effective Context in Transformers: An Analysis of Fragmentation and Tokenization
  Fragmentation strictly raises optimal finite-context log-loss on Markov sources while tokenization can make a short token window equivalent to a longer source window under reliability and compression conditions.
- Are Flat Minima an Illusion?
  Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.
- Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
  Manifold steering along activation geometry induces behavioral trajectories matching the natural manifold of outputs, while linear steering produces off-manifold unnatural behaviors.
- Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
  Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
- Hierarchical SVG Tokenization: Learning Compact Visual Programs for Scalable Vector Graphics Modeling
  HiVG introduces hierarchical SVG tokenization with atomic and segment tokens plus HMN initialization to enable more efficient and stable autoregressive generation of vector graphics programs.
- SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
  SemanticZip is a pilot framework introducing LLM-mediated lossy text compression with an experimental interface evaluating six representation regimes on five diagnostic cases for semantic atom recovery and token efficiency.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Lossless Compression via Chained Lightweight Neural Predictors with Information Inheritance
  A new chain of lightweight neural predictors with information inheritance achieves near state-of-the-art lossless compression ratios while delivering 1.2-6.3x faster encoding and 2.8-12.3x faster decoding than PAC on GPUs.
- Probabilistic Language Tries: A Unified Framework for Compression, Decision Policies, and Execution Reuse
  Probabilistic language tries unify compression, sequential decision making, and inference caching by making explicit the prefix structure of any generative model over sequences.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
  A curated set of one billion personas enables scalable, diverse synthetic data generation for LLM training across reasoning, instructions, knowledge, NPCs, and tools.
- SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
  SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.
- Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
  Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
- Learn-To-Learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
  A hypernetwork produces a condition-dependent beta that meta-gates SwiGLU nonlinearity, giving LLMs adaptive behavior across task, domain, persona and style inputs without finetuning.
- Efficient Learned Data Compression via Dual-Stream Feature Decoupling
  A dual-stream decoupler plus hierarchical refiner and parallel pipeline yields state-of-the-art compression ratio and throughput with lowest reported latency and memory in learned data compression.
- Efficient compression of neural networks and datasets
  Refined probabilistic and smooth l0 pruning techniques approximate minimum description length for neural networks, achieving high compression with minimal accuracy loss and empirically verifying better sample efficiency and generalization on image and text tasks.
- Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression
  Introduces Tree Generation (TG-SFT) to generate synthetic instruction-tuning data from LLMs, reducing catastrophic forgetting when fine-tuning MLLMs on domain-specific or multimodal data.
- Hallucination of Multimodal Large Language Models: A Survey
  The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Interestingness as an Inductive Heuristic for Future Compression Progress
  Interestingness is defined as an inductive signal for future compression progress, with proofs that expected progress decays exponentially with time since last breakthrough and that the Algorithmic Prior yields quadratic gains over the Length Prior.
- The Rhetoric of Machine Learning
  Machine learning is inherently rhetorical and is often deployed as 'manipulation as a service' in business models.
- TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding
  Presents TextEconomizer, a transformer-based encoder-decoder for lossy text compression claiming 5.39x ratio, near-perfect semantic quality via standard metrics, and 153x fewer parameters than comparables.
- HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench