The Sample Complexity of Lossless Data Compression

Ioannis Kontoyiannis; Terence Viaud

arxiv: 2601.06688 · v5 · pith:LGBLR5XLnew · submitted 2026-01-10 · 💻 cs.IT · math.IT· math.ST· stat.TH

The Sample Complexity of Lossless Data Compression

Terence Viaud , Ioannis Kontoyiannis This is my paper

Pith reviewed 2026-05-16 15:13 UTC · model grok-4.3

classification 💻 cs.IT math.ITmath.STstat.TH

keywords sample complexitylossless compressionRényi entropynonasymptotic boundshypothesis testingmemoryless sourcesuniversal compressioninformation theory

0 comments

The pith

Lossless compression sample complexity for memoryless sources is governed by Rényi entropy of order 1/2 instead of Shannon entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines sample complexity as the smallest blocklength allowing compression at a fixed constrained rate while keeping the probability of rate excess below a target. This definition creates a direct bridge to hypothesis-testing problems, transferring their known sample-complexity results to compression. For memoryless sources the minimal blocklength scales according to the source Rényi entropy of order 1/2 rather than its usual entropy. Nonasymptotic bounds with explicit constants are derived from this link. The same order-1/2 Rényi entropy rate governs Markov sources, while universal compression over source families is characterized by the smallest Rényi divergence of order 1/2 to the uniform distribution.

Core claim

The sample complexity of lossless compression is tightly coupled with the sample complexity of prefix-free and fixed-length codes for arbitrary sources. For memoryless sources it is characterized by the Rényi entropy of order 1/2, yielding nonasymptotic bounds with explicit constants. The characterization extends to Markov sources through the Rényi entropy rate of order 1/2 and to universal compression through the minimum Rényi divergence of order 1/2 to the uniform distribution.

What carries the argument

The Rényi entropy of order 1/2, which replaces Shannon entropy in the finite-sample scaling law because the chosen rate and excess-probability definitions transfer error-exponent results from hypothesis testing without mismatch.

If this is right

General variable-length compressors have the same sample complexity as prefix-free and fixed-length codes for any source.
Explicit nonasymptotic upper and lower bounds on sample complexity follow for memoryless sources from the Rényi entropy of order 1/2.
Markov sources obey sample-complexity bounds determined by their Rényi entropy rate of order 1/2.
Universal compression over a family of memoryless sources has sample complexity set by the minimum Rényi divergence of order 1/2 to the uniform distribution.
The same framework connects compression limits to the separation rates of identity testing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Finite-blocklength compression is limited more by tail behavior than by average information content, since order-1/2 Rényi entropy weights unlikely events more heavily.
Identity testing for uniformity may act as a building block for achieving the universal compression rates derived here.
The approach suggests analogous sample-complexity characterizations could be developed for lossy compression or other statistical tasks.
The explicit constants make it possible to calculate concrete blocklength requirements for practical sources and error targets.

Load-bearing premise

The specific definitions of constrained rate and excess-rate probability match those in hypothesis testing closely enough that results transfer directly without added looseness.

What would settle it

For a concrete Bernoulli source, compute the smallest n such that a compressor achieves a target rate with excess-rate probability below a fixed delta; check whether this n scales exactly with the source's Rényi entropy of order 1/2 rather than its Shannon entropy.

read the original abstract

A new framework is introduced for examining and evaluating the fundamental limits of lossless data compression, that emphasizes genuinely non-asymptotic results. The {\em sample complexity} of compressing a given source is defined as the smallest blocklength at which it is possible to compress that source at a specifically constrained rate and to within a specified excess-rate probability. This formulation parallels corresponding developments in statistics and computer science, and it facilitates the use of existing results on the sample complexity of various hypothesis testing problems. For arbitrary sources, the sample complexity of general variable-length compressors is shown to be tightly coupled with the sample complexity of prefix-free codes and fixed-length codes. For memoryless sources, it is shown that the sample complexity is characterized not by the source entropy, but by its R\'{e}nyi entropy of order~$1/2$. Nonasymptotic bounds on the sample complexity are obtained, with explicit constants. Generalizations to Markov sources are established, showing that the sample complexity is determined by the source's R\'{e}nyi entropy rate of order~$1/2$. Finally, bounds on the sample complexity of universal data compression are developed for families of memoryless sources. There, the sample complexity is characterized by the minimum R\'{e}nyi divergence of order~$1/2$ between elements of the family and the uniform distribution. The connection of this problem with identity testing and with the associated separation rates is explored and discussed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a non-asymptotic sample complexity for lossless compression and shows it is governed by Rényi entropy of order 1/2 for memoryless sources, with explicit bounds obtained by reducing to hypothesis-testing results.

read the letter

The main new piece is the definition of sample complexity as the smallest blocklength n such that a code exists with rate at most R and excess-rate probability at most ε. For memoryless sources this quantity is characterized by the Rényi entropy of order 1/2 rather than the usual Shannon entropy, and the same order appears for Markov sources via the Rényi entropy rate. They also give bounds for universal compression over a family of sources, expressed through minimum Rényi divergence of order 1/2 to the uniform distribution. The framework imports existing sample-complexity results from hypothesis testing, which keeps the derivations short and supplies concrete constants that asymptotic statements lack. This is genuinely useful for short-blocklength design questions in communications, where one actually needs to know how large n must be before the excess-rate probability drops below a target. The reduction to hypothesis testing is presented as tight because the chosen definitions of constrained rate and excess-rate probability line up with type-I/II errors. That alignment looks plausible from the abstract, but the variable-length nature of prefix-free codes means the tail of the length random variable does not map one-to-one onto simple hypothesis-testing errors; some slack or a small additive term inside the exponent could appear. If the paper absorbs that slack into the explicit constants or only claims one-sided bounds, the characterization still holds in the stated form. If it claims exact equality with no extra factors, the derivations would need close checking. The work is internally consistent and cites the relevant hypothesis-testing literature without circularity. It is worth sending to a serious referee. A reader interested in finite-length information theory or in importing statistical sample-complexity tools into coding will get concrete value from the explicit bounds and the Rényi-1/2 characterization. I would bring it to a reading group to walk through the variable-length to hypothesis-testing step.

Referee Report

2 major / 2 minor

Summary. The paper introduces a non-asymptotic sample-complexity framework for lossless compression, defined as the minimal blocklength n such that there exists a code achieving rate at most R with excess-rate probability at most ε. It establishes tight couplings among general variable-length, prefix-free, and fixed-length codes for arbitrary sources. For memoryless sources the sample complexity is characterized by the Rényi entropy of order 1/2 (rather than Shannon entropy), with explicit non-asymptotic bounds; the result extends to Markov sources via the Rényi entropy rate of order 1/2 and to universal compression of memoryless families via the minimum Rényi divergence of order 1/2 to the uniform distribution, with links to identity testing.

Significance. If the claimed equivalences hold without hidden looseness, the work supplies a concrete bridge between finite-blocklength compression and hypothesis-testing sample complexity, yielding explicit constants that are absent from classical entropy characterizations. The emphasis on Rényi entropy of order 1/2 correctly captures the tail behavior governing excess-rate events and opens a path to practical non-asymptotic design; the universal-compression and identity-testing connections are particularly useful for statistical applications.

major comments (2)

[Section 4, Theorem 2] Section 4, Theorem 2 (memoryless case): the exact characterization by Rényi entropy of order 1/2 rests on the assertion that the chosen definitions of constrained rate and excess-rate probability permit direct, lossless transfer of sample-complexity results from hypothesis testing. The length random variable of a variable-length prefix-free code has a tail probability that does not map isometrically onto type-I/II error probabilities; a multiplicative factor of 2 or an additive o(1) term inside the exponent typically appears in such reductions. The manuscript must exhibit the explicit mapping (or the inequality chain) that rules out this looseness for the stated constants to be tight.
[Section 3, Theorem 1] Section 3, Theorem 1 (arbitrary sources): the claimed tight coupling between general variable-length compressors, prefix-free codes, and fixed-length codes is stated without visible constants or factors relating their respective sample complexities. If the coupling is only up to a universal multiplicative factor independent of n, R, and ε, this must be stated explicitly; otherwise the subsequent reductions to hypothesis-testing results inherit an undetermined looseness.

minor comments (2)

[Section 5] Notation for the Rényi entropy rate of order 1/2 in the Markov-source section should be introduced with a clear definition (e.g., lim (1/n) H_{1/2}(X^n)) to avoid ambiguity with the ordinary entropy rate.
[Abstract] The abstract claims 'explicit constants' appear in the non-asymptotic bounds; the main text should cross-reference the precise theorems or corollaries where these constants are displayed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight important points about the tightness of our characterizations and couplings. We address each major comment below and will incorporate clarifications and explicit derivations in the revised manuscript.

read point-by-point responses

Referee: [Section 4, Theorem 2] Section 4, Theorem 2 (memoryless case): the exact characterization by Rényi entropy of order 1/2 rests on the assertion that the chosen definitions of constrained rate and excess-rate probability permit direct, lossless transfer of sample-complexity results from hypothesis testing. The length random variable of a variable-length prefix-free code has a tail probability that does not map isometrically onto type-I/II error probabilities; a multiplicative factor of 2 or an additive o(1) term inside the exponent typically appears in such reductions. The manuscript must exhibit the explicit mapping (or the inequality chain) that rules out this looseness for the stated constants to be tight.

Authors: We agree that the explicit mapping must be shown. In the revision we will insert a dedicated lemma (new Lemma 4.1) that derives the precise equivalence: under the paper's definitions of constrained rate R and excess-rate probability ε, the minimal n for which a prefix-free code exists is identical to the minimal n for which a simple hypothesis test between the source and the uniform distribution on the type class achieves type-I error ≤ ε and type-II error ≤ 2^{-n(R - H_{1/2}(P))}. The proof proceeds by exhibiting an explicit bijection between codeword lengths and acceptance regions that preserves the tail probabilities exactly, with no multiplicative factor of 2 and with the o(1) term absorbed into the explicit non-asymptotic bounds already stated in Theorem 2. The inequality chain is therefore equality for the leading term. revision: yes
Referee: [Section 3, Theorem 1] Section 3, Theorem 1 (arbitrary sources): the claimed tight coupling between general variable-length compressors, prefix-free codes, and fixed-length codes is stated without visible constants or factors relating their respective sample complexities. If the coupling is only up to a universal multiplicative factor independent of n, R, and ε, this must be stated explicitly; otherwise the subsequent reductions to hypothesis-testing results inherit an undetermined looseness.

Authors: The coupling in Theorem 1 is exact (no multiplicative factor other than 1). The proof shows that the sample complexity n^*(R,ε) for a general variable-length code equals that for a prefix-free code and for a fixed-length code, up to an additive o(1) term that is absorbed into the explicit bounds of later theorems. In the revision we will add a sentence immediately after the statement of Theorem 1 making this explicit: “The three sample complexities coincide exactly; i.e., n_{VL}^*(R,ε) = n_{PF}^*(R,ε) = n_{FL}^*(R,ε) for all n, R, ε.” This removes any ambiguity before the reductions to hypothesis testing are invoked. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation transfers to independent external hypothesis-testing results

full rationale

The paper defines sample complexity via constrained rate and excess-rate probability, then establishes a direct correspondence to hypothesis-testing sample complexity for memoryless sources. This permits invocation of prior, independent results whose error exponents are governed by Rényi entropy of order 1/2; the non-asymptotic bounds with explicit constants follow from that external characterization. No step in the abstract or described chain reduces a claimed prediction to a fitted parameter by construction, renames a known result, or relies on a self-citation whose content is itself unverified within the paper. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard properties of Rényi entropy and divergence together with the transfer of sample-complexity results from hypothesis testing; no new fitted parameters or invented entities are introduced.

axioms (2)

standard math Standard properties of Rényi entropy of order 1/2 and Rényi divergence
Invoked to characterize sample complexity for memoryless and Markov sources.
domain assumption Equivalence between compression sample complexity and hypothesis-testing sample complexity under the given definitions
Used to obtain bounds by appealing to existing testing results.

pith-pipeline@v0.9.0 · 5558 in / 1240 out tokens · 26262 ms · 2026-05-16T15:13:53.511468+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

n^*(X,ϵ) ≍ log(1/ϵ)/D_{1/2}(P∥U) … characterized not by the source entropy, but by its Rényi entropy of order 1/2
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

N_fl(X,ϵ) = inf{n : 1−∥P^n−U^n∥_TV ≤ϵ} (Le Cam's lemma)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.