Efficient algorithms for modifying and sampling from a categorical distribution

Daniel Tang

arxiv: 1906.11700 · v1 · pith:67TSUXAFnew · submitted 2019-06-27 · 💻 cs.DS

Efficient algorithms for modifying and sampling from a categorical distribution

Daniel Tang This is my paper

Pith reviewed 2026-05-25 14:12 UTC · model grok-4.3

classification 💻 cs.DS

keywords Huffman treecategorical distributionsamplingdynamic data structureprobabilistic programmingadaptive Huffman codinglogarithmic time operations

0 comments

The pith

A Huffman tree represents categorical distributions so sampling and updates take O(log n) time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a Huffman tree can serve as the underlying structure for a categorical distribution over n categories. With this structure, drawing a sample and performing add, delete or modify operations on categories each run in O(log n) time, whereas standard approaches require O(n) time for any parameter change. When n is large and the probabilities are hyper-parameters that must be adjusted often, the linear cost becomes prohibitive in probabilistic programming and machine learning settings. The work also reports that practical sampling times stay within a few percent of the information-theoretic optimum and notes the same algorithms apply to adaptive Huffman coding.

Core claim

A Huffman tree is an efficient data structure for representing categorical distributions and algorithms exist to generate samples as well as add, delete and modify categories in O(log n) time. The time to sample from the distribution remains, in practice, within a few percent of the theoretical optimal value. The same algorithm may also be useful in the context of adaptive Huffman coding where computational efficiency is important.

What carries the argument

Huffman tree that encodes category probabilities as prefix-free paths, enabling tree traversal for sampling and local rotations or updates for modifications.

If this is right

Sampling time stays within a few percent of the information-theoretic lower bound.
Add, delete and modify operations avoid the O(n) rebuild cost of array-based representations.
The same procedures support efficient adaptive Huffman coding.
Frequent hyper-parameter changes become feasible even when n is large.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The tree could be combined with other balanced search trees to handle additional constraints such as range queries over categories.
In settings where categories arrive or disappear online, the logarithmic bound would keep total work linear in the number of changes rather than quadratic.
The approach supplies a concrete dynamic alternative whenever a static alias table or cumulative-sum array is currently used for categorical sampling.

Load-bearing premise

Dynamic add, delete and modify operations on the Huffman tree can be implemented to run in O(log n) time while preserving the exact sampling probabilities.

What would settle it

An implementation of the claimed add/delete/modify procedures on a 10,000-category distribution where any single update exceeds O(log n) time or produces sample frequencies that deviate from the stated probabilities.

Figures

Figures reproduced from arXiv: 1906.11700 by Daniel Tang.

**Figure 2.** Figure 2: Algorithm to add a new category to a tree [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Probabilistic programming languages and other machine learning applications often require samples to be generated from a categorical distribution where the probability of each one of $n$ categories is specified as a parameter. If the parameters are hyper-parameters then they need to be modified, however, current implementations of categorical distributions take $\mathcal{O}(n)$ time to modify a parameter. If $n$ is large and the parameters are being frequently modified, this can become prohibitive. Here we present the insight that a Huffman tree is an efficient data structure for representing categorical distributions and present algorithms to generate samples as well as add, delete and modify categories in $\mathcal{O}(\log(n))$ time. We demonstrate that the time to sample from the distribution remains, in practice, within a few percent of the theoretical optimal value. The same algorithm may also be useful in the context of adaptive Huffman coding where computational efficiency is important.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The O(log n) worst-case claim fails because Huffman trees can reach linear depth on skewed distributions, though the practical sampling results look usable.

read the letter

The main takeaway is that the paper's O(log n) time bound for sampling and updates does not hold in the worst case. Huffman trees built on arbitrary probabilities can form long chains, so traversal to a leaf or path updates can take linear time. The abstract states the bound without qualifiers, restrictions on the distribution, or auxiliary balancing, and the stress-test example with p_i = 2^{-i} shows the issue directly. That said, the core idea of maintaining a Huffman tree for dynamic categorical sampling is a reasonable practical suggestion. The work shows algorithms for add, delete, modify, and sample, and reports that observed sampling times stay close to the information-theoretic optimum in their tests. This is useful when n is large and updates are frequent, as standard array-based methods are O(n) per change. The practical performance data is the strongest part here. The main soft spot is the complexity claim. Without a proof sketch, pseudocode, or discussion of depth, it is hard to see how the bound is achieved for general inputs. The paper also lacks comparison to other dynamic sampling structures like Fenwick trees or alias methods with updates. This is a short note rather than a deep theoretical result. It is aimed at people implementing probabilistic programs or ML samplers who care about update cost more than worst-case guarantees. A reader who needs a drop-in replacement for categorical sampling with frequent changes could get value from the implementation details once they are written out. I would send it to peer review after the authors address the time bound and add a comparison section; the practical angle is worth referee time even if the asymptotic claim needs fixing.

Referee Report

1 major / 0 minor

Summary. The paper claims that Huffman trees provide an efficient representation for categorical distributions over n categories, with algorithms for sampling as well as add/delete/modify operations all running in O(log n) time. It further states that practical sampling performance stays within a few percent of the information-theoretic optimum and suggests applicability to adaptive Huffman coding.

Significance. A correct O(log n) dynamic data structure for categorical sampling and updates would be valuable for probabilistic programming and ML workloads with frequently changing parameters. The practical performance claim, if supported by experiments, would be a minor positive. However, the core algorithmic claim is incorrect for arbitrary distributions.

major comments (1)

[Abstract] Abstract: The central claim that sampling and add/delete/modify operations run in O(log n) time is false in the worst case. For probabilities p_i proportional to 2^{-i} (normalized), the Huffman tree is a single path of length n-1; both sampling (leaf traversal) and updates (restructuring along the path) therefore require Θ(n) time. The manuscript states no restriction to balanced distributions, no amortized analysis, and no auxiliary balancing mechanism.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying the gap in our complexity analysis. We agree that the worst-case O(log n) claim as stated is not supported for arbitrary distributions.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that sampling and add/delete/modify operations run in O(log n) time is false in the worst case. For probabilities p_i proportional to 2^{-i} (normalized), the Huffman tree is a single path of length n-1; both sampling (leaf traversal) and updates (restructuring along the path) therefore require Θ(n) time. The manuscript states no restriction to balanced distributions, no amortized analysis, and no auxiliary balancing mechanism.

Authors: The referee is correct. Huffman trees have no balancing guarantee, and the cited distribution produces a degenerate tree of height n-1, making both sampling and the update path Θ(n). The manuscript provides neither a restriction to balanced probability vectors, an amortized analysis, nor any rebalancing mechanism. We will revise the abstract, introduction, and all complexity statements to replace the unqualified O(log n) claim with the accurate bound of O(h), where h is the tree height (O(log n) in expectation or for sufficiently balanced distributions, but Θ(n) in the worst case). revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic claim with no self-referential derivation

full rationale

The paper asserts the existence of O(log n) algorithms for sampling and dynamic updates on a Huffman tree representation of a categorical distribution. No equations, fitted parameters, or derivations are present that reduce the claimed result to its own inputs by construction. The presentation is a direct algorithmic proposal rather than a mathematical derivation chain; any issues with the O(log n) bound (e.g., tree imbalance) concern correctness, not circularity of the kind defined by self-definition, fitted-input prediction, or self-citation load-bearing. The derivation is self-contained as an algorithmic statement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Huffman trees can be maintained dynamically for probability sampling; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Huffman tree structure can represent arbitrary categorical probabilities and support sampling by tree traversal
Standard property of Huffman coding applied to sampling

pith-pipeline@v0.9.0 · 5668 in / 1246 out tokens · 43493 ms · 2026-05-25T14:12:29.392342+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Probabilistic programming in python using pymc3

John Salvatier, Thomas V Wiecki, and Christopher Fonnes beck. Probabilistic programming in python using pymc3. PeerJ Computer Science, 2:e55, 2016

work page 2016
[2]

TensorFlow Distributions

Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevd o, Srinivas V asudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. Tensorﬂow distr ibutions. arXiv preprint arXiv:1711.10604, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

A method for the construction of minimum -redundancy codes

David A Huffman. A method for the construction of minimum -redundancy codes. Proceedings of the IRE , 40(9):1098–1101, 1952

work page 1952
[4]

Algorithm 673: dynamic huffman co ding

Jeffrey Scott Vitter. Algorithm 673: dynamic huffman co ding. ACM Transactions on Mathematical Software (TOMS), 15(2):158–167, 1989

work page 1989
[5]

Dynamic huffman coding

Donald E Knuth. Dynamic huffman coding. Journal of algorithms, 6(2):163–180, 1985

work page 1985
[6]

Introduction to algorithms

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, an d Clifford Stein. Introduction to algorithms . MIT press, 2009. 4

work page 2009

[1] [1]

Probabilistic programming in python using pymc3

John Salvatier, Thomas V Wiecki, and Christopher Fonnes beck. Probabilistic programming in python using pymc3. PeerJ Computer Science, 2:e55, 2016

work page 2016

[2] [2]

TensorFlow Distributions

Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevd o, Srinivas V asudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. Tensorﬂow distr ibutions. arXiv preprint arXiv:1711.10604, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

A method for the construction of minimum -redundancy codes

David A Huffman. A method for the construction of minimum -redundancy codes. Proceedings of the IRE , 40(9):1098–1101, 1952

work page 1952

[4] [4]

Algorithm 673: dynamic huffman co ding

Jeffrey Scott Vitter. Algorithm 673: dynamic huffman co ding. ACM Transactions on Mathematical Software (TOMS), 15(2):158–167, 1989

work page 1989

[5] [5]

Dynamic huffman coding

Donald E Knuth. Dynamic huffman coding. Journal of algorithms, 6(2):163–180, 1985

work page 1985

[6] [6]

Introduction to algorithms

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, an d Clifford Stein. Introduction to algorithms . MIT press, 2009. 4

work page 2009