pith. sign in

arxiv: 1906.11700 · v1 · pith:67TSUXAFnew · submitted 2019-06-27 · 💻 cs.DS

Efficient algorithms for modifying and sampling from a categorical distribution

Pith reviewed 2026-05-25 14:12 UTC · model grok-4.3

classification 💻 cs.DS
keywords Huffman treecategorical distributionsamplingdynamic data structureprobabilistic programmingadaptive Huffman codinglogarithmic time operations
0
0 comments X

The pith

A Huffman tree represents categorical distributions so sampling and updates take O(log n) time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that a Huffman tree can serve as the underlying structure for a categorical distribution over n categories. With this structure, drawing a sample and performing add, delete or modify operations on categories each run in O(log n) time, whereas standard approaches require O(n) time for any parameter change. When n is large and the probabilities are hyper-parameters that must be adjusted often, the linear cost becomes prohibitive in probabilistic programming and machine learning settings. The work also reports that practical sampling times stay within a few percent of the information-theoretic optimum and notes the same algorithms apply to adaptive Huffman coding.

Core claim

A Huffman tree is an efficient data structure for representing categorical distributions and algorithms exist to generate samples as well as add, delete and modify categories in O(log n) time. The time to sample from the distribution remains, in practice, within a few percent of the theoretical optimal value. The same algorithm may also be useful in the context of adaptive Huffman coding where computational efficiency is important.

What carries the argument

Huffman tree that encodes category probabilities as prefix-free paths, enabling tree traversal for sampling and local rotations or updates for modifications.

If this is right

  • Sampling time stays within a few percent of the information-theoretic lower bound.
  • Add, delete and modify operations avoid the O(n) rebuild cost of array-based representations.
  • The same procedures support efficient adaptive Huffman coding.
  • Frequent hyper-parameter changes become feasible even when n is large.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The tree could be combined with other balanced search trees to handle additional constraints such as range queries over categories.
  • In settings where categories arrive or disappear online, the logarithmic bound would keep total work linear in the number of changes rather than quadratic.
  • The approach supplies a concrete dynamic alternative whenever a static alias table or cumulative-sum array is currently used for categorical sampling.

Load-bearing premise

Dynamic add, delete and modify operations on the Huffman tree can be implemented to run in O(log n) time while preserving the exact sampling probabilities.

What would settle it

An implementation of the claimed add/delete/modify procedures on a 10,000-category distribution where any single update exceeds O(log n) time or produces sample frequencies that deviate from the stated probabilities.

Figures

Figures reproduced from arXiv: 1906.11700 by Daniel Tang.

Figure 1
Figure 1. Figure 1: Algorithm to generate a sample from a Huffman tree [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Algorithm to add a new category to a tree [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
read the original abstract

Probabilistic programming languages and other machine learning applications often require samples to be generated from a categorical distribution where the probability of each one of $n$ categories is specified as a parameter. If the parameters are hyper-parameters then they need to be modified, however, current implementations of categorical distributions take $\mathcal{O}(n)$ time to modify a parameter. If $n$ is large and the parameters are being frequently modified, this can become prohibitive. Here we present the insight that a Huffman tree is an efficient data structure for representing categorical distributions and present algorithms to generate samples as well as add, delete and modify categories in $\mathcal{O}(\log(n))$ time. We demonstrate that the time to sample from the distribution remains, in practice, within a few percent of the theoretical optimal value. The same algorithm may also be useful in the context of adaptive Huffman coding where computational efficiency is important.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims that Huffman trees provide an efficient representation for categorical distributions over n categories, with algorithms for sampling as well as add/delete/modify operations all running in O(log n) time. It further states that practical sampling performance stays within a few percent of the information-theoretic optimum and suggests applicability to adaptive Huffman coding.

Significance. A correct O(log n) dynamic data structure for categorical sampling and updates would be valuable for probabilistic programming and ML workloads with frequently changing parameters. The practical performance claim, if supported by experiments, would be a minor positive. However, the core algorithmic claim is incorrect for arbitrary distributions.

major comments (1)
  1. [Abstract] Abstract: The central claim that sampling and add/delete/modify operations run in O(log n) time is false in the worst case. For probabilities p_i proportional to 2^{-i} (normalized), the Huffman tree is a single path of length n-1; both sampling (leaf traversal) and updates (restructuring along the path) therefore require Θ(n) time. The manuscript states no restriction to balanced distributions, no amortized analysis, and no auxiliary balancing mechanism.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for identifying the gap in our complexity analysis. We agree that the worst-case O(log n) claim as stated is not supported for arbitrary distributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that sampling and add/delete/modify operations run in O(log n) time is false in the worst case. For probabilities p_i proportional to 2^{-i} (normalized), the Huffman tree is a single path of length n-1; both sampling (leaf traversal) and updates (restructuring along the path) therefore require Θ(n) time. The manuscript states no restriction to balanced distributions, no amortized analysis, and no auxiliary balancing mechanism.

    Authors: The referee is correct. Huffman trees have no balancing guarantee, and the cited distribution produces a degenerate tree of height n-1, making both sampling and the update path Θ(n). The manuscript provides neither a restriction to balanced probability vectors, an amortized analysis, nor any rebalancing mechanism. We will revise the abstract, introduction, and all complexity statements to replace the unqualified O(log n) claim with the accurate bound of O(h), where h is the tree height (O(log n) in expectation or for sufficiently balanced distributions, but Θ(n) in the worst case). revision: yes

Circularity Check

0 steps flagged

No circularity: algorithmic claim with no self-referential derivation

full rationale

The paper asserts the existence of O(log n) algorithms for sampling and dynamic updates on a Huffman tree representation of a categorical distribution. No equations, fitted parameters, or derivations are present that reduce the claimed result to its own inputs by construction. The presentation is a direct algorithmic proposal rather than a mathematical derivation chain; any issues with the O(log n) bound (e.g., tree imbalance) concern correctness, not circularity of the kind defined by self-definition, fitted-input prediction, or self-citation load-bearing. The derivation is self-contained as an algorithmic statement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Huffman trees can be maintained dynamically for probability sampling; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Huffman tree structure can represent arbitrary categorical probabilities and support sampling by tree traversal
    Standard property of Huffman coding applied to sampling

pith-pipeline@v0.9.0 · 5668 in / 1246 out tokens · 43493 ms · 2026-05-25T14:12:29.392342+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Probabilistic programming in python using pymc3

    John Salvatier, Thomas V Wiecki, and Christopher Fonnes beck. Probabilistic programming in python using pymc3. PeerJ Computer Science, 2:e55, 2016

  2. [2]

    TensorFlow Distributions

    Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevd o, Srinivas V asudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. Tensorflow distr ibutions. arXiv preprint arXiv:1711.10604, 2017

  3. [3]

    A method for the construction of minimum -redundancy codes

    David A Huffman. A method for the construction of minimum -redundancy codes. Proceedings of the IRE , 40(9):1098–1101, 1952

  4. [4]

    Algorithm 673: dynamic huffman co ding

    Jeffrey Scott Vitter. Algorithm 673: dynamic huffman co ding. ACM Transactions on Mathematical Software (TOMS), 15(2):158–167, 1989

  5. [5]

    Dynamic huffman coding

    Donald E Knuth. Dynamic huffman coding. Journal of algorithms, 6(2):163–180, 1985

  6. [6]

    Introduction to algorithms

    Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, an d Clifford Stein. Introduction to algorithms . MIT press, 2009. 4