pith. sign in

arxiv: 2603.25806 · v2 · submitted 2026-03-26 · 📊 stat.ME · math.ST· stat.CO· stat.TH

Context Tree Prior Distributions based on Node Weighting with exact Bayes Factors

Pith reviewed 2026-05-15 00:04 UTC · model grok-4.3

classification 📊 stat.ME math.STstat.COstat.TH
keywords variable-length Markov chainscontext treesBayesian priorsBayes factorsContext Tree WeightingContext Tree Maximizingmodel selection
0
0 comments X

The pith

Priors on context trees for variable-length Markov chains can be defined by directly weighting individual nodes while preserving exact marginal likelihoods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a new family of prior distributions over context tree structures by using functions that assign weights directly to nodes rather than through branching probabilities. This change makes it possible to express structural assumptions about which contexts should be more or less likely in an intuitive way. The resulting priors remain computationally tractable because marginal likelihoods and posterior-maximizing trees can still be obtained exactly by generalizing the Context Tree Weighting and Context Tree Maximizing algorithms. Exact Bayes factors then become available for comparing different structural hypotheses or model specifications.

Core claim

A prior over context trees is defined by a context-tree function that supplies a weight to each possible node; the prior probability of any particular tree is then proportional to the product of the weights on its nodes, and this representation allows the standard CTW and CTM recursions to be extended so that the marginal likelihood of any observed sequence and the maximum a posteriori tree are obtained exactly.

What carries the argument

Context-tree functions that assign a weight to each node and thereby determine the prior probability of every possible tree structure.

If this is right

  • Any structural belief about context lengths or branching patterns can be encoded by choosing an appropriate node-weight function.
  • Marginal likelihoods for observed sequences remain exactly computable for any tree in the support of the prior.
  • The tree that maximizes the posterior probability can be recovered exactly by a generalized Context Tree Maximizing procedure.
  • Bayes factors between competing priors or between different maximal depths can be computed exactly and used for model selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same node-weighting construction could be applied to other tree-structured models that currently rely on branching-process priors.
  • Practical implementations could combine the exact Bayes factor with forward selection of maximal depth to automate model choice on real sequence data.
  • Because the prior is defined through an arbitrary positive function on nodes, it opens the possibility of learning the weight function itself from multiple related data sets.

Load-bearing premise

That node-weighting functions can be chosen so the generalized Context Tree Weighting and Context Tree Maximizing recursions still compute the exact marginal likelihood and exact posterior mode.

What would settle it

A concrete node-weighting function together with a short observed sequence for which the generalized CTW recursion returns a different marginal likelihood than the value obtained by direct summation over all trees.

Figures

Figures reproduced from arXiv: 2603.25806 by Thiago Paulichen, Victor Freguglia.

Figure 1
Figure 1. Figure 1: Example of full rooted tree with root node λ and labels in A = {0, 1} A sequence s n n−l , l ≤ n, is called a suffix of a string z n 1 if sj = zj for all j = n − l, . . . , n. For instance, the empty string λ is a suffix of any string, and any inner node is a suffix of each of its children. A set of strings γ is called proper if no string in γ is a suffix of any other string in γ (Willems et al., 1995) [P… view at source ↗
Figure 2
Figure 2. Figure 2: Example of context tree (τ, p) 3. A Bayesian Framework to Context Trees As shown in Section 2, a VLMC model is fully specified by the context tree (τ, p). Un￾der the Bayesian perspective, we treat both unobserved components, τ and p, as random elements and assign prior distributions to each of them. The general Bayesian system can then be summarized by the following hierarchical structure: τ ∼ π(τ ), τ ∈ T… view at source ↗
Figure 3
Figure 3. Figure 3: Sum and maximum over all trees of the function F We first compute and store the value of f(s) for every node s ∈ N (τMAX). Then, starting from the leaves and moving towards the root, we compute the values of ΣF (s) and ΥF (s) at each node s ∈ N (τMAX). The values of f, ΣF , and ΥF are displayed next to their corresponding nodes in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Generator VLMC models (τ, p) We define the competing models by specifying different context-tree functions F. The set of context-tree functions utilized in each scenario is presented in [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: illustrates the two trees. Observe that exactly two operations are required to transform τ1 into τ2: (i) grow the node 0, and (ii) prune the nodes 01 and 11. τ1 λ 1 01 11 0 τ2 λ 0 1 00 10 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Graphical summary of the prior and posterior probabilities of the true trees. The top panel corresponds to simulation scenario (a), and the bottom panel to scenario (b) [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example of the bijection map for L = 3 and A = {0, 1}. The tree τ is an element of the set T 0 2 while τ 0 ∈ T 00 1 and τ 1 ∈ T 10 1 . The pair (τ 0 , τ 1 ) is the correspondent element to τ in T 00 1 × T 10 1 . Here, the underlined symbols are representing the roots. Therefore, for a node r such that ℓ(r) < L, we can rewrite: ΣF (r) = X τ∈{r}∪T r L−ℓ(r) \{r} Y s∈τ f(s) = f(r) + X τ∈T r L−ℓ(r) \{r} Y s∈τ f… view at source ↗
Figure 8
Figure 8. Figure 8: Illustration of the procedure involving the function T for L = 2 and A = {0, 1}. The node considered is r = 0 (shown as dashed in the figure). Note that if f(0) ≥ ΥF (00) · ΥF (10), then T(0) = {0} ∈ T 0 1 . Otherwise, the bijection map ensures, T(0) = {00} ∪ {10} ≃ {00, 10} ∈ T 0 1 . The underlined symbols are representing the roots. Now, we can prove, by descending induction on ℓ(r), that for every r ∈ N… view at source ↗
read the original abstract

Variable-length Markov chains (VLMCs) are a flexible class of higher-order Markov models that admit a natural representation as context trees. Existing Bayesian methods for specifying prior distributions on tree structures rely on branching processes, but these suffer from a fundamental limitation. The connection between branching probabilities at individual nodes and the structural properties of the induced tree distribution is not straightforward, making it difficult to construct priors encoding specific structural beliefs. We address this limitation by introducing a novel representation of prior distributions on tree space based on context-tree functions. By directly specifying weights for individual contexts through a function on nodes, our approach provides an intuitive mechanism for incorporating structural hypotheses into the prior. This class of distributions maintains computational tractability, allowing marginal likelihoods and posterior mode trees to be computed exactly via generalizations of the Context Tree Weighting (CTW) and Context Tree Maximizing (CTM) algorithms. Exact Bayes factor computation enables rigorous model comparison and hypothesis testing. We demonstrate the flexibility and effectiveness of our approach through simulation studies comparing different prior specifications, and develop practical algorithms for selecting the maximal depth and performing model selection based on Bayes factors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a class of prior distributions on context trees for variable-length Markov chains (VLMCs), defined directly via node-weighting functions on individual contexts rather than branching processes. It claims this representation allows intuitive encoding of structural hypotheses while preserving exact computational tractability, so that marginal likelihoods and posterior mode trees can be obtained via direct generalizations of the Context Tree Weighting (CTW) and Context Tree Maximizing (CTM) algorithms. Exact Bayes factors are asserted to follow, with the approach demonstrated on simulation studies for prior comparison, maximal-depth selection, and model selection.

Significance. If the exactness of the generalized recursions holds for the proposed weighting functions, the work would provide a useful advance over existing branching-process priors by offering a more transparent mechanism for incorporating domain knowledge about tree structure. The availability of exact marginal likelihoods would strengthen Bayesian model comparison in VLMC applications.

major comments (3)
  1. [§3.2] §3.2 (Definition of context-tree functions and node-weighting): The central claim that arbitrary node-weighting functions w yield priors whose marginal likelihoods remain exactly computable via generalized CTW/CTM recursions is not accompanied by a closure proof or explicit conditions on w. The original CTW proof relies on the product form of the prior allowing bottom-up summation; for a generic w the weighted sum over extensions of a context need not reduce to a simple function of the child sums, requiring either a restriction on w (e.g., multiplicative or depth-dependent) or additional correction terms. This assumption is load-bearing for the tractability result.
  2. [§4.1–4.2] §4.1–4.2 (Generalized CTW and CTM algorithms): No derivation or inductive argument is supplied showing that the dynamic-programming tables for the new priors can be filled exactly without enumeration or extra factors. The abstract states that the generalizations “maintain computational tractability,” but the specific recurrence relations and their invariance under the weighting are not verified.
  3. [§5] §5 (Simulation studies): The reported comparisons of different prior specifications omit error bars on the Monte Carlo estimates, baseline results against standard branching-process priors, and quantitative assessment of how often the exact Bayes-factor computations differ from approximate alternatives. These omissions weaken the empirical support for the practical advantage of the new class.
minor comments (2)
  1. [§3.1] Notation for the node-weighting function w is introduced without an explicit statement of its domain and codomain; a short formal definition would improve readability.
  2. [Abstract] The abstract mentions “exact Bayes factor computation” but the manuscript does not clarify whether the Bayes factors are between trees or between maximal depths; a brief clarifying sentence would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify areas where the manuscript can be strengthened with additional rigor and empirical detail. We respond to each major comment below and will incorporate the suggested changes in the revised version.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Definition of context-tree functions and node-weighting): The central claim that arbitrary node-weighting functions w yield priors whose marginal likelihoods remain exactly computable via generalized CTW/CTM recursions is not accompanied by a closure proof or explicit conditions on w. The original CTW proof relies on the product form of the prior allowing bottom-up summation; for a generic w the weighted sum over extensions of a context need not reduce to a simple function of the child sums, requiring either a restriction on w (e.g., multiplicative or depth-dependent) or additional correction terms. This assumption is load-bearing for the tractability result.

    Authors: We agree that an explicit closure argument is required. The node-weighting functions in the paper are defined to be compatible with the recursive structure of context trees (specifically, the weight of a context factors through its immediate extensions). In the revision we will add a dedicated lemma and inductive proof establishing that the generalized CTW recursion computes the exact marginal likelihood for this class, together with the precise conditions on w that guarantee the property without extra correction terms. revision: yes

  2. Referee: [§4.1–4.2] §4.1–4.2 (Generalized CTW and CTM algorithms): No derivation or inductive argument is supplied showing that the dynamic-programming tables for the new priors can be filled exactly without enumeration or extra factors. The abstract states that the generalizations “maintain computational tractability,” but the specific recurrence relations and their invariance under the weighting are not verified.

    Authors: We accept that the algorithmic sections would benefit from explicit derivations. The revised manuscript will contain complete inductive arguments for both the generalized CTW (marginal likelihood) and CTM (posterior mode) recursions, verifying that the dynamic-programming tables remain exact under the node-weighting scheme and that no enumeration or auxiliary factors are introduced. revision: yes

  3. Referee: [§5] §5 (Simulation studies): The reported comparisons of different prior specifications omit error bars on the Monte Carlo estimates, baseline results against standard branching-process priors, and quantitative assessment of how often the exact Bayes-factor computations differ from approximate alternatives. These omissions weaken the empirical support for the practical advantage of the new class.

    Authors: We will revise Section 5 to include standard-error bars on all Monte Carlo estimates, direct numerical comparisons against standard branching-process priors, and quantitative summaries (e.g., the proportion of replications in which the exact and approximate Bayes factors differ by more than a given threshold) to strengthen the empirical evidence. revision: yes

Circularity Check

0 steps flagged

Node-weighting priors extend CTW/CTM without reducing to self-definition or fitted inputs

full rationale

The paper defines a new class of priors directly via node-weighting functions on contexts, independent of the branching-process priors it replaces. Tractability is claimed as a consequence of generalizing the established CTW and CTM recursions rather than as an input or fitted quantity. No equations equate the new prior probabilities to data-derived parameters or prior-work results by construction. Self-citations to CTW/CTM are external benchmarks, not load-bearing for the core construction. The derivation remains self-contained against standard context-tree algorithms.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on the user choosing a weighting function to encode structural beliefs and on the unproven claim that CTW/CTM generalizations remain exact for these priors.

free parameters (1)
  • node weighting function
    User-specified function that assigns weights to individual contexts; its form is chosen to reflect structural hypotheses and is not derived from data.
axioms (1)
  • domain assumption Generalizations of the CTW and CTM algorithms compute exact marginal likelihoods under the new node-weighted priors
    Invoked to guarantee computational tractability and exact Bayes factors.
invented entities (1)
  • context-tree function no independent evidence
    purpose: To define prior distributions on tree structures by assigning weights directly to nodes
    New representational device introduced to overcome limitations of branching-process priors.

pith-pipeline@v0.9.0 · 5497 in / 1367 out tokens · 44054 ms · 2026-05-15T00:04:17.499236+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

  1. [1]

    doi: 10.1214/ss/1042727943. P. Bühlmann and A. J. Wyner. Variable length Markov chains.The Annals of Statistics, 27 (2):480–513,

  2. [2]

    doi: 10.1214/aos/1018031204. C. Dimitrakakis. Bayesian variable order Markov models. In Y. W. Teh and M. Titterington, editors,Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 ofProceedings of Machine Learning Research, pages 161–168, Chia Laguna Resort, Sardinia, Italy,

  3. [3]

    URLhttps://doi.org/10.1007/s11222-022-10191-2

    doi: 10.1007/ s11222-022-10191-2. URLhttps://doi.org/10.1007/s11222-022-10191-2. Y. Gao, I. Kontoyiannis, and E. Bienenstock. Estimating the entropy of binary time series: Methodology, some theory and a simulation study.Entropy, 10(2):71–99, June

  4. [4]

    doi: 10.3390/entropy-e10020071

    ISSN 1099-4300. doi: 10.3390/entropy-e10020071. URLhttp://dx.doi.org/10.3390/ entropy-e10020071. J. L. Gross and J. Yellen.Graph Theory and Its Applications. CRC Press,

  5. [5]

    B ayes Factors

    doi: 10.1080/01621459.1995.10476572. URLhttps://doi.org/ 10.1080/01621459.1995.10476572. I. Kontoyiannis, L. Mertzanis, A. Panotopoulou, I. Papageorgiou, and M. Skoularidou. Bayesian context trees: Modelling and exact inference for discrete time series.Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(4):1287–1323, 04

  6. [6]

    doi: 10.1111/rssb.12511

    ISSN 1369-7412. doi: 10.1111/rssb.12511. URLhttps://doi.org/10.1111/rssb.12511. R. Krichevsky and V. Trofimov. The performance of universal encoding.IEEE Transactions on Information Theory, 27(2):199–207,

  7. [7]

    doi: 10.3390/e24030328. I. Papageorgiou and I. Kontoyiannis. Posterior Representations for Bayesian Context Trees: Sampling, Estimation and Convergence.Bayesian Analysis, 19(2):501 – 529,

  8. [8]

    URLhttps://doi.org/10.1214/23-BA1362

    doi: 10.1214/23-BA1362. URLhttps://doi.org/10.1214/23-BA1362. J. Rissanen. A universal data compression system.IEEE Transactions on information theory, 29(5):656–664,

  9. [9]

    Volf and F

    P. Volf and F. Willems. On the context tree maximizing algorithm. InProceedings of 1995 IEEE International Symposium on Information Theory, pages 20–,

  10. [10]

    doi: 10.1109/ ISIT.1995.531122. 23 F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: basic properties.IEEE Transactions on Information Theory, 41(3):653–664,

  11. [11]

    doi: 10.1109/18.382012. A. Z. Zambom and D. Rud. Context tree classification and clustering.Journal of Statistical Computation and Simulation, 95(4):781–809,

  12. [12]

    Sequential adaptive design for emu- lating costly computer codes.Journal of Statistical Computation and Simulation, 95(3):654–675, 2025.doi:10.1080/00949655.2024.2436013

    doi: 10.1080/00949655.2024.2439453. URLhttps://doi.org/10.1080/00949655.2024.2439453. 24 6.Appendix 6.A.Proofs.In this subsection, we present the proofs of the results of this work. Through- out, we fix a maximal depthL. 6.A.1.Proposition

  13. [13]

    Prune the sub-tree belowrif f(r)≥ m−1Y k=0 ΥF (kr); Otherwise continue the same test for each childkr,k= 0, . . . , m−1

    Proof.First, note that the resulting tree obtained after inspecting a specific noder∈ N(τ MAX)in the step: “Prune the sub-tree belowrif f(r)≥ m−1Y k=0 ΥF (kr); Otherwise continue the same test for each childkr,k= 0, . . . , m−1” can be defined via the following recursive function: T(r)∈ T r L−ℓ(r) where •ifℓ(r) =L: T(r) ={r}; •otherwise, ifℓ(r)< L: T(r) =...