Trees to Flows and Back: Unifying Decision Trees and Diffusion Models

Sai Niranjan Ramachandran; Suvrit Sra

arxiv: 2605.00414 · v2 · pith:HI5THGUTnew · submitted 2026-05-01 · 💻 cs.LG · cond-mat.stat-mech · cs.AI

Pith reviewed 2026-05-22 10:19 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.stat-mech · cs.AI

keywords decision trees · diffusion models · score matching · gradient boosting · tabular data generation · model distillation · unification · limiting regimes

The pith

Decision trees and diffusion models correspond mathematically in limiting regimes, revealing a shared Global Trajectory Score Matching optimization principle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a crisp mathematical correspondence between hierarchical decision trees and diffusion processes under appropriate limiting regimes. This correspondence uncovers that both model classes optimize according to the same Global Trajectory Score Matching principle, for which an idealized gradient boosting procedure is asymptotically optimal. The unification yields two practical models: TreeFlow for high-fidelity tabular generation at 2x speedup and DSMTree for distilling tree logic into neural networks while staying within 2 percent of teacher performance. A sympathetic reader would care because the shared principle bridges discrete hierarchical structures with continuous generative dynamics, suggesting ways to combine interpretability with flexible sampling.

Core claim

Hierarchical decision trees and diffusion processes are mathematically correspondent in appropriate limiting regimes; this correspondence produces a shared optimization principle called Global Trajectory Score Matching, under which gradient boosting in its idealized form is asymptotically optimal.

What carries the argument

The crisp mathematical correspondence between hierarchical decision trees and diffusion processes in limiting regimes, which makes the Global Trajectory Score Matching principle emerge as the common objective.

If this is right

TreeFlow produces competitive generation quality on tabular data while delivering higher fidelity and roughly 2x computational speedup.
DSMTree transfers hierarchical decision logic into neural networks and matches the teacher model within 2 percent on many benchmarks.
Gradient boosting becomes the asymptotically optimal procedure for the shared Global Trajectory Score Matching objective.
The unification supplies a common lens for analyzing and improving both discrete tree-based and continuous flow-based generative models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid architectures could alternate between tree splits for interpretability and diffusion steps for smooth sampling on the same trajectory objective.
The limiting-regime correspondence may extend to other discrete structures such as decision lists or recursive partitioning schemes.
Practical implementations could test whether enforcing the Global Trajectory Score Matching objective directly improves stability in existing diffusion pipelines.

Load-bearing premise

The mathematical correspondence between hierarchical decision trees and diffusion processes holds in the required limiting regimes.

What would settle it

A direct numerical check showing that the idealized gradient boosting procedure fails to converge to the optimum of Global Trajectory Score Matching when the tree-to-diffusion limit is taken.

Figures

Figures reproduced from arXiv: 2605.00414 by Sai Niranjan Ramachandran, Suvrit Sra.

**Figure 1.** A Visual Metaphor for Trees to Flows and back. We connect the discrete structure of a decision tree (left) with the continuous dynamics of a flow process (right). We establish a formal correspondence, enabling an analysis of boosting via flows, and imbuing generative models with tree-based inductive biases. This Tree↔Flow correspondence reveals a common optimization principle, which we term Global Trajec…

**Figure 2.** Implicit tree structure discovered from a trained diffusion model on the 4-Corners dataset using the corrected time-domain clustering method. (a) The original labeled data clusters, representing the leaves of the hierarchy at t = 0. (b) The discovered hierarchical structure (dendrogram) obtained by tracking the forward SDE trajectories of each cluster and performing agglomerative clustering. The vertical …

**Figure 3.** Comparison of information decay on MNIST. (Top) The normalized entropy of a decision …

**Figure 4.** TREEFLOW compared against baseline generative models across a suite of tabular benchmarks. We evaluate on four axes: Utility (TSTR Accuracy ↑), Fidelity (Wasserstein Distance ↓), Structure (Correlation Error ↓), and Efficiency (Runtime ↓). TREEFLOW achieves highest TSTR accuracy on 3/5 benchmarks, lowest Wasserstein distance on 4/5 benchmarks, and lowest correlation error on 3/5 benchmarks, while being 2X…

**Figure 5.** Classification accuracy of the DSM-TREE model compared to its teacher (Base Tree). DSM-TREE matches or exceeds teacher performance on 4/5 datasets, with +3.7% improvement on Heart Disease, demonstrating successful knowledge transfer of complete hierarchical structure. Zoom for clarity. A larger version is available in …

**Figure 6.** Logical flowchart of the Tree-to-Flow derivation detailed in …

**Figure 7.** Logical flowchart of the Flow-to-Tree derivation detailed in …

**Figure 8.** Logical flowchart of the derivation showing Gradient Boosting is an optimal GTSM solver, …

**Figure 9.** Flowchart for the implications of the GTSM framework (Section E).

**Figure 10.** Flowchart for the derivation of DSM-TREE (Section G).

**Figure 11.** Flowchart for the derivation of TREEFLOW (Section H). Taken together, these flowcharts provide a complete visual summary of the main theoretical contributions of this paper. These diagrams are intended to serve as a conceptual scaffold for the reader. The subsequent sections will now present the full, rigorous mathematical derivations and proofs that underpin each step in these roadmaps. C Detailed Deriv…

**Figure 12.** Additional Results for Implicit Tree Structure Discovery (Experiment 1). These figures supplement …

**Figure 13.** Additional Results for Information-Theoretic Analysis (Experiment 2). These figures supplement …

**Figure 14.** Magnified View of DSM-TREE Performance. Classification accuracy of the DSM-TREE model compared to its teacher (Base Tree). Performance is nearly identical on most datasets, with DSM-TREE outperforming on the Heart Disease dataset, which demonstrates successful knowledge transfer from the discrete tree to the continuous neural network. This full-page view is provided for enhanced clarity of the accuracy p…

**Figure 15.** Full-Page View of TREEFLOW Performance. Comparative performance of TREEFLOW against baseline generative models across a suite of tabular benchmarks. We evaluate on four axes: Utility (TSTR Accuracy ↑), Fidelity (Wasserstein Distance ↓), Structure (Correlation Error ↓), and Efficiency (Runtime ↓). TREEFLOW consistently demonstrates state-of-the-art utility, often matching or exceeding the performance of mo…

Original abstract

Decision trees and diffusion models are ostensibly disparate model classes, one discrete and hierarchical, the other continuous and dynamic. This work unifies the two by establishing a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. Our unification reveals a shared optimization principle: Global Trajectory Score Matching (GTSM), for which gradient boosting (in an idealized version) is asymptotically optimal. We underscore the conceptual value of our work through two key practical instantiations: TreeFlow, which achieves competitive generation quality on tabular data with higher fidelity and a 2x computational speedup, and DSMTree, a novel distillation method that transfers hierarchical decision logic into neural networks, matching teacher performance within 2% on many benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The unification via GTSM is conceptually neat but the limiting-regime step looks vulnerable to discretization bias that the paper does not fully close.

The paper's main contribution is a claimed crisp correspondence between hierarchical decision trees and diffusion trajectories in certain limits, which surfaces a common objective they call Global Trajectory Score Matching (GTSM). Idealized gradient boosting is then said to be asymptotically optimal for GTSM. They turn the idea into two concrete pieces: TreeFlow for tabular generation that reports a 2x speedup with competitive fidelity, and DSMTree that distills tree logic into networks while staying within 2% of the teacher on several benchmarks.

Those practical results are the part that actually lands; the speed and accuracy numbers are straightforward to check and appear to deliver on the tasks shown. The conceptual framing is also new enough that it is not just a re-labeling of existing boosting or score-matching work.

The soft spot is exactly the one the stress-test flags. The mapping from tree depth to diffusion time and from leaf probabilities to marginals is sketched, but the argument that the discrete gradient updates converge to the continuous GTSM loss without leftover bias is not accompanied by uniform error bounds or regularity conditions on the splitting process. If that bias persists, the optimality claim for boosting does not transfer cleanly and the shared principle becomes more of an analogy than a derived equivalence. The paper does not appear to contain counter-examples or explicit limit derivations that would rule this out.

This is the sort of work that would interest people building hybrid discrete-continuous generators or working on tabular data, where the speed gains could matter. It is worth sending to referees because the experiments give concrete things to verify and the theory gap is fixable in revision rather than fatal.

Referee Report

3 major / 2 minor

Summary. The paper claims to establish a crisp mathematical correspondence between hierarchical decision trees and diffusion processes in appropriate limiting regimes. This unification reveals a shared optimization principle called Global Trajectory Score Matching (GTSM), for which an idealized version of gradient boosting is asymptotically optimal. Two practical instantiations are presented: TreeFlow, which generates tabular data with competitive quality, higher fidelity, and a 2x speedup, and DSMTree, a distillation method that transfers decision logic into neural networks while matching teacher performance within 2% on benchmarks.

Significance. If the limiting-regime correspondence and GTSM optimality hold with rigorous support, the work could bridge discrete hierarchical models and continuous generative dynamics in machine learning, offering conceptual unification and new algorithmic tools. The practical methods suggest efficiency gains for tabular generation and model compression, with potential for broader impact if the theoretical foundation is solidified.

major comments (3)

[Section 3] Section 3 (unification): The mapping of tree depth to diffusion time and leaf probabilities to marginals is presented without an explicit derivation or error analysis showing that the Global Trajectory Score Matching loss remains identical after the continuum limit. The assumption that discrete splitting gradients equate to continuous score-matching gradients requires additional regularity conditions to ensure discretization bias vanishes uniformly; without this, the shared optimization principle does not transfer.
[Abstract] Abstract and optimality discussion: The assertion that idealized gradient boosting is asymptotically optimal for GTSM lacks derivations, analysis of limiting regimes, or verification details. This raises the risk that the optimality claim reduces to the definitions introduced in the paper's own limiting process rather than following independently.
[Section 3] Section 3: The crisp correspondence between decision trees and diffusion trajectories is asserted to hold in appropriate limiting regimes, but no analysis addresses whether non-vanishing discretization bias from the splitting process could invalidate the exact equivalence of the GTSM objective.

minor comments (2)

[Section 3] Clarify notation distinguishing discrete tree quantities from their continuous diffusion counterparts, particularly in equations involving the limiting process.
[Practical instantiations] Add more detail on empirical verification for the claimed 2x speedup and 2% performance match in the practical sections to support the instantiations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which identify opportunities to strengthen the rigor of our theoretical unification. We address each major comment below and will incorporate revisions to provide the requested derivations and analyses.

Referee: [Section 3] Section 3 (unification): The mapping of tree depth to diffusion time and leaf probabilities to marginals is presented without an explicit derivation or error analysis showing that the Global Trajectory Score Matching loss remains identical after the continuum limit. The assumption that discrete splitting gradients equate to continuous score-matching gradients requires additional regularity conditions to ensure discretization bias vanishes uniformly; without this, the shared optimization principle does not transfer.

Authors: We agree that an explicit derivation and error analysis would improve clarity. In the revised manuscript, we will expand Section 3 with a step-by-step derivation of the continuum limit, mapping tree depth to diffusion time and leaf probabilities to marginals. We will also include an error analysis under suitable regularity conditions (e.g., Lipschitz continuity of the underlying densities) demonstrating uniform convergence of the GTSM loss and vanishing discretization bias. This will rigorously establish transfer of the shared optimization principle. revision: yes
Referee: [Abstract] Abstract and optimality discussion: The assertion that idealized gradient boosting is asymptotically optimal for GTSM lacks derivations, analysis of limiting regimes, or verification details. This raises the risk that the optimality claim reduces to the definitions introduced in the paper's own limiting process rather than following independently.

Authors: The optimality follows from interpreting boosting steps as a discretization of the gradient flow minimizing the GTSM functional in the continuum limit, independent of the tree-diffusion correspondence. We will revise the abstract and add a subsection with the derivation, limiting-regime analysis, and verification on low-dimensional cases. This grounds the result in standard functional gradient descent theory rather than circularity. revision: yes
Referee: [Section 3] Section 3: The crisp correspondence between decision trees and diffusion trajectories is asserted to hold in appropriate limiting regimes, but no analysis addresses whether non-vanishing discretization bias from the splitting process could invalidate the exact equivalence of the GTSM objective.

Authors: We concur that explicit treatment of discretization bias is needed. The revision will add analysis in Section 3 showing that, under bounded splitting criteria and Lipschitz score functions, the bias vanishes as the partition refines and the time discretization step approaches zero. This confirms equivalence of the GTSM objectives in the limit and preserves the correspondence. revision: yes

Circularity Check

0 steps flagged

No significant circularity in unification or optimality claim

The paper presents a mathematical correspondence between hierarchical decision trees and diffusion processes in limiting regimes, from which a shared Global Trajectory Score Matching (GTSM) principle is derived, with an asymptotic optimality statement for idealized gradient boosting. No equations or definitions in the provided abstract or context reduce the claimed optimality or unification to a self-referential fit, renaming, or self-citation chain by construction. The limiting-regime mapping is offered as an independent derivation rather than a tautology, and the work remains self-contained against external benchmarks without load-bearing self-references that collapse the central result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to identify specific free parameters, axioms, or invented entities; no explicit fitting or new postulates are described.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Paper passage: Theorem 2.3 (Limit under Dyadic Refinement) ... continuous-time filtration $\{\mathcal{F}_t\}$ ... $G_t$ converges to time-invariant generator $G$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Global Trajectory Score Matching (GTSM) ... $L^{C}_{\mathrm{GTSM}}(\theta) = \tfrac{1}{2} \int w(t)\, \mathbb{E}\big[\|s_\theta - s^*\|_D^2\big]\, dt$
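
To make the quoted objective, ½ ∫ w(t) E[‖sθ − s∗‖²_D] dt, concrete, here is a minimal Monte Carlo sketch. It is not the paper's implementation. It assumes one-dimensional standard normal data, a forward process x_t = x_0 + sqrt(t)·z, w(t) = 1 on [0, T], and D equal to the identity; under those assumptions the true score is s∗(x, t) = −x / (1 + t). The names `score_model` and `gtsm_loss` are hypothetical and chosen only for illustration.

```python
import numpy as np

def true_score(x, t):
    # Score of N(0, 1 + t), the marginal of x_t = x_0 + sqrt(t) * z with x_0 ~ N(0, 1).
    return -x / (1.0 + t)

def score_model(x, t, theta):
    # Hypothetical one-parameter model; theta = 1 recovers the true score.
    return -theta * x / (1.0 + t)

def gtsm_loss(theta, T=1.0, n_samples=200_000, seed=0):
    """Monte Carlo estimate of 0.5 * int_0^T E[(s_theta - s*)^2] dt, with w(t) = 1 and D = I."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, T, size=n_samples)                  # time sampled uniformly on [0, T]
    x0 = rng.standard_normal(n_samples)                      # ASSUMED data distribution
    xt = x0 + np.sqrt(t) * rng.standard_normal(n_samples)    # forward-noised sample
    sq_err = (score_model(xt, t, theta) - true_score(xt, t)) ** 2
    return 0.5 * T * sq_err.mean()                           # T rescales the uniform-time average

loss_exact_model = gtsm_loss(theta=1.0)
loss_half_model = gtsm_loss(theta=0.5)
closed_form = 0.5 * (1 - 0.5) ** 2 * np.log(2.0)             # 0.5 * (1 - theta)^2 * ln(1 + T)
```

In this toy setting the loss has the closed form ½(1 − θ)² ln(1 + T), which gives a quick sanity check on the estimator before trying a richer score model.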

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Axiom 1: Measurability (Information Erasure). The output density $p(x, k)$ must not contain information about the boundaries that were erased in the transition from $\Pi_{k-1}$. This means $p(x, k)$ must be measurable with respect to the coarser sigma-algebra $\mathcal{F}_k$ generated by the partition $\Pi_k$.

[2]

Axiom 2: Partial Averages (Probability Conservation). The probability mass must be conserved within every region of the new partition. This property must extend to all sets $A \in \mathcal{F}_k$, such that:

$$\int_A p(x, k)\,dx = \int_A p(x, k-1)\,dx \quad \forall A \in \mathcal{F}_k. \tag{11}$$

A fundamental theorem of measure theory states that for a given function $p(x, k-1)$ and a sigma-algebra $\mathcal{F}_k$, there exists a ...
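
The two axioms can be checked numerically on a toy example. The sketch below assumes a density that is piecewise constant on 8 equal bins of [0, 1] and a coarser partition that merges adjacent pairs of bins. Averaging the density over each coarse cell (the discrete form of a conditional expectation) yields a function that is constant on every coarse cell and carries the same mass there. The function name `coarse_grain` and the numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def coarse_grain(density, cell_ids, widths):
    """Replace the density on each coarse cell by its width-weighted average."""
    density = np.asarray(density, dtype=float)
    out = np.empty_like(density)
    for cell in np.unique(cell_ids):
        mask = cell_ids == cell
        mass = np.sum(density[mask] * widths[mask])   # probability mass inside the cell
        out[mask] = mass / np.sum(widths[mask])       # constant value on the cell
    return out

# Fine partition: 8 equal bins on [0, 1]; the coarse partition merges adjacent pairs.
widths = np.full(8, 1.0 / 8)
p_fine = np.array([0.4, 1.2, 0.8, 1.6, 2.0, 0.4, 1.0, 0.6])   # integrates to 1
cell_ids = np.repeat(np.arange(4), 2)                          # fine bin -> coarse cell
p_coarse = coarse_grain(p_fine, cell_ids, widths)

# Mass per coarse cell before and after coarse-graining (Axiom 2).
mass_fine = np.bincount(cell_ids, weights=p_fine * widths)
mass_coarse = np.bincount(cell_ids, weights=p_coarse * widths)
```

Applying `coarse_grain` a second time with the same cells leaves `p_coarse` unchanged, which is the projection property one expects of a conditional expectation.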
[3]

Consistent convergence: For any $t_1 < t_2$, the sequence of propagators $\{P^{(n)}_{t_2 \leftarrow t_1}\}$ converges to a limit $P_{t_2 \leftarrow t_1}$.

[4]

Bounded intermediate complexity: The "geometric complexity" added at refinement step $n$, measured by a suitable metric on partitions, is $O(2^{-n})$. Then: (i) The limiting process has a generator $G_t$ with local Lipschitz structure on any compact $I \subset [0, T]$. (ii) For any sequence of compact intervals $I_m \uparrow [0, T]$, there exists a subsequence of refinements such that t...

[5]

The series terminates after the second term, i.e., $D^{(n)}(x, t) \equiv 0$ for all $n > 2$.

[6]

The series does not terminate, and all even-ordered moments $D^{(2m)}$ for $m \ge 1$ are strictly positive. The expansion cannot terminate at any finite order greater than two. Proof. The proof demonstrates that if any moment $D^{(n)}$ for $n > 2$ is non-zero, then moments of arbitrarily high order must also be non-zero. For clarity, we present the proof in one dimension; the...

[7]

to truncate exactly after the second term. The remaining $n = 1$ (drift) and $n = 2$ (diffusion) terms constitute the Fokker-Planck equation, which describes the evolution of the probability density $p(x, t)$:

$$\frac{\partial p(x, t)}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial x_i}\big[f_i(x)\,p(x, t)\big] + \frac{1}{2} \sum_{i,j=1}^{d} \frac{\partial^2}{\partial x_i\,\partial x_j}\big[(g(x)g(x)^\top)_{ij}\,p(x, t)\big]. \tag{43}$$

Second, we establish the equivalence to the SDE. It is a st...
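
The SDE associated with a Fokker-Planck equation of this drift-plus-diffusion form is dX = f(X) dt + g(X) dW. As a hedged illustration, the sketch below simulates it in one dimension with the Euler-Maruyama scheme. The Ornstein-Uhlenbeck choice f(x) = −x, g(x) = √2 is an assumption made only because its stationary density is known to be N(0, 1); the function name `euler_maruyama` is likewise hypothetical.

```python
import numpy as np

def euler_maruyama(f, g, x0, T, n_steps, n_paths, seed=0):
    """Simulate dX = f(X) dt + g(X) dW for scalar X and return the samples at time T."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        dw = rng.standard_normal(n_paths) * np.sqrt(dt)   # Brownian increment over dt
        x = x + f(x) * dt + g(x) * dw                     # drift step plus diffusion step
    return x

# ASSUMED example: Ornstein-Uhlenbeck process with f(x) = -x and g(x) = sqrt(2).
x_T = euler_maruyama(
    f=lambda x: -x,
    g=lambda x: np.sqrt(2.0) * np.ones_like(x),
    x0=0.0, T=5.0, n_steps=500, n_paths=20_000,
)
```

The empirical mean and variance of `x_T` should sit close to 0 and 1, matching the stationary solution of the corresponding Fokker-Planck equation up to discretization and sampling error.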
[8]

The basins of attraction $B_k := \{x : \lim_{t \to \infty} \phi_t(x) = x^*_k\}$ under the gradient flow $\dot{x} = \nabla \log p_0(x)$ form a partition of $\mathcal{X}$,

[9]

There exists $\rho_{\min} > 0$ such that $\inf_{x \in \partial B_k} p_0(x) < p_0(x^*_k) - \rho_{\min}$ for all $k$. Definition D.7 (Level-Set Clustering). For a threshold $\lambda \in (0, \max_k p_0(x^*_k))$, define the super-level set:

$$S_\lambda := \{x : p_0(x) \ge \lambda\}. \tag{45}$$

The initial clusters are the connected components of $S_\lambda$ for $\lambda$ chosen such that $S_\lambda$ has exactly $K$ components, one containing each mode $x^*_k$. Proposition D.8 (Well-D...
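
Level-set clustering as defined in Definition D.7 is easy to picture on a one-dimensional grid, where connected components of the super-level set are just contiguous runs of grid points. The sketch below assumes a two-component Gaussian mixture for p0 and a hand-picked threshold λ = 0.05; both are illustrative choices, and `super_level_components` is a hypothetical helper name.

```python
import numpy as np

def super_level_components(density, lam):
    """Return (start, end) index pairs of the contiguous runs where density >= lam."""
    above = np.asarray(density) >= lam
    padded = np.concatenate(([False], above, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])   # indices where a run starts or stops
    return list(zip(changes[0::2], changes[1::2] - 1))

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# ASSUMED example: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2) on a 1-D grid.
x = np.linspace(-5.0, 5.0, 1001)
p0 = 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)

components = super_level_components(p0, lam=0.05)          # one run per mode for this lambda
modes = [x[s + np.argmax(p0[s:e + 1])] for s, e in components]
```

Raising λ above the peak height leaves no components, and lowering it far enough merges the two runs into one, which is why the definition asks for a λ that yields exactly K components.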
[10]

Initialization (Leaves at $t = 0$): At time $t = 0$, all $K$ clusters $\{C_1, \dots, C_K\}$ are distinct by definition, forming the leaves of our hierarchy.

[11]

First Merger: We compute the set of all pairwise merger times, $\{t^{(n,\epsilon)}_{ij}\}$ for $i \ne j$. The first merger event occurs at the minimum of these times, $t_1 = \min_{i,j} t^{(n,\epsilon)}_{ij}$. Let this minimum occur for the pair $(C_a, C_b)$. At time $t_1$, we merge $C_a$ and $C_b$ into a new super-cluster, $C_{ab}$. This event defines the lowest-level branch in the hierarchy.

[12]

Iterative Merging: We now have a set of $K - 1$ clusters. We repeat the process, computing the merger times between all pairs in this new set (including between the new super-cluster and other original clusters). The next merger event at time $t_2 > t_1$ defines the next branch. This agglomerative process continues, and the ordered sequence of merger times defines...
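
The initialization, first-merger, and iterative-merging steps can be sketched as a greedy loop over a matrix of pairwise merger times. The excerpt does not say how the merger time between a new super-cluster and the remaining clusters is recomputed, so the code below assumes single linkage (the earlier of the two constituent times). That rule, the function name `merge_hierarchy`, and the 3-cluster input are all illustrative assumptions.

```python
import numpy as np

def merge_hierarchy(merge_times):
    """Greedy agglomerative merging from a symmetric matrix of pairwise merger times."""
    t = np.array(merge_times, dtype=float)
    np.fill_diagonal(t, np.inf)                       # a cluster never merges with itself
    clusters = {i: (i,) for i in range(len(t))}       # active cluster id -> member leaves
    events = []
    while len(clusters) > 1:
        ids = sorted(clusters)
        sub = t[np.ix_(ids, ids)]                     # times among active clusters only
        a_idx, b_idx = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = ids[a_idx], ids[b_idx]
        events.append((float(sub[a_idx, b_idx]), clusters[a], clusters[b]))
        # ASSUMPTION: single linkage, the super-cluster inherits the earlier merger time.
        t[a, :] = np.minimum(t[a, :], t[b, :])
        t[:, a] = t[a, :]
        t[a, a] = np.inf
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return events

# Hypothetical pairwise merger times for K = 3 clusters.
events = merge_hierarchy([[0, 1, 3],
                          [1, 0, 2],
                          [3, 2, 0]])
```

Each event records the merger time and the two groups of leaves that merge, so the ordered list of events is enough to draw the dendrogram.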
[13]

Defining the Root: A maximally entropic distribution, such as a uniform distribution over a manifold $\mathcal{X}$, represents a state of complete information loss relative to that manifold. It is a distribution that has no further structural information to lose. This state acts as the single, common ancestor for all possible paths: the root of the tree. Any partition...

[14]

Consistency with Coarse-Graining: The mapping from a tree to an SDE in Section C is based on a coarse-graining operator defined by conditional expectation. This operator models a pure process of information aggregation, or entropy increase. A forward diffusion process is only guaranteed to be a pure entropy-increasing process from any initial state if its st...

[15]

Partition: Its partition of the feature space, $\Pi_m$, is the common refinement of the partitions induced by all constituent weak learners $\{h_1, \dots, h_m\}$. That is, a region $R \in \Pi_m$ is the largest possible connected subset of $\mathcal{X}$ such that for any two points $x_a, x_b \in R$, $h_i(x_a) = h_i(x_b)$ for all $i \in \{1, \dots, m\}$.

[16]

Leaf Values: The value assigned to any leaf region in $\Pi_m$ is the sum of the predictions from all constituent trees, $\sum_{i=1}^{m} \eta\, h_i(x)$, where $\eta$ is the learning rate. This abstraction provides a monolithic, non-additive representation of the ensemble, allowing us to rigorously analyze the evolution of its decision boundaries as a single geometric object. E.1.3 Monot...
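
The partition and leaf-value construction can be illustrated for the simplest weak learners. The sketch below assumes one-dimensional decision stumps, so the common refinement is the set of intervals cut out by the distinct thresholds, and each interval's leaf value is the learning-rate-weighted sum of the stump outputs on it. The helper names and the three stumps are hypothetical.

```python
import numpy as np

def stump(threshold, left, right):
    """Weak learner h(x) = left if x <= threshold else right (1-D input)."""
    return lambda x: np.where(np.asarray(x) <= threshold, left, right)

def net_tree(stump_params, eta):
    """Collapse an ensemble of 1-D stumps into one partition with summed leaf values."""
    thresholds = np.unique([thr for thr, _, _ in stump_params])   # common refinement
    # One representative per region: each threshold closes a region, plus one point beyond the last.
    reps = np.concatenate((thresholds, [thresholds[-1] + 1.0]))
    leaves = sum(eta * stump(*p)(reps) for p in stump_params)
    return thresholds, leaves

def predict_net_tree(thresholds, leaves, x):
    # side="left" sends x == threshold to the region on its left, matching "x <= threshold".
    return leaves[np.searchsorted(thresholds, x, side="left")]

params = [(0.0, -1.0, 1.0), (1.5, 0.5, -0.5), (0.0, 0.2, 0.4)]    # (threshold, left, right)
eta = 0.1
thresholds, leaves = net_tree(params, eta)

x = np.linspace(-3, 3, 61)
ensemble = sum(eta * stump(*p)(x) for p in params)                # additive form
single_tree = predict_net_tree(thresholds, leaves, x)             # collapsed form
```

Two of the three stumps share a threshold, so the collapsed tree has three leaves rather than four; the additive and collapsed forms agree at every query point.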
[17]

Leaf Partition: The set of its leaf nodes corresponds exactly to the regions of the partition $\Pi_m$.

[18]

Hierarchical Structure: The tree's internal structure is defined by the unique, nested sequence of partitions generated by the boosting algorithm itself: $\{\Pi_0, \Pi_1, \dots, \Pi_m\}$. The coarsening from the leaves ($\Pi_m$) to the root ($\Pi_0 = \{\mathcal{X}\}$) is defined by reversing the historical refinement process. The parent of a node in partition level $\Pi_k$ is the unique regio...

[19]

In the context of a diffusion process, this fine-grained information corresponds to the behavior at earlier times. Therefore, the model $S_{m+1}$, being more refined, provides a law $P_{m+1}$ that is a faithful approximation of the ideal law $P^\star$ on a larger tail $\sigma$-algebra than $S_m$ can. Let us formalize this. For each model $S_m$, let $t_m$ be the earliest time for which its...

[20]

Greedy stage-wise algorithms (e.g., boosting, matching pursuit): By construction, $L_k$ is optimized before $L_{k+1}$ is considered.

[21]

Hierarchical architectures (e.g., progressive growing, curriculum learning): The model capacity for finer scales is introduced only after coarser scales are learned.

[22]

Spectral bias in neural networks (empirical): Implicit regularization causes neural networks to learn low-frequency (coarse) components faster than high-frequency (fine) components (Rahaman et al., 2019). Remark E.15 (Falsifiability). This assumption can be tested empirically by measuring $\tau_k$ via the learning curves of $L_k(\theta_t)$. For boosting, it holds by de...

[23]

The structurally-supervised learner knows the next target class is $K^\star_{t_{m+1}}$. Its task is to find an operator $R^\star_{m+1}$ that transitions $S^\star_m$ to a model $S^\star_{m+1} \in K^\star_{t_{m+1}}$.

[24]

The data-supervised learner (boosting) does not know the target class $K^\star_{t_{m+1}}$. It uses the local data supervision (residuals) to find an update $h^\star_{m+1}$ that produces a new model $S_{m+1}$.
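
For readers who want the data-supervised update in concrete form, the sketch below runs squared-loss gradient boosting with one-dimensional stumps: each round fits a stump to the current residuals and adds a shrunken copy to the model. The stump learner, the learning rate, the synthetic data, and the function names are assumptions for illustration; this is a finite-sample toy, not the idealized procedure analyzed in the paper.

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regressor for residuals r under squared error (1-D input)."""
    best = None
    for thr in np.unique(x)[:-1]:                     # candidate splits; both sides non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    _, thr, lval, rval = best
    return lambda z: np.where(z <= thr, lval, rval)

def boost(x, y, n_rounds=20, eta=0.5):
    """Squared-loss gradient boosting: each round fits a stump to the current residuals."""
    pred = np.full_like(y, y.mean(), dtype=float)     # F_0: constant model
    history = [np.mean((y - pred) ** 2)]
    for _ in range(n_rounds):
        residual = y - pred                           # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)
        pred = pred + eta * h(x)                      # F_{m+1} = F_m + eta * h_{m+1}
        history.append(np.mean((y - pred) ** 2))
    return pred, history

# ASSUMED synthetic data: noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)
pred, mse_history = boost(x, y)
```

Because each stump is a least-squares fit to the residuals and the learning rate lies in (0, 2), the training error in `mse_history` cannot increase from one round to the next.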
[25]

Here we invoke Assumption E.14. The coarse-to-fine learning bias guarantees that at step $m$, the model $S_m$ is already $\epsilon$-optimal for the tail-field $\mathcal{F}_{\ge t_m}$. The "coarsest" remaining part of the learning problem is to correctly model the dynamics in the time interval $[t_{m+1}, t_m)$.

[26]

The assumption guarantees that the optimal, data-supervised update $h^\star_{m+1}$ will be precisely the one that resolves this coarsest remaining part of the problem. By doing so, it produces a new model $S_{m+1}$ whose law now correctly matches the ideal law on the tail-field $\mathcal{F}_{\ge t_{m+1}}$.

[27]

By definition, any such model is a member of the equivalence class $K^\star_{t_{m+1}}$. Since the update is optimal, it must be the optimal such model, $S^\star_{m+1}$. Therefore, $S_{m+1} = S^\star_{m+1}$. By the principle of induction, the trajectories are identical. This crucial equivalence justifies our use of the DSM mathematical machinery to analyze the local, data-supervised boost...

[28]

Let $X^\star_t$ be a path generated by the ideal process SDE$^\star$, and let $X_{m,t}$ be a path generated by our model $S_m$. A core assumption of the score-based modeling framework is that the forward process, defined by drift $b(x, t)$ and diffusion tensor $D(x, t)$, is a fixed, known process independent of the data. The learning task is to find the correct score function to ...

[29]

Let $X^\star_\tau$ and $X_{m,\tau}$ be the corresponding reverse-time processes, using the reverse time variable $\tau \in [0, T]$. We consider their evolution starting from the same point at $\tau = 0$ (forward time $t = T$), i.e., $X^\star_0 = X_{m,0} = X_T$. The endpoints of these processes at reverse time $\tau = T$ are $X^\star_T$ and $X_{m,T}$.

[30]

We express the endpoints in integral form. The difference between the endpoints is:

$$X^\star_T - X_{m,T} = \int_0^T \big[b^\star_{\mathrm{rev}}(X^\star_\tau, \tau) - b_{\mathrm{rev},m}(X_{m,\tau}, \tau)\big]\,d\tau + \int_0^T \sigma(X^\star_\tau, \tau)\,dW^\star_\tau - \int_0^T \sigma(X_{m,\tau}, \tau)\,dW_{m,\tau}.$$

We now take the expectation conditioned on the starting point $X_T$. The expectation of the stochastic integral terms is zero because Ito integrals are martingales ...
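
The step that drops the stochastic integrals relies on Ito integrals having zero mean. A small numerical check of that fact, under the assumption of the simplest nontrivial integrand (the Brownian path itself, so the integral is ∫ W dW on [0, 1]), can be sketched as follows. The path count and step count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps, T = 20_000, 100, 1.0
dt = T / n_steps

dW = rng.standard_normal((n_paths, n_steps)) * np.sqrt(dt)    # Brownian increments
W = np.cumsum(dW, axis=1)                                     # W at the right end of each step
W_left = np.hstack((np.zeros((n_paths, 1)), W[:, :-1]))       # W at the left end of each step

# Ito sum: the integrand is evaluated before the increment it multiplies.
ito_integral = np.sum(W_left * dW, axis=1)

sample_mean = ito_integral.mean()      # expected to be near 0
sample_var = ito_integral.var()        # expected to be near T^2 / 2 = 0.5
```

Evaluating the integrand at the right endpoint instead would bias the mean toward T, which is why the left-endpoint convention matters for the martingale property.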
[31]

The forward drift terms −b cancel, leaving only the score-dependent terms

We substitute the general formula for the reverse drifts. The forward drift terms −b cancel, leaving only the score-dependent terms. We also change the integration variable back to forward timet=T−τ, which meansdτ=−dtand the integration limits flip. E[X ⋆ T − Xm,T |XT ] =E Z 0 T (D(X⋆ t , t)s⋆ t (X⋆ t )−D(X m,t, t)sm,t(Xm,t)) (−dt) XT =E "Z T 0 (D(X⋆ t , ...

work page
[32]

This means that for t≥t m, the law of the model process is a good approximation of the ideal law, Pm|F≥tm ≈P ⋆|F≥tm

Now we use the crucial fact that Sm ∈K tm. This means that for t≥t m, the law of the model process is a good approximation of the ideal law, Pm|F≥tm ≈P ⋆|F≥tm . This implies that the paths and the score functions are approximately equal in expectation in this range: E[Dsm,t]≈E[Ds ⋆ t ]. The error is negligible for t≥t m. The integrand is therefore effecti...

work page
[33]

The expected residual at stepm+ 1is taken over the true data distributionp ⋆ 0(x, y): E[rm+1] =E (x,y)∼p⋆ 0[y−F m(x)]

Finally, we connect this quantity to the expected residual of the boosting algorithm. The expected residual at stepm+ 1is taken over the true data distributionp ⋆ 0(x, y): E[rm+1] =E (x,y)∼p⋆ 0[y−F m(x)]. We analyze the two terms on the right-hand side separately using the law of total expectation

The expected residual at step $m+1$ over the empirical data distribution is: $\mathbb{E}_{(x,y)\sim p^{\mathrm{emp}}_0}[r_{m+1}] = \mathbb{E}[y] - \mathbb{E}[F_m(x)]$, (60) where the expectations are over the empirical distribution $p^{\mathrm{emp}}_0 = \frac{1}{N}\sum_{i=1}^N \delta_{(x_i, y_i)}$

The empirical mean $\mathbb{E}[y] = \frac{1}{N}\sum_{i=1}^N y_i$ is an unbiased estimator of the true target mean. The endpoint of the ideal reverse process, $X^\star_T$, has law $p^\star$

Therefore: $\mathbb{E}_{p^{\mathrm{emp}}_0}[y] = \mathbb{E}_{p^\star_0}[X^\star_T] + O_p(N^{-1/2})$, (61) where $O_p(N^{-1/2})$ is the standard Monte Carlo error.
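The $O_p(N^{-1/2})$ rate in (61) can be checked numerically. The sketch below is only an illustration under assumed conditions: it draws standard normal targets (a hypothetical stand-in for the true target distribution, whose mean is 0) and compares the average absolute error of the empirical mean at two sample sizes. The function name `mean_abs_error` and all constants are illustrative, not taken from the paper.

```python
import random
import statistics


def mean_abs_error(n, trials=100, seed=0):
    """Average |empirical mean - true mean| over repeated samples of size n.

    Assumption: targets are i.i.d. standard normal, so the true mean is 0.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        errors.append(abs(sample_mean))
    return statistics.fmean(errors)


# Increasing N by a factor of 100 should shrink the error by roughly sqrt(100) = 10.
err_small = mean_abs_error(100)
err_large = mean_abs_error(10_000)
ratio = err_small / err_large
```

If the rate holds, `ratio` lands near 10; the exact value varies with the seed and the number of trials.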

The model prediction $F_m(x)$ is the leaf value assigned by the net decision tree $T_m$. By construction, $F_m(x)$ is the conditional expectation under the model's induced joint distribution: $F_m(x) = \mathbb{E}_{y\sim p_{m,0}}[y \mid x]$. (62) The unconditional expectation of the model's predictions over the empirical feature distribution is: $\mathbb{E}_{x\sim p^{\mathrm{emp}}_0}[F_m(x)] = \frac{1}{N}\sum_{i=1}^N F_m(x_i)$. (63) ...

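The key approximation relates these two expectations. Define the distribution mismatch: $\Delta_m := \mathbb{E}_{x\sim p^{\mathrm{emp}}_0}[F_m(x)] - \int F_m(x)\,p_{m,0}(x)\,dx$. (65) We decompose this into two error sources: $\Delta_m = \underbrace{\mathbb{E}_{x\sim p^{\mathrm{emp}}_0}[F_m(x)] - \mathbb{E}_{x\sim p^\star_0}[F_m(x)]}_{\text{Sampling error } \epsilon_{\mathrm{samp}}} + \underbrace{\mathbb{E}_{x\sim p^\star_0}[F_m(x)] - \mathbb{E}_{x\sim p_{m,0}}[F_m(x)]}_{\text{Model error } \epsilon_{\mathrm{model}}}$. (66) Sampling error bound: By the central limit theor...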

Combining the bounds from (61) and the analysis above: $\mathbb{E}[r_{m+1}] = \mathbb{E}[y] - \mathbb{E}[F_m(x)] \approx \mathbb{E}[X^\star_T] - \mathbb{E}[X_{m,T}] = \mathbb{E}[X^\star_T - X_{m,T}]$. Therefore, the expected residual is an asymptotically unbiased estimator of the integrated score error, with the bias vanishing as $N \to \infty$ and $D_{\mathrm{KL}}(p^\star_0 \,\|\, p_{m,0}) \to 0$. Corollary E.25 (Conditions for Exact Equivalence). The meta-score equals the integrat...

The model class contains the true conditional $\mathbb{E}[y \mid x]$ (realizability),

The boosting algorithm has converged: $D_{\mathrm{KL}}(p^\star_0 \,\|\, p_{m,0}) < \epsilon$ for arbitrarily small $\epsilon > 0$. In practice, the approximation quality is controlled by the sample size $N$ and the residual variance at step $m$.

E.4 Global Optimality of the Greedy Trajectory

We have proven that each step of the gradient...

Base Case ($m = M-1$): At the final stage, the decision is to choose $h_M$ to transition from state $S_{M-1}$. The optimal cost-to-go is simply the minimum possible cost at this stage, as there are no future costs: $V_{M-1}(S_{M-1}) = \min_{h_M \in A_{\epsilon_{M-1}}} C(S_{M-1}, h_M)$. The optimal action, $\pi^\star_{M-1}(S_{M-1})$, is by definition the greedy choice that minimizes this immediate cost

Inductive Hypothesis: Assume that for all steps $k > m$, the optimal policy $\pi^\star_k$ is the greedy policy.

3. Inductive Step ($m$): The Bellman equation for the optimal cost-to-go at step $m$ is: $V_m(S_m) = \min_{h_{m+1} \in A} [C(S_m, h_{m+1}) + V_{m+1}(T(S_m, h_{m+1}))]$. Over the finite action space $A_{\epsilon_m}$, the minimum in the Bellman equation is well-defined: $V_m(S_m) = \min_{h_{m+1} \in A_{\epsilon_m}}$ [C(S_m, h...
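The backward-induction argument can be made concrete on a toy finite-horizon problem. The following sketch is a hedged illustration, not the paper's construction: the action set `ACTIONS`, the stage targets `TARGETS`, and the quadratic stage cost are all hypothetical. It evaluates the Bellman recursion exactly and compares it with a purely greedy rollout. In this toy instance the two coincide; in general that requires assumptions such as those invoked in the proof.

```python
from functools import lru_cache

ACTIONS = (-1, 0, 1)    # hypothetical finite action set
TARGETS = (1, 2, 3, 4)  # hypothetical per-stage targets defining the stage cost


def cost(stage, state, action):
    """Immediate cost C(S_m, h_{m+1}): squared gap to the stage target."""
    return (TARGETS[stage] - (state + action)) ** 2


def transition(state, action):
    """Deterministic transition T(S_m, h_{m+1})."""
    return state + action


@lru_cache(maxsize=None)
def cost_to_go(stage, state):
    """Bellman recursion: V_m(S) = min_h [C(S, h) + V_{m+1}(T(S, h))]."""
    if stage == len(TARGETS):
        return 0  # no future costs after the final stage
    return min(
        cost(stage, state, a) + cost_to_go(stage + 1, transition(state, a))
        for a in ACTIONS
    )


def greedy_total(state):
    """Total cost of the policy that minimizes only the immediate cost."""
    total = 0
    for stage in range(len(TARGETS)):
        best = min(ACTIONS, key=lambda a: cost(stage, state, a))
        total += cost(stage, state, best)
        state = transition(state, best)
    return total


optimal_from_zero = cost_to_go(0, 0)
greedy_from_zero = greedy_total(0)
```

The optimal cost-to-go is a lower bound on the greedy total by construction, so a gap between the two would signal a case where myopic choices are suboptimal.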

The base tree $T$ has bounded depth $D < \infty$ and bounded number of leaves $L = O(2^D)$

The feature space is bounded: $\|x\| \le B$ for all $x \in \mathcal{X}$

The loss function $\ell_{\mathrm{CE}}$ is $G$-Lipschitz continuous in the model parameters.

Theorem G.5 (Finite-Sample Convergence of DSM-TREE). Under Assumption G.4, let $\hat\theta_T$ be the parameters obtained after $T$ gradient descent steps with learning rate $\eta = O(1/\sqrt{T})$ and batch size $B$. Then with probability at least $1 - \delta$: $\mathcal{L}_{\mathrm{DSM}}(\hat\theta_T) - \mathcal{L}_{\mathrm{DSM}}(\theta^\ast) \le O\!\left(\frac{D \cdot \sqrt{d \log(L/\delta)}}{\sqrt{BT}}\right)$, (79) where $\theta^\ast$ is t...

The data distribution $p_{\mathrm{data}}(x, y)$ has bounded support: $\|x\| \le B$ for all $x \in \mathrm{supp}(p_{\mathrm{data}})$

The base tree $T$ has depth $D$ and $L = O(2^D)$ leaves, with balanced partitioning (each leaf has comparable probability mass)

The velocity network $v_\theta$ is $G$-Lipschitz continuous in $x$ and $\theta$

The path encoding $p = \mathrm{PathEncoder}(T, x)$ is deterministic and has bounded norm: $\|p\| \le P$.

Theorem H.3 (Finite-Sample Convergence of TREEFLOW). Under Assumption H.2, let $\hat\theta_S$ be the parameters obtained after $S$ gradient descent steps with learning rate $\eta = O(1/\sqrt{S})$ and batch size $B$. Then with probability at least $1 - \delta$: $\mathcal{L}_{\mathrm{TREEFLOW}}(\hat\theta_S) - \mathcal{L}_{\mathrm{TREEFLOW}}(\theta^\ast) \le$ O(√(d·L·log(L/δ)) / √ ...

Sampling $B$ data points and computing their path encodings: $O(B \cdot D)$ (traversing tree to depth $D$)

Sampling noise $x^{(0)}$ and time $t$: $O(B \cdot d)$

Computing interpolations $x^{(t)}$: $O(B \cdot d)$

Forward pass through velocity network: $O(B \cdot C_{\mathrm{net}})$

Computing loss and gradients: $O(B \cdot d)$

Backward pass and parameter update: $O(B \cdot C_{\mathrm{net}})$

The dominant terms are the path encoding (which requires $O(D)$ traversal per sample) and the network forward-backward passes. For a network with hidden dimension $h$, $C_{\mathrm{net}} = O(d \cdot h + K \cdot h)$ where $K$ is the path encoding dimension (typically $K = L$ for full decision path representation). Multiplying by $S$ steps and simplifying: O(...

DSM-TREE solves CGTSM for discriminative modeling: learn the coarse-to-fine trajectory of decision boundaries that minimizes classification error

TREEFLOW solves CGTSM for generative modeling: learn the coarse-to-fine trajectory of distributions that minimizes generation error (Wasserstein distance).

Both achieve this by explicitly modeling the hierarchical structure encoded in decision trees as a trajectory through tail-equiva...

Initialize Clusters: The process begins with each of the $K$ ground-truth data classes corresponding to a distinct, active cluster at time $t = 0$

Simulate Learned Forward SDE: For each initial cluster, we simulate its evolution forward in time from $t = 0$ to $t = T$. This is not the simple analytical forward process used for training. Instead, we use a discretized Euler-Maruyama scheme to solve the learned forward SDE, where the drift at each step is determined by the model's own score function: f_learned(...

Track Centroid Trajectories: During the simulation, for each cluster and at each time step $t_i$, we compute and store two quantities: the geometric centroid of the cluster's points and their average spread (mean distance from the centroid). This yields a full trajectory for each cluster's centroid and spread over time

Iterative Merge Search: The algorithm proceeds via agglomerative clustering. It iterates until only one cluster remains. In each iteration, it searches for the next merge event by checking every possible pair of currently active clusters to find which pair becomes indistinguishable at the earliest future time

Merge Criterion: Two clusters are defined as "merged" at the first time step $t_i$ where the Euclidean distance between their centroids becomes smaller than the sum of their spreads. This criterion signifies the moment their probability distributions have substantially overlapped

Dendrogram Construction: The pair of clusters with the minimum merge time is formally merged into a new, larger cluster. The event comprising the two original cluster IDs, the merge time $t_i$, and the number of original leaves in the new cluster, is recorded. This sequence of recorded merge events directly forms the linkage matrix for the dendrogram, where t...
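The merge search, merge criterion, and linkage recording described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the cluster trajectories are hand-built 1-D toy data (`toy_trajectory` is hypothetical) rather than the output of a learned SDE, a merged cluster is assumed to be the union of its members' points, new cluster IDs are assumed to continue counting from $K$, and the search for each next merge is assumed to start at the time step of the previous merge.

```python
import itertools
import math


def centroid(points):
    """Geometric centroid of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def spread(points):
    """Average spread: mean Euclidean distance from the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


def first_merge_step(traj_a, traj_b, start):
    """First step i >= start where centroid distance < sum of spreads."""
    for i in range(start, len(traj_a)):
        gap = math.dist(centroid(traj_a[i]), centroid(traj_b[i]))
        if gap < spread(traj_a[i]) + spread(traj_b[i]):
            return i
    return None


def build_linkage(trajectories):
    """trajectories[c][i] is the list of points of leaf cluster c at step i."""
    active = dict(enumerate(trajectories))
    leaves = {cid: 1 for cid in active}
    next_id, start, linkage = len(active), 0, []
    while len(active) > 1:
        best = None
        for a, b in itertools.combinations(sorted(active), 2):
            step = first_merge_step(active[a], active[b], start)
            if step is not None and (best is None or step < best[0]):
                best = (step, a, b)
        if best is None:
            break  # no remaining pair overlaps within the simulated horizon
        step, a, b = best
        # Assumption: the merged cluster holds the union of both point sets.
        active[next_id] = [pa + pb for pa, pb in zip(active.pop(a), active.pop(b))]
        leaves[next_id] = leaves[a] + leaves[b]
        linkage.append((a, b, step, leaves[next_id]))
        next_id, start = next_id + 1, step
    return linkage


def toy_trajectory(center, steps=6):
    """Two 1-D points around `center` whose separation grows with the step."""
    return [[(center - (0.1 + 0.3 * i),), (center + (0.1 + 0.3 * i),)]
            for i in range(steps)]


linkage = build_linkage([toy_trajectory(0.0), toy_trajectory(1.0), toy_trajectory(3.0)])
```

Each recorded row holds the two cluster IDs, the merge step index, and the leaf count, mirroring the event description in the text; converting step indices to times $t_i$ would depend on the discretization used.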

Teacher Model Generation: An oracle RandomForestClassifier (100 estimators, depth ...) is trained. A single DecisionTreeClassifier (depth 15), which serves as our "Base Tree" baseline, is then trained on the pseudo-labels from this oracle

DSM-TREE Training: A ConditionalSplitModel is trained. This network uses an embedding layer for the tree level $j$ (embedding dim = 32) and a 2-hidden-layer MLP (256 units each, with ReLU and BatchNorm) to predict the split decision. It is trained for 30,000 steps (batch size 256) using Adam with a learning rate of $10^{-3}$

Inference: To make a prediction, the DSM-TREE model is queried iteratively from level $j = 0$ to simulate traversal down the tree until a leaf is reached.

Results

The detailed performance metrics are provided in Table 2 with a magnified visualization in Figure 14. The results demonstrate that the DSM-TREE model, a fully differentiable neural network, can successf...
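The iterative inference loop can be sketched as below. This is a hedged illustration only: the trained ConditionalSplitModel is replaced by a hand-written rule (`toy_split_model`), and the leaf lookup table `LEAF_LABELS`, keyed by the sequence of left/right decisions, is a hypothetical representation that the source does not specify.

```python
def predict_by_traversal(x, split_model, leaf_labels, max_depth):
    """Query the split model level by level until the path reaches a leaf.

    split_model(x, level) returns 0 (go left) or 1 (go right).
    leaf_labels maps a decision path (tuple of 0/1) to a class label.
    """
    path = ()
    for level in range(max_depth):
        if path in leaf_labels:
            break  # a leaf was reached before the maximum depth
        path += (split_model(x, level),)
    return leaf_labels[path]


def toy_split_model(x, level):
    """Stand-in for the trained network: threshold feature `level` at 0.5."""
    return int(x[level] > 0.5)


# Hypothetical tree: the left child of the root is a leaf, the right child splits again.
LEAF_LABELS = {(0,): "A", (1, 0): "B", (1, 1): "C"}

label_a = predict_by_traversal((0.2, 0.9), toy_split_model, LEAF_LABELS, max_depth=2)
label_b = predict_by_traversal((0.9, 0.1), toy_split_model, LEAF_LABELS, max_depth=2)
label_c = predict_by_traversal((0.9, 0.9), toy_split_model, LEAF_LABELS, max_depth=2)
```

In a real setting the split decision would come from the network conditioned on the level embedding; only the control flow of the traversal is shown here.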

Tree Encoder: A DecisionTreeClassifier (depth 10) is trained. A TreePathEncoder class converts the decision path of any sample into a sparse vector encoding, where the value at an index is the inverse of the node's depth
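A path encoding of this kind might look like the sketch below. It is not the paper's TreePathEncoder: the tree is a small hand-written array representation (`FEATURE`, `THRESHOLD`, `LEFT`, `RIGHT` are hypothetical), and the root is assumed to have depth 1 so that the inverse depth is defined everywhere, a convention the source does not state.

```python
# Hypothetical 5-node tree stored as parallel arrays; FEATURE == -1 marks a leaf.
FEATURE = [0, 1, -1, -1, -1]
THRESHOLD = [0.5, 0.3, 0.0, 0.0, 0.0]
LEFT = [1, 2, -1, -1, -1]
RIGHT = [4, 3, -1, -1, -1]


def encode_path(x):
    """Sparse vector with 1/depth at every node visited by x, 0 elsewhere.

    Assumption: depth is counted from 1 at the root.
    """
    vec = [0.0] * len(FEATURE)
    node, depth = 0, 1
    while True:
        vec[node] = 1.0 / depth
        if FEATURE[node] == -1:
            return vec  # reached a leaf
        go_left = x[FEATURE[node]] <= THRESHOLD[node]
        node = LEFT[node] if go_left else RIGHT[node]
        depth += 1


deep_path = encode_path((0.2, 0.9))     # visits nodes 0, 1, 3
shallow_path = encode_path((0.9, 0.0))  # visits nodes 0, 4
```

Under this convention coarse splits near the root receive larger weights than fine splits near the leaves, and the vector has at most depth-many nonzero entries.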

TREEFLOW Model: The model is a conditional MLP that takes as input $(x, t, p, y)$ and outputs a velocity. It uses embeddings for $y$ (embedding dim = 16) and a 2-hidden-layer MLP (512 units each, with SiLU and LayerNorm). It is trained for 1000 steps using AdamW (lr $= 10^{-3}$) on the conditional flow matching MSE loss.

3. Generation: We provide a target class label $y$ and...
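The conditional flow matching MSE loss can be illustrated with a small sketch. Assumptions are flagged here because the source does not spell them out: the interpolation path is taken to be linear, $x^{(t)} = (1-t)\,x^{(0)} + t\,x^{(1)}$, with target velocity $x^{(1)} - x^{(0)}$; the conditioning on the path encoding $p$ and the label $y$ is omitted; and the velocity network is replaced by a plain Python function.

```python
import random


def cfm_batch(data, rng):
    """Build (x_t, t, target_velocity) triples for a batch of data points.

    Assumption: linear interpolation between Gaussian noise x0 and data x1.
    """
    batch = []
    for x1 in data:
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]
        t = rng.random()
        x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
        target = [b - a for a, b in zip(x0, x1)]
        batch.append((x_t, t, target))
    return batch


def cfm_loss(velocity_fn, batch):
    """Mean squared error between predicted and target velocities."""
    total, count = 0.0, 0
    for x_t, t, target in batch:
        pred = velocity_fn(x_t, t)
        total += sum((p - q) ** 2 for p, q in zip(pred, target))
        count += len(target)
    return total / count


# Toy check: if every data point equals c, the exact velocity is (c - x_t) / (1 - t).
c = [1.0, -2.0]
batch = cfm_batch([c] * 64, random.Random(0))
exact_loss = cfm_loss(lambda x_t, t: [(ci - xi) / (1.0 - t) for ci, xi in zip(c, x_t)], batch)
zero_loss = cfm_loss(lambda x_t, t: [0.0 for _ in x_t], batch)
```

The single-point dataset is chosen because its velocity field has a closed form, which makes it easy to confirm that the loss vanishes for the exact field and is positive for a trivial one.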


Axiom 1: Measurability (Information Erasure). The output density $p(x, k)$ must not contain information about the boundaries that were erased in the transition from $\Pi_{k-1}$. This means $p(x, k)$ must be measurable with respect to the coarser sigma-algebra $\mathcal{F}_k$ generated by the partition $\Pi_k$

Axiom 2: Partial Averages (Probability Conservation). The probability mass must be conserved within every region of the new partition. This property must extend to all sets $A \in \mathcal{F}_k$, such that: $\int_A p(x, k)\,dx = \int_A p(x, k-1)\,dx \quad \forall A \in \mathcal{F}_k$. (11) A fundamental theorem of measure theory states that for a given function $p(x, k-1)$ and a sigma-algebra $\mathcal{F}_k$, there exists a ...
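On a piecewise-constant density the two axioms reduce to averaging over merged cells. The sketch below is an illustrative discretization, not the paper's operator: `coarse_grain`, the cell widths, and the grouping of fine cells into coarse regions are hypothetical names and toy values.

```python
def coarse_grain(density, widths, groups):
    """Replace the density on each coarse region by its volume-weighted average.

    density[i], widths[i]: value and width of fine cell i.
    groups: list of index lists, one per region of the coarser partition.
    """
    out = list(density)
    for group in groups:
        mass = sum(density[i] * widths[i] for i in group)
        volume = sum(widths[i] for i in group)
        for i in group:
            out[i] = mass / volume  # constant on the region (Axiom 1)
    return out


def region_mass(density, widths, group):
    """Probability mass of one region."""
    return sum(density[i] * widths[i] for i in group)


fine = [0.5, 1.5, 1.0, 1.0]   # density at level k-1 on four cells
widths = [0.25] * 4
groups = [[0, 1], [2, 3]]      # level-k partition merges cells pairwise
coarse = coarse_grain(fine, widths, groups)
```

The averaged density is constant on each coarse region and carries the same mass there as the fine density, which are the discrete counterparts of measurability and of equation (11).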

Consistent convergence: For any $t_1 < t_2$, the sequence of propagators $\{P^{(n)}_{t_2 \leftarrow t_1}\}$ converges to a limit $P_{t_2 \leftarrow t_1}$

Bounded intermediate complexity: The "geometric complexity" added at refinement step $n$, measured by a suitable metric on partitions, is $O(2^{-n})$. Then: (i) The limiting process has a generator $G_t$ with local Lipschitz structure on any compact $I \subset [0, T]$. (ii) For any sequence of compact intervals $I_m \uparrow [0, T]$, there exists a subsequence of refinements such that t...

The series terminates after the second term, i.e., $D^{(n)}(x, t) \equiv 0$ for all $n > 2$

The series does not terminate, and all even-ordered moments $D^{(2m)}$ for $m \ge 1$ are strictly positive. The expansion cannot terminate at any finite order greater than two.

Proof. The proof demonstrates that if any moment $D^{(n)}$ for $n > 2$ is non-zero, then moments of arbitrarily high order must also be non-zero. For clarity, we present the proof in one dimension; the...

to truncate exactly after the second term. The remaining $n = 1$ (drift) and $n = 2$ (diffusion) terms constitute the Fokker-Planck equation, which describes the evolution of the probability density $p(x, t)$: $\frac{\partial p(x, t)}{\partial t} = -\sum_{i=1}^{d} \frac{\partial}{\partial x_i}\big[f_i(x)\,p(x, t)\big] + \frac{1}{2}\sum_{i,j=1}^{d} \frac{\partial^2}{\partial x_i \partial x_j}\big[(g(x)g(x)^\top)_{ij}\,p(x, t)\big]$. (43) Second, we establish the equivalence to the SDE. It is a st...
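The link between equation (43) and a sample-path description can be illustrated numerically. The sketch below is a toy under stated assumptions: it uses a one-dimensional Ornstein-Uhlenbeck process with drift $f(x) = -x$ and diffusion $g(x) = \sqrt{2}$ (choices made here, not in the paper), for which the stationary solution of the Fokker-Planck equation is a standard normal, and it integrates the Itô SDE $dX = f(X)\,dt + g(X)\,dW$ with the Euler-Maruyama scheme.

```python
import math
import random


def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW and return X at the end."""
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x += drift(x) * dt + diffusion(x) * dw
    return x


rng = random.Random(0)
endpoints = [
    euler_maruyama(lambda x: -x, lambda x: math.sqrt(2.0), 0.0, 0.02, 200, rng)
    for _ in range(2000)
]
# The Fokker-Planck stationary density for this choice is N(0, 1),
# so the second moment of the endpoints should be close to 1.
second_moment = sum(x * x for x in endpoints) / len(endpoints)
```

The agreement is approximate: the time discretization and the finite number of paths both introduce small errors.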

The basins of attraction $B_k := \{x : \lim_{t \to \infty} \phi_t(x) = x^\ast_k\}$ under the gradient flow $\dot{x} = \nabla \log p_0(x)$ form a partition of $\mathcal{X}$,

There exists $\rho_{\min} > 0$ such that $\inf_{x \in \partial B_k} p_0(x) < p_0(x^\ast_k) - \rho_{\min}$ for all $k$.

Definition D.7 (Level-Set Clustering). For a threshold $\lambda \in (0, \max_k p_0(x^\ast_k))$, define the super-level set: $S_\lambda := \{x : p_0(x) \ge \lambda\}$. (45) The initial clusters are the connected components of $S_\lambda$ for $\lambda$ chosen such that $S_\lambda$ has exactly $K$ components, one containing each mode $x^\ast_k$. Proposition D.8 (Well-D...
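Definition D.7 can be illustrated on a one-dimensional grid, where connected components of the super-level set are simply runs of consecutive grid points. The sketch below is only a toy: the two-mode Gaussian mixture, the grid, and the threshold value are hypothetical choices, and `super_level_components` is an illustrative helper name.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def density(x):
    """Toy p_0 with modes at -2 and +2 (hypothetical example)."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


def super_level_components(values, lam):
    """Connected components of {i : values[i] >= lam} on a 1-D grid."""
    components, current = [], []
    for i, v in enumerate(values):
        if v >= lam:
            current.append(i)
        elif current:
            components.append(current)
            current = []
    if current:
        components.append(current)
    return components


grid = [-5.0 + 0.05 * i for i in range(201)]  # grid[60] is near -2, grid[140] near +2
values = [density(x) for x in grid]
clusters = super_level_components(values, lam=0.1)
```

With this threshold the set splits into two components, one around each mode; raising or lowering `lam` changes the number of components, which is why the definition ties the choice of threshold to the target count $K$.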

Initialization (Leaves at $t = 0$): At time $t = 0$, all $K$ clusters $\{C_1, \ldots, C_K\}$ are distinct by definition, forming the leaves of our hierarchy

First Merger: We compute the set of all pairwise merger times, $\{t^{(n,\epsilon)}_{ij}\}$ for $i \ne j$. The first merger event occurs at the minimum of these times, $t_1 = \min_{i,j} t^{(n,\epsilon)}_{ij}$. Let this minimum occur for the pair $(C_a, C_b)$. At time $t_1$, we merge $C_a$ and $C_b$ into a new super-cluster, $C_{ab}$. This event defines the lowest-level branch in the hierarchy

Iterative Merging: We now have a set of $K - 1$ clusters. We repeat the process, computing the merger times between all pairs in this new set (including between the new super-cluster and other original clusters). The next merger event at time $t_2 > t_1$ defines the next branch. This agglomerative process continues, and the ordered sequence of merger times defines...

Defining the Root: A maximally entropic distribution, such as a uniform distribution over a manifold $\mathcal{X}$, represents a state of complete information loss relative to that manifold. It is a distribution that has no further structural information to lose. This state acts as the single, common ancestor for all possible paths: the root of the tree. Any partition...

Consistency with Coarse-Graining: The mapping from a tree to an SDE in Section C is based on a coarse-graining operator defined by conditional expectation. This operator models a pure process of information aggregation, or entropy increase. A forward diffusion process is only guaranteed to be a pure entropy-increasing process from any initial state if its st...

Partition: Its partition of the feature space, $\Pi_m$, is the common refinement of the partitions induced by all constituent weak learners $\{h_1, \ldots, h_m\}$. That is, a region $R \in \Pi_m$ is the largest possible connected subset of $\mathcal{X}$ such that for any two points $x_a, x_b \in R$, $h_i(x_a) = h_i(x_b)$ for all $i \in \{1, \ldots, m\}$

Leaf Values: The value assigned to any leaf region in $\Pi_m$ is the sum of the predictions from all constituent trees, $\sum_{i=1}^{m} \eta h_i(x)$, where $\eta$ is the learning rate. This abstraction provides a monolithic, non-additive representation of the ensemble, allowing us to rigorously analyze the evolution of its decision boundaries as a single geometric object.

E.1.3 Monot...
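For one-dimensional decision stumps the net decision tree can be written out explicitly. The sketch below is a toy under stated assumptions: the stumps in `STUMPS`, the learning rate `ETA`, and the helper names are hypothetical. The common refinement is the set of intervals cut by all thresholds, and each interval's leaf value is the sum of the scaled stump outputs.

```python
import bisect

ETA = 0.5  # hypothetical learning rate
# Hypothetical weak learners: (threshold, value if x <= threshold, value otherwise).
STUMPS = [(0.0, -1.0, 1.0), (2.0, 0.0, 2.0), (0.0, -0.5, 0.5)]


def ensemble(x):
    """Additive prediction: sum_i ETA * h_i(x)."""
    return sum(ETA * (left if x <= thr else right) for thr, left, right in STUMPS)


def build_net_tree(stumps):
    """Common refinement of all stump partitions, with one leaf value per interval."""
    cuts = sorted({thr for thr, _, _ in stumps})
    # One representative point inside each interval (-inf, c0], (c0, c1], ..., (c_last, inf).
    reps = [cuts[0] - 1.0]
    reps += [(a + b) / 2.0 for a, b in zip(cuts, cuts[1:])]
    reps += [cuts[-1] + 1.0]
    return cuts, [ensemble(r) for r in reps]


def net_predict(x, cuts, leaves):
    """Look up the leaf of the single net tree containing x."""
    return leaves[bisect.bisect_left(cuts, x)]


cuts, leaves = build_net_tree(STUMPS)
```

The single tree reproduces the ensemble exactly, because every stump is constant on each interval of the refinement; two stumps sharing a threshold contribute only one cut.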

Leaf Partition: The set of its leaf nodes corresponds exactly to the regions of the partition $\Pi_m$

Hierarchical Structure: The tree's internal structure is defined by the unique, nested sequence of partitions generated by the boosting algorithm itself: $\{\Pi_0, \Pi_1, \ldots, \Pi_m\}$. The coarsening from the leaves ($\Pi_m$) to the root ($\Pi_0 = \{\mathcal{X}\}$) is defined by reversing the historical refinement process. The parent of a node in partition level $\Pi_k$ is the unique regio...

In the context of a diffusion process, this fine-grained information corresponds to the behavior at earlier times. Therefore, the model $S_{m+1}$, being more refined, provides a law $P_{m+1}$ that is a faithful approximation of the ideal law $P^\star$ on a larger tail $\sigma$-algebra than $S_m$ can. Let us formalize this. For each model $S_m$, let $t_m$ be the earliest time for which its...

Greedy stage-wise algorithms (e.g., boosting, matching pursuit): By construction, $L_k$ is optimized before $L_{k+1}$ is considered

Hierarchical architectures (e.g., progressive growing, curriculum learning): The model capacity for finer scales is introduced only after coarser scales are learned

Spectral bias in neural networks (empirical): Implicit regularization causes neural networks to learn low-frequency (coarse) components faster than high-frequency (fine) components (Rahaman et al., 2019).

Remark E.15 (Falsifiability). This assumption can be tested empirically by measuring $\tau_k$ via the learning curves of $L_k(\theta_t)$. For boosting, it holds by de...

The structurally-supervised learner knows the next target class is $K^\star_{t_{m+1}}$. Its task is to find an operator $R^\star_{m+1}$ that transitions $S^\star_m$ to a model $S^\star_{m+1} \in K^\star_{t_{m+1}}$

The data-supervised learner (boosting) does not know the target class $K^\star_{t_{m+1}}$. It uses the local data supervision (residuals) to find an update $h^\star_{m+1}$ that produces a new model $S_{m+1}$

Here we invoke Assumption E.14. The coarse-to-fine learning bias guarantees that at step $m$, the model $S_m$ is already $\epsilon$-optimal for the tail-field $\mathcal{F}_{\ge t_m}$. The "coarsest" remaining part of the learning problem is to correctly model the dynamics in the time interval $[t_{m+1}, t_m)$

The assumption guarantees that the optimal, data-supervised update $h^\star_{m+1}$ will be precisely the one that resolves this coarsest remaining part of the problem. By doing so, it produces a new model $S_{m+1}$ whose law now correctly matches the ideal law on the tail-field $\mathcal{F}_{\ge t_{m+1}}$

By definition, any such model is a member of the equivalence class $K^\star_{t_{m+1}}$. Since the update is optimal, it must be the optimal such model, $S^\star_{m+1}$. Therefore, $S_{m+1} = S^\star_{m+1}$. By the principle of induction, the trajectories are identical. This crucial equivalence justifies our use of the DSM mathematical machinery to analyze the local, data-supervised boost...

Let $X^\star_t$ be a path generated by the ideal process SDE$^\star$, and let $X_{m,t}$ be a path generated by our model $S_m$. A core assumption of the score-based modeling framework is that the forward process, defined by drift $b(x, t)$ and diffusion tensor $D(x, t)$, is a fixed, known process independent of the data. The learning task is to find the correct score function to ...

Let $X^\star_\tau$ and $X_{m,\tau}$ be the corresponding reverse-time processes, using the reverse time variable $\tau \in [0, T]$. We consider their evolution starting from the same point at $\tau = 0$ (forward time $t = T$), i.e., $X^\star_0 = X_{m,0} = X_T$. The endpoints of these processes at reverse time $\tau = T$ are $X^\star_T$ and $X_{m,T}$
