pith. sign in

arxiv: 2601.02735 · v2 · submitted 2026-01-06 · 💻 cs.LG · cs.DS· cs.PF

Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels

Pith reviewed 2026-05-16 17:50 UTC · model grok-4.3

classification 💻 cs.LG cs.DScs.PF
keywords decision forestsproximity kernelssparse factorizationleaf collisionskernel methodsrandom forestssimilarity measures
0
0 comments X

The pith

Decision forest proximities admit an exact sparse factorization through separable weighted leaf-collision kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common forest proximity measures can all be rewritten as weighted sums over pairs of samples sharing the same leaf in some tree. This shared structure lets the proximity matrix be factored into a sparse leaf-incidence matrix times a diagonal weighting matrix. The result replaces the usual quadratic all-pairs loop with sparse linear algebra that runs in near-linear time and memory for typical forests. A sympathetic reader cares because this removes the main computational barrier to using forest similarities inside larger kernel or embedding pipelines. The approach also yields an explicit low-dimensional leaf-space representation of the data that can be used directly for downstream tasks.

Core claim

By defining the Separable Weighted Leaf-Collision (SWLC) kernel class, the authors prove that most existing forest proximity definitions differ only in how they weight leaf collisions while sharing the same sparse incidence structure between samples and leaves. This common structure produces an exact factorization of the finite-sample proximity matrix as a product of the leaf-incidence matrix with its transpose, scaled by a diagonal matrix of leaf weights, without ever computing all pairwise comparisons.

What carries the argument

The Separable Weighted Leaf-Collision (SWLC) kernel, which expresses proximity as a sum over shared leaves with sample- and leaf-dependent weights, enabling the sparse factorization.

Load-bearing premise

That existing forest proximity definitions can be expressed exactly as separable weighted sums over leaf collisions and that the leaf-incidence matrix remains sufficiently sparse under standard forest hyperparameters.

What would settle it

A counter-example where a standard forest proximity cannot be written as a separable weighted leaf-collision kernel, or an empirical test showing quadratic scaling instead of near-linear when the number of samples grows while keeping forest size fixed.

Figures

Figures reproduced from arXiv: 2601.02735 by Adrien Aumon, Guy Wolf, Jake S. Rhodes, Kevin R. Moon.

Figure 1
Figure 1. Figure 1: Empirical complexity and scalability analysis on the Fashion-MNIST dataset com [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
read the original abstract

Decision forests induce supervised similarities through the partition structure of their trees. Yet forest proximity computation is still often treated as a quadratic operation in the number of samples, which limits scalability and restricts broader use in kernel and representation-learning pipelines. We introduce a unified view of leaf-collision forest proximities through a class of Separable Weighted Leaf-Collision (SWLC) kernels, showing that most existing proximities differ only in their weighting scheme while sharing a common sparse leaf-incidence structure. This yields an explicit leaf-space representation that clarifies their kernel interpretation and leads to an exact finite-sample sparse factorization of the proximity matrix, avoiding an explicit all-pairs comparison and reducing computation to sparse linear algebra over leaf collisions. We implement this framework in a memory-efficient Python library and show, both theoretically and empirically, that exact kernel computation scales near-linearly in time and memory under standard forest regimes. Benchmarks verify the predicted scaling behavior in practice across datasets, proximity definitions, and forest settings, and show that the resulting sparse leaf-space representation can also be used directly for fast task-aware embedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript unifies existing forest proximity definitions under a class of Separable Weighted Leaf-Collision (SWLC) kernels. It derives an exact finite-sample factorization K = L W L^T of the proximity matrix from the sparse leaf-incidence matrix L, reducing computation to sparse linear algebra over leaf collisions without explicit all-pairs comparisons. The work provides theoretical arguments and empirical benchmarks showing near-linear time and memory scaling under standard forest regimes, along with a memory-efficient implementation and direct use of the leaf-space representation for task-aware embeddings.

Significance. If the sparsity assumptions hold, the exact factorization and its reduction to sparse linear algebra constitute a clear practical advance for scalable kernel methods and embeddings based on decision forests. The parameter-free derivation from the leaf-incidence definition and the reproducible scaling benchmarks are notable strengths that clarify the kernel view of forest proximities.

major comments (2)
  1. [theoretical analysis and scaling discussion] The central scalability claim (near-linear time/memory via sparse factorization) rests on the leaf-incidence matrix L remaining sufficiently sparse. No explicit bound is given on nnz(L) as a function of n_samples, max_depth, or n_estimators that would guarantee the claimed asymptotic regime outside narrow 'standard' hyperparameter settings.
  2. [unified view of SWLC kernels] The weakest assumption—that existing proximities are exactly expressible as separable weighted sums over leaf collisions—is used to justify the unified SWLC class, yet the manuscript does not provide a counter-example or proof sketch showing that all common proximity variants (e.g., those with non-separable terms) fall inside this class.
minor comments (2)
  1. [factorization derivation] The definition of the weighting matrix W should be stated explicitly with its dimensions and construction rule immediately after the factorization equation.
  2. [empirical benchmarks] Figure captions for the scaling plots should include the exact hyperparameter ranges tested (depth, n_estimators, dataset sizes) to allow direct verification of the 'standard regimes' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and the opportunity to clarify the theoretical foundations of our work. We address the major comments point by point below.

read point-by-point responses
  1. Referee: The central scalability claim (near-linear time/memory via sparse factorization) rests on the leaf-incidence matrix L remaining sufficiently sparse. No explicit bound is given on nnz(L) as a function of n_samples, max_depth, or n_estimators that would guarantee the claimed asymptotic regime outside narrow 'standard' hyperparameter settings.

    Authors: We appreciate this observation. In decision forests, each sample is assigned to exactly one leaf per tree, so the leaf-incidence matrix L has exactly nnz(L) = n_samples x n_estimators nonzeros, independent of max_depth. This bound is linear in the number of samples and estimators. We will add an explicit statement of this bound and a discussion of the conditions under which the near-linear scaling holds in the revised manuscript. revision: yes

  2. Referee: The weakest assumption—that existing proximities are exactly expressible as separable weighted sums over leaf collisions—is used to justify the unified SWLC class, yet the manuscript does not provide a counter-example or proof sketch showing that all common proximity variants (e.g., those with non-separable terms) fall inside this class.

    Authors: The manuscript focuses on standard proximity definitions (e.g., Breiman's original and common variants), which we show are exactly SWLC kernels by deriving their weighting schemes explicitly. We do not claim universality over all conceivable variants; non-separable terms would indeed lie outside the class. To clarify the scope, we will include a short discussion noting that the SWLC class encompasses the separable weighted leaf-collision proximities prevalent in the literature, with a brief example of a non-separable variant that falls outside. revision: yes

Circularity Check

0 steps flagged

No circularity: factorization follows directly from leaf-incidence definition

full rationale

The paper defines SWLC kernels from the standard leaf-collision structure of decision forests and derives the sparse factorization K = L W L^T as an algebraic identity on the leaf-incidence matrix L. This is a direct rewriting of the proximity definition rather than a self-referential loop, fitted parameter renamed as prediction, or load-bearing self-citation. No step reduces the claimed result to its own inputs by construction; the derivation remains self-contained against the external definition of forest proximities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The framework rests on the standard definition of leaf incidence in decision trees and the algebraic properties of sparse matrix multiplication; no free parameters or new physical entities are introduced.

axioms (2)
  • domain assumption Leaf-incidence matrix between samples and leaves is sparse under standard forest construction
    Invoked to obtain near-linear scaling
  • domain assumption Proximity definitions can be written as separable weighted sums over leaf collisions
    Central modeling assumption that unifies prior methods
invented entities (1)
  • Separable Weighted Leaf-Collision (SWLC) kernel no independent evidence
    purpose: Unified representation of forest proximities
    New class defined to separate weighting from incidence

pith-pipeline@v0.9.0 · 5496 in / 1307 out tokens · 50248 ms · 2026-05-16T17:50:25.618401+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Aumon, S

    A. Aumon, S. Ni, M. Lizotte, G. Wolf, K. R. Moon, and J. S. Rhodes. Random Forest Autoencoders for Guided Representation Learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, Oct. 2025

  2. [2]

    L. Breiman. Random forests.Machine learning, 45(1):5–32, 2001

  3. [3]

    Breiman and A

    L. Breiman and A. Cutler. Manual on setting up, using, and understanding random forests v4.0. Technical report, University of California, Berkeley, Department of Statis- tics, 2003

  4. [4]

    Duvaux, Q

    L. Duvaux, Q. Geissmann, K. Gharbi, J.-J. Zhou, J. Ferrari, C. M. Smadja, and R. K. Butlin. Dynamics of Copy Number Variation in Host Races of the Pea Aphid. Molecular Biology and Evolution, 32(1):63–80, Jan. 2015. ISSN 1537-1719, 0737-4038. doi: 10.1093/molbev/msu266

  5. [5]

    Warren, Lu Cheng, Haidar M

    S. Farokhi, H. Chen, K. Moon, and H. Karimi. Advancing Tabular Data Classification with Graph Neural Networks: A Random Forest Proximity Method. In2024 IEEE International Conference on Big Data (BigData), pages 7011–7020, Dec. 2024. doi: 10.1109/BigData62323.2024.10825972

  6. [6]

    J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine.The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364

  7. [7]

    Extremely randomized trees.Machine Learning, 63(1):3–42, 2006

    P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees.Machine Learning, 63(1):3–42, Apr. 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-6226-1

  8. [8]

    F. G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication and Per- muted Transposition.ACM Transactions on Mathematical Software, 4(3):250–269, Sept. 1978. ISSN 0098-3500, 1557-7295. doi: 10.1145/355791.355796

  9. [9]

    Hassine, A

    K. Hassine, A. Erbad, and R. Hamila. Important Complexity Reduction of Random Forest in Multi-Classification Problem. In2019 15th International Wireless Commu- nications & Mobile Computing Conference (IWCMC), pages 226–231, June 2019. doi: 10.1109/IWCMC.2019.8766544

  10. [10]

    J. Lee. Patient-Specific Predictive Modeling Using Random Forests: An Observational Study for the Critically Ill.JMIR Medical Informatics, 5(1):e6690, Jan. 2017. doi: 10.2196/medinform.6690

  11. [11]

    G. Louppe. Understanding Random Forests: From Theory to Practice, June 2015

  12. [12]

    J. S. Rhodes and A. G. Rustad. Random Forest-Supervised Manifold Alignment. In 2024 IEEE International Conference on Big Data (BigData), pages 3309–3312. IEEE Computer Society, Dec. 2024. ISBN 979-8-3503-6248-0. doi: 10.1109/BigData62323. 2024.10825663. 7 Aumon, Wolf, Moon and Rhodes

  13. [13]

    J. S. Rhodes, A. Aumon, S. Morin, M. Girard, C. Larochelle, B. Lahav, E. Brunet- Ratnasingham, W. Zhang, A. Cutler, A. Zhou, D. E. Kaufmann, S. Zandee, A. Prat, G. Wolf, and K. R. Moon. Gaining Biological Insights through Supervised Data Visu- alization, Nov. 2023

  14. [14]

    J. S. Rhodes, A. Cutler, and K. R. Moon. Geometry- and Accuracy-Preserving Random Forest Proximities.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10947–10959, Sept. 2023. ISSN 1939-3539. doi: 10.1109/TPAMI.2023.3263774

  15. [15]

    B. Shaw, J. Rhodes, S. F. Boubrahimi, and K. R. Moon. Forest Proximities for Time Series. In K. Arai, editor,Intelligent Systems and Applications, pages 588–598, Cham, 2026. Springer Nature Switzerland. ISBN 978-3-032-07109-5. doi: 10.1007/978-3-032-07109-5 40

  16. [16]

    S. Tan, M. Soloviev, G. Hooker, and M. T. Wells. Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. InProceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, pages 23–34, New York, NY, USA, Oct. 2020. Association for Computing Machinery. ISBN 978-1-4503-8103-1. doi: 10.1145/3412815.3416893

  17. [17]

    Virtanen, R

    P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, ˙I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perk- told, R. Cimrman, I. Henrikse...

  18. [18]

    doi: 10.1038/s41592-019-0686-2

  19. [19]

    H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, Sept. 2017. 8 Scalable Tree Ensemble Proximities Appendix A. Proofs Proof[Proposition 1] Let the columns ofQandWbe indexed byk∈ K. Since each leaf kbelongs to exactly one tree, lett(k) be the tree index containing leafk. We define the ent...