Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels

Adrien Aumon; Guy Wolf; Jake S. Rhodes; Kevin R. Moon

arxiv: 2601.02735 · v2 · submitted 2026-01-06 · 💻 cs.LG · cs.DS· cs.PF

Revisiting Forest Proximities via Sparse Leaf-Incidence Kernels

Adrien Aumon , Guy Wolf , Kevin R. Moon , Jake S. Rhodes This is my paper

Pith reviewed 2026-05-16 17:50 UTC · model grok-4.3

classification 💻 cs.LG cs.DScs.PF

keywords decision forestsproximity kernelssparse factorizationleaf collisionskernel methodsrandom forestssimilarity measures

0 comments

The pith

Decision forest proximities admit an exact sparse factorization through separable weighted leaf-collision kernels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that common forest proximity measures can all be rewritten as weighted sums over pairs of samples sharing the same leaf in some tree. This shared structure lets the proximity matrix be factored into a sparse leaf-incidence matrix times a diagonal weighting matrix. The result replaces the usual quadratic all-pairs loop with sparse linear algebra that runs in near-linear time and memory for typical forests. A sympathetic reader cares because this removes the main computational barrier to using forest similarities inside larger kernel or embedding pipelines. The approach also yields an explicit low-dimensional leaf-space representation of the data that can be used directly for downstream tasks.

Core claim

By defining the Separable Weighted Leaf-Collision (SWLC) kernel class, the authors prove that most existing forest proximity definitions differ only in how they weight leaf collisions while sharing the same sparse incidence structure between samples and leaves. This common structure produces an exact factorization of the finite-sample proximity matrix as a product of the leaf-incidence matrix with its transpose, scaled by a diagonal matrix of leaf weights, without ever computing all pairwise comparisons.

What carries the argument

The Separable Weighted Leaf-Collision (SWLC) kernel, which expresses proximity as a sum over shared leaves with sample- and leaf-dependent weights, enabling the sparse factorization.

Load-bearing premise

That existing forest proximity definitions can be expressed exactly as separable weighted sums over leaf collisions and that the leaf-incidence matrix remains sufficiently sparse under standard forest hyperparameters.

What would settle it

A counter-example where a standard forest proximity cannot be written as a separable weighted leaf-collision kernel, or an empirical test showing quadratic scaling instead of near-linear when the number of samples grows while keeping forest size fixed.

Figures

Figures reproduced from arXiv: 2601.02735 by Adrien Aumon, Guy Wolf, Jake S. Rhodes, Kevin R. Moon.

read the original abstract

Decision forests induce supervised similarities through the partition structure of their trees. Yet forest proximity computation is still often treated as a quadratic operation in the number of samples, which limits scalability and restricts broader use in kernel and representation-learning pipelines. We introduce a unified view of leaf-collision forest proximities through a class of Separable Weighted Leaf-Collision (SWLC) kernels, showing that most existing proximities differ only in their weighting scheme while sharing a common sparse leaf-incidence structure. This yields an explicit leaf-space representation that clarifies their kernel interpretation and leads to an exact finite-sample sparse factorization of the proximity matrix, avoiding an explicit all-pairs comparison and reducing computation to sparse linear algebra over leaf collisions. We implement this framework in a memory-efficient Python library and show, both theoretically and empirically, that exact kernel computation scales near-linearly in time and memory under standard forest regimes. Benchmarks verify the predicted scaling behavior in practice across datasets, proximity definitions, and forest settings, and show that the resulting sparse leaf-space representation can also be used directly for fast task-aware embedding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a clean sparse factorization for forest proximities that makes them usable at scale, though the sparsity benefit depends on keeping trees in the usual shallow regime.

read the letter

This paper gives a useful way to compute forest proximities without the usual quadratic hit. By framing them as Separable Weighted Leaf-Collision kernels, they factor the proximity matrix exactly as K = L W L^T, where L is the leaf-incidence matrix. That moves the work to sparse linear algebra over the collisions instead of all-pairs checks. What stands out is how they show that common proximity definitions are really just different weightings on the same leaf structure. The explicit sparse factorization is new in the literature they cite, and it clarifies the kernel view. They also ship a memory-efficient Python library and run benchmarks that confirm near-linear scaling in time and memory for typical forest setups. The leaf-space representation for embeddings is a practical bonus. The main limitation is the sparsity assumption. The scaling holds when the leaf-incidence matrix stays sparse, which the paper ties to standard forest regimes. But as the stress-test note points out, deeper trees or more estimators can increase the number of leaves per sample and push the density up. Without explicit bounds on depth or n_estimators where the nnz ratio stays low, the near-linear claim is conditional rather than general. That's a real but contained issue; the math itself is straightforward and the experiments look clean. This work is aimed at people using decision forests inside kernel methods, similarity-based learning, or embedding pipelines who need to handle larger datasets. The thinking is clear and the contribution is practical, so it should go to peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript unifies existing forest proximity definitions under a class of Separable Weighted Leaf-Collision (SWLC) kernels. It derives an exact finite-sample factorization K = L W L^T of the proximity matrix from the sparse leaf-incidence matrix L, reducing computation to sparse linear algebra over leaf collisions without explicit all-pairs comparisons. The work provides theoretical arguments and empirical benchmarks showing near-linear time and memory scaling under standard forest regimes, along with a memory-efficient implementation and direct use of the leaf-space representation for task-aware embeddings.

Significance. If the sparsity assumptions hold, the exact factorization and its reduction to sparse linear algebra constitute a clear practical advance for scalable kernel methods and embeddings based on decision forests. The parameter-free derivation from the leaf-incidence definition and the reproducible scaling benchmarks are notable strengths that clarify the kernel view of forest proximities.

major comments (2)

[theoretical analysis and scaling discussion] The central scalability claim (near-linear time/memory via sparse factorization) rests on the leaf-incidence matrix L remaining sufficiently sparse. No explicit bound is given on nnz(L) as a function of n_samples, max_depth, or n_estimators that would guarantee the claimed asymptotic regime outside narrow 'standard' hyperparameter settings.
[unified view of SWLC kernels] The weakest assumption—that existing proximities are exactly expressible as separable weighted sums over leaf collisions—is used to justify the unified SWLC class, yet the manuscript does not provide a counter-example or proof sketch showing that all common proximity variants (e.g., those with non-separable terms) fall inside this class.

minor comments (2)

[factorization derivation] The definition of the weighting matrix W should be stated explicitly with its dimensions and construction rule immediately after the factorization equation.
[empirical benchmarks] Figure captions for the scaling plots should include the exact hyperparameter ranges tested (depth, n_estimators, dataset sizes) to allow direct verification of the 'standard regimes' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and the opportunity to clarify the theoretical foundations of our work. We address the major comments point by point below.

read point-by-point responses

Referee: The central scalability claim (near-linear time/memory via sparse factorization) rests on the leaf-incidence matrix L remaining sufficiently sparse. No explicit bound is given on nnz(L) as a function of n_samples, max_depth, or n_estimators that would guarantee the claimed asymptotic regime outside narrow 'standard' hyperparameter settings.

Authors: We appreciate this observation. In decision forests, each sample is assigned to exactly one leaf per tree, so the leaf-incidence matrix L has exactly nnz(L) = n_samples x n_estimators nonzeros, independent of max_depth. This bound is linear in the number of samples and estimators. We will add an explicit statement of this bound and a discussion of the conditions under which the near-linear scaling holds in the revised manuscript. revision: yes
Referee: The weakest assumption—that existing proximities are exactly expressible as separable weighted sums over leaf collisions—is used to justify the unified SWLC class, yet the manuscript does not provide a counter-example or proof sketch showing that all common proximity variants (e.g., those with non-separable terms) fall inside this class.

Authors: The manuscript focuses on standard proximity definitions (e.g., Breiman's original and common variants), which we show are exactly SWLC kernels by deriving their weighting schemes explicitly. We do not claim universality over all conceivable variants; non-separable terms would indeed lie outside the class. To clarify the scope, we will include a short discussion noting that the SWLC class encompasses the separable weighted leaf-collision proximities prevalent in the literature, with a brief example of a non-separable variant that falls outside. revision: yes

Circularity Check

0 steps flagged

No circularity: factorization follows directly from leaf-incidence definition

full rationale

The paper defines SWLC kernels from the standard leaf-collision structure of decision forests and derives the sparse factorization K = L W L^T as an algebraic identity on the leaf-incidence matrix L. This is a direct rewriting of the proximity definition rather than a self-referential loop, fitted parameter renamed as prediction, or load-bearing self-citation. No step reduces the claimed result to its own inputs by construction; the derivation remains self-contained against the external definition of forest proximities.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The framework rests on the standard definition of leaf incidence in decision trees and the algebraic properties of sparse matrix multiplication; no free parameters or new physical entities are introduced.

axioms (2)

domain assumption Leaf-incidence matrix between samples and leaves is sparse under standard forest construction
Invoked to obtain near-linear scaling
domain assumption Proximity definitions can be written as separable weighted sums over leaf collisions
Central modeling assumption that unifies prior methods

invented entities (1)

Separable Weighted Leaf-Collision (SWLC) kernel no independent evidence
purpose: Unified representation of forest proximities
New class defined to separate weighting from incidence

pith-pipeline@v0.9.0 · 5496 in / 1307 out tokens · 50248 ms · 2026-05-16T17:50:25.618401+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1]

Aumon, S

A. Aumon, S. Ni, M. Lizotte, G. Wolf, K. R. Moon, and J. S. Rhodes. Random Forest Autoencoders for Guided Representation Learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, Oct. 2025

work page 2025
[2]

L. Breiman. Random forests.Machine learning, 45(1):5–32, 2001

work page 2001
[3]

Breiman and A

L. Breiman and A. Cutler. Manual on setting up, using, and understanding random forests v4.0. Technical report, University of California, Berkeley, Department of Statis- tics, 2003

work page 2003
[4]

Duvaux, Q

L. Duvaux, Q. Geissmann, K. Gharbi, J.-J. Zhou, J. Ferrari, C. M. Smadja, and R. K. Butlin. Dynamics of Copy Number Variation in Host Races of the Pea Aphid. Molecular Biology and Evolution, 32(1):63–80, Jan. 2015. ISSN 1537-1719, 0737-4038. doi: 10.1093/molbev/msu266

work page doi:10.1093/molbev/msu266 2015
[5]

Warren, Lu Cheng, Haidar M

S. Farokhi, H. Chen, K. Moon, and H. Karimi. Advancing Tabular Data Classification with Graph Neural Networks: A Random Forest Proximity Method. In2024 IEEE International Conference on Big Data (BigData), pages 7011–7020, Dec. 2024. doi: 10.1109/BigData62323.2024.10825972

work page doi:10.1109/bigdata62323.2024.10825972 2024
[6]

J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine.The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364

work page 2001
[7]

Extremely randomized trees.Machine Learning, 63(1):3–42, 2006

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees.Machine Learning, 63(1):3–42, Apr. 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-6226-1

work page doi:10.1007/s10994-006-6226-1 2006
[8]

F. G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication and Per- muted Transposition.ACM Transactions on Mathematical Software, 4(3):250–269, Sept. 1978. ISSN 0098-3500, 1557-7295. doi: 10.1145/355791.355796

work page doi:10.1145/355791.355796 1978
[9]

Hassine, A

K. Hassine, A. Erbad, and R. Hamila. Important Complexity Reduction of Random Forest in Multi-Classification Problem. In2019 15th International Wireless Commu- nications & Mobile Computing Conference (IWCMC), pages 226–231, June 2019. doi: 10.1109/IWCMC.2019.8766544

work page doi:10.1109/iwcmc.2019.8766544 2019
[10]

J. Lee. Patient-Specific Predictive Modeling Using Random Forests: An Observational Study for the Critically Ill.JMIR Medical Informatics, 5(1):e6690, Jan. 2017. doi: 10.2196/medinform.6690

work page doi:10.2196/medinform.6690 2017
[11]

G. Louppe. Understanding Random Forests: From Theory to Practice, June 2015

work page 2015
[12]

J. S. Rhodes and A. G. Rustad. Random Forest-Supervised Manifold Alignment. In 2024 IEEE International Conference on Big Data (BigData), pages 3309–3312. IEEE Computer Society, Dec. 2024. ISBN 979-8-3503-6248-0. doi: 10.1109/BigData62323. 2024.10825663. 7 Aumon, Wolf, Moon and Rhodes

work page doi:10.1109/bigdata62323 2024
[13]

J. S. Rhodes, A. Aumon, S. Morin, M. Girard, C. Larochelle, B. Lahav, E. Brunet- Ratnasingham, W. Zhang, A. Cutler, A. Zhou, D. E. Kaufmann, S. Zandee, A. Prat, G. Wolf, and K. R. Moon. Gaining Biological Insights through Supervised Data Visu- alization, Nov. 2023

work page 2023
[14]

J. S. Rhodes, A. Cutler, and K. R. Moon. Geometry- and Accuracy-Preserving Random Forest Proximities.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10947–10959, Sept. 2023. ISSN 1939-3539. doi: 10.1109/TPAMI.2023.3263774

work page doi:10.1109/tpami.2023.3263774 2023
[15]

B. Shaw, J. Rhodes, S. F. Boubrahimi, and K. R. Moon. Forest Proximities for Time Series. In K. Arai, editor,Intelligent Systems and Applications, pages 588–598, Cham, 2026. Springer Nature Switzerland. ISBN 978-3-032-07109-5. doi: 10.1007/978-3-032-07109-5 40

work page doi:10.1007/978-3-032-07109-5 2026
[16]

S. Tan, M. Soloviev, G. Hooker, and M. T. Wells. Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. InProceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, pages 23–34, New York, NY, USA, Oct. 2020. Association for Computing Machinery. ISBN 978-1-4503-8103-1. doi: 10.1145/3412815.3416893

work page doi:10.1145/3412815.3416893 2020
[17]

Virtanen, R

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, ˙I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perk- told, R. Cimrman, I. Henrikse...

work page
[18]

doi: 10.1038/s41592-019-0686-2

work page doi:10.1038/s41592-019-0686-2
[19]

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, Sept. 2017. 8 Scalable Tree Ensemble Proximities Appendix A. Proofs Proof[Proposition 1] Let the columns ofQandWbe indexed byk∈ K. Since each leaf kbelongs to exactly one tree, lett(k) be the tree index containing leafk. We define the ent...

work page 2017

[1] [1]

Aumon, S

A. Aumon, S. Ni, M. Lizotte, G. Wolf, K. R. Moon, and J. S. Rhodes. Random Forest Autoencoders for Guided Representation Learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, Oct. 2025

work page 2025

[2] [2]

L. Breiman. Random forests.Machine learning, 45(1):5–32, 2001

work page 2001

[3] [3]

Breiman and A

L. Breiman and A. Cutler. Manual on setting up, using, and understanding random forests v4.0. Technical report, University of California, Berkeley, Department of Statis- tics, 2003

work page 2003

[4] [4]

Duvaux, Q

L. Duvaux, Q. Geissmann, K. Gharbi, J.-J. Zhou, J. Ferrari, C. M. Smadja, and R. K. Butlin. Dynamics of Copy Number Variation in Host Races of the Pea Aphid. Molecular Biology and Evolution, 32(1):63–80, Jan. 2015. ISSN 1537-1719, 0737-4038. doi: 10.1093/molbev/msu266

work page doi:10.1093/molbev/msu266 2015

[5] [5]

Warren, Lu Cheng, Haidar M

S. Farokhi, H. Chen, K. Moon, and H. Karimi. Advancing Tabular Data Classification with Graph Neural Networks: A Random Forest Proximity Method. In2024 IEEE International Conference on Big Data (BigData), pages 7011–7020, Dec. 2024. doi: 10.1109/BigData62323.2024.10825972

work page doi:10.1109/bigdata62323.2024.10825972 2024

[6] [6]

J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine.The Annals of Statistics, 29(5):1189–1232, 2001. ISSN 0090-5364

work page 2001

[7] [7]

Extremely randomized trees.Machine Learning, 63(1):3–42, 2006

P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees.Machine Learning, 63(1):3–42, Apr. 2006. ISSN 1573-0565. doi: 10.1007/s10994-006-6226-1

work page doi:10.1007/s10994-006-6226-1 2006

[8] [8]

F. G. Gustavson. Two Fast Algorithms for Sparse Matrices: Multiplication and Per- muted Transposition.ACM Transactions on Mathematical Software, 4(3):250–269, Sept. 1978. ISSN 0098-3500, 1557-7295. doi: 10.1145/355791.355796

work page doi:10.1145/355791.355796 1978

[9] [9]

Hassine, A

K. Hassine, A. Erbad, and R. Hamila. Important Complexity Reduction of Random Forest in Multi-Classification Problem. In2019 15th International Wireless Commu- nications & Mobile Computing Conference (IWCMC), pages 226–231, June 2019. doi: 10.1109/IWCMC.2019.8766544

work page doi:10.1109/iwcmc.2019.8766544 2019

[10] [10]

J. Lee. Patient-Specific Predictive Modeling Using Random Forests: An Observational Study for the Critically Ill.JMIR Medical Informatics, 5(1):e6690, Jan. 2017. doi: 10.2196/medinform.6690

work page doi:10.2196/medinform.6690 2017

[11] [11]

G. Louppe. Understanding Random Forests: From Theory to Practice, June 2015

work page 2015

[12] [12]

J. S. Rhodes and A. G. Rustad. Random Forest-Supervised Manifold Alignment. In 2024 IEEE International Conference on Big Data (BigData), pages 3309–3312. IEEE Computer Society, Dec. 2024. ISBN 979-8-3503-6248-0. doi: 10.1109/BigData62323. 2024.10825663. 7 Aumon, Wolf, Moon and Rhodes

work page doi:10.1109/bigdata62323 2024

[13] [13]

J. S. Rhodes, A. Aumon, S. Morin, M. Girard, C. Larochelle, B. Lahav, E. Brunet- Ratnasingham, W. Zhang, A. Cutler, A. Zhou, D. E. Kaufmann, S. Zandee, A. Prat, G. Wolf, and K. R. Moon. Gaining Biological Insights through Supervised Data Visu- alization, Nov. 2023

work page 2023

[14] [14]

J. S. Rhodes, A. Cutler, and K. R. Moon. Geometry- and Accuracy-Preserving Random Forest Proximities.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10947–10959, Sept. 2023. ISSN 1939-3539. doi: 10.1109/TPAMI.2023.3263774

work page doi:10.1109/tpami.2023.3263774 2023

[15] [15]

B. Shaw, J. Rhodes, S. F. Boubrahimi, and K. R. Moon. Forest Proximities for Time Series. In K. Arai, editor,Intelligent Systems and Applications, pages 588–598, Cham, 2026. Springer Nature Switzerland. ISBN 978-3-032-07109-5. doi: 10.1007/978-3-032-07109-5 40

work page doi:10.1007/978-3-032-07109-5 2026

[16] [16]

S. Tan, M. Soloviev, G. Hooker, and M. T. Wells. Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable. InProceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, FODS ’20, pages 23–34, New York, NY, USA, Oct. 2020. Association for Computing Machinery. ISBN 978-1-4503-8103-1. doi: 10.1145/3412815.3416893

work page doi:10.1145/3412815.3416893 2020

[17] [17]

Virtanen, R

P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, ˙I. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perk- told, R. Cimrman, I. Henrikse...

work page

[18] [18]

doi: 10.1038/s41592-019-0686-2

work page doi:10.1038/s41592-019-0686-2

[19] [19]

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms, Sept. 2017. 8 Scalable Tree Ensemble Proximities Appendix A. Proofs Proof[Proposition 1] Let the columns ofQandWbe indexed byk∈ K. Since each leaf kbelongs to exactly one tree, lett(k) be the tree index containing leafk. We define the ent...

work page 2017