Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Aur\'elien Bellet; Christian Janos Lebeda; David Erb; Tudor Cebere

arxiv: 2605.22756 · v1 · pith:NLKFZT33new · submitted 2026-05-21 · 💻 cs.LG · cs.DS

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Christian Janos Lebeda , David Erb , Tudor Cebere , Aur\'elien Bellet This is my paper

Pith reviewed 2026-05-22 06:46 UTC · model grok-4.3

classification 💻 cs.LG cs.DS

keywords differentially private random forestsheavy hitter detectionhierarchical datadifferential privacydecision tree pruningprivacy-preserving machine learningtabular data

0 comments

The pith

Lumberjack improves differentially private random forests by growing deep trees and pruning them with a new heavy hitter detector for hierarchical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lumberjack as a way to train random forests on sensitive data while adding differential privacy. It works by first building large trees and then using a new private algorithm to find and keep only the nodes that contain enough records. The key is a heavy hitter detector whose added error grows only with the square root of the log of tree height, which keeps noise manageable even for deeper trees. This leads to models that capture more patterns than earlier private random forest methods. Experiments show better accuracy at the same privacy levels on standard datasets.

Core claim

Lumberjack constructs large random decision trees and applies aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data with error O_{ε,δ}(√log h) enables significantly deeper trees than prior work, leading to improved expressiveness under privacy constraints and higher utility on benchmark datasets.

What carries the argument

A novel (ε,δ)-differentially private heavy hitter detection algorithm for hierarchical data that identifies populated nodes with error scaling as the square root of the logarithm of tree height.

If this is right

Deeper trees become practical under differential privacy without eroding utility.
The privacy-utility tradeoff improves noticeably for budgets used in real applications.
Random forests can be applied more readily to sensitive tabular data tasks.
The heavy hitter technique may extend to other tree-structured private learning problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same detector could be tested on gradient boosted trees or other ensembles to check for similar gains.
Trying the method on datasets with higher dimensionality or more missing values would reveal its limits.
Combining the pruning step with other privacy tools like secure aggregation might yield further improvements.

Load-bearing premise

The error from the heavy hitter detector grows slowly enough with tree depth that deeper trees still produce higher accuracy after the private pruning step.

What would settle it

If Lumberjack shows no accuracy improvement over earlier differentially private random forest methods when both are tested on the same benchmark tabular datasets at a fixed privacy budget such as ε=1, the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22756 by Aur\'elien Bellet, Christian Janos Lebeda, David Erb, Tudor Cebere.

**Figure 3.** Figure 3: Test accuracy for the Adult and Folktables ACSIncome and ACSEmployment datasets. We [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Experimental results for two additional Folktables classification tasks. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

read the original abstract

Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lumberjack's main contribution is a hierarchical heavy-hitter detector with O(sqrt(log h)) error that lets them build and prune deeper DP trees, yielding measurable utility gains over prior DP random forests on tabular data.

read the letter

The paper's central move is to grow large random trees first and then prune aggressively with a new (ε,δ)-DP heavy-hitter routine designed for hierarchical data. The claimed error scaling O_ε,δ(sqrt(log h)) is what supposedly lets them push tree depth beyond what earlier DP forest methods could sustain without the noise destroying the signal. That primitive looks like the actual novelty; the rest is a sensible application to pruning nodes that fall below a private count threshold. On the empirical side they report consistent improvements in accuracy for practical privacy budgets across standard benchmarks, which is the kind of result that matters for anyone who has tried to use DP trees on real tabular data and watched utility collapse. If the numbers are stable and the composition with the tree-building steps is handled correctly, this is a concrete step forward rather than another marginal tweak. The stress-test worry about hidden dependence on branching factor or domain size is worth checking in the proofs, but the abstract and the reported gains suggest the constants work out in practice for the depths they actually use. I did not see any obvious circularity or reliance on fitted parameters that would undermine the claim. The work is aimed at people building private models for sensitive tabular datasets where random forests are still the default. A reader who needs a drop-in improvement to existing DP forest baselines will find something usable here. The algorithmic idea is clear enough and the empirical comparison is direct enough that it deserves a serious referee rather than a desk reject. I'd send it out for review and ask the authors to expand the proof sketch for the heavy-hitter bound and to add error bars or multiple runs to the tables.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Lumberjack, a differentially private random forest algorithm that constructs large decision trees and applies aggressive privacy-preserving pruning via a novel (ε,δ)-DP heavy-hitter detection algorithm for hierarchical data. The key technical contribution is an error bound of O_{ε,δ}(√log h) for trees of height h, which is claimed to enable significantly deeper trees than prior work while preserving privacy. Empirical evaluation on benchmark datasets is reported to show consistent outperformance over existing DP random forest methods, yielding improved privacy-utility trade-offs for practical ε values.

Significance. If the error bound holds with practical constants and the empirical gains are robust, the work would meaningfully advance differentially private learning on tabular data by reducing the utility gap with non-private random forests. The hierarchical heavy-hitter primitive could have independent value in other tree-structured or hierarchical DP settings.

major comments (2)

[§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.
[Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.

minor comments (2)

[Abstract and §4] Notation for the heavy-hitter error bound uses O_{ε,δ} without defining the hidden factors; a brief remark on dependence on branching factor would improve clarity.
[§4.3] The manuscript does not discuss how the pruning threshold interacts with the heavy-hitter error; a short paragraph relating these parameters would help readers reproduce the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our submission. We address each major comment below and describe the revisions we plan to incorporate.

read point-by-point responses

Referee: [§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.

Authors: We thank the referee for this observation. Section 4.1 derives the O_{ε,δ}(√log h) bound for a single invocation of the heavy-hitter detector, where the leading constants depend on ε, δ, the fixed branching factor, and domain size but are independent of height h. The full tree uses privacy composition (via the moments accountant) to allocate a fixed total budget across nodes; the per-node ε is scaled inversely with the number of invocations. We will add a clarifying paragraph in §4.1 that explicitly states the independence of constants from h, provides the composition scaling, and includes a short calculation showing the effective error remains practical for the deeper trees used in our experiments. revision: partial
Referee: [Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.

Authors: We agree that variability measures are needed to support the claim of consistent outperformance. In the revision we will rerun all experiments over 10 random seeds, report mean accuracy with standard deviations in Table 3, and add paired statistical significance tests (with p-values) in §5.2 to quantify the reliability of the improvements over baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity: novel heavy-hitter algorithm and error bound presented as independent contribution

full rationale

The paper introduces Lumberjack via a novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data whose error scales as O_{ε,δ}(√log h). This scaling is stated as a direct property of the new algorithm and is used to justify deeper trees and improved utility. No equations, derivations, or self-citations are shown that reduce the bound or the resulting privacy-utility gains to a fitted parameter, a prior result by the same authors, or an ansatz smuggled in by citation. The central claims rest on the algorithmic construction itself rather than on any self-referential renaming or load-bearing self-citation chain, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5741 in / 1127 out tokens · 49576 ms · 2026-05-22T06:46:04.474848+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data, whose error is O_{ε,δ}(√log h) for trees of height h

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

[1]

URLhttps://doi.org/10.1613/jair.1.14649

doi: 10.1613/JAIR.1.14649. URLhttps://doi.org/10.1613/jair.1.14649. Ryan Rogers and Thomas Steinke. A better privacy analysis of the exponential mechanism. DifferentialPrivacy.org, 07 2021. https://differentialprivacy.org/ exponential-mechanism-bounded-range/. Sini Suihkonen. Differential privacy applied to random forest classification. Master’s thesis, U...

work page doi:10.1613/jair.1.14649 2021
[2]

URL https://doi.org/10.1371/journal.pone

doi: 10.1371/journal.pone.0301541. URL https://doi.org/10.1371/journal.pone. 0301541. Daniël V os, Jelle V os, Tianyu Li, Zekeriya Erkin, and Sicco Verwer. Differentially-private decision trees and provable robustness to data poisoning.arXiv preprint arXiv:2305.15394, 2023. 13 Arjun Wilkins, Daniel Kifer, Danfeng Zhang, and Brian Karrer. Exact privacy ana...

work page doi:10.1371/journal.pone.0301541 2023
[3]

one vs. all

The accuracy on Adult roughly matches the plot from [Suihkonen, 2023]. We did not change the hyperparameters for the Folktables tasks except for only considering "one vs. all" splits on categorical attributes. This was required for runtime purposes. The choice of this parameter affects the performance on these datasets, but extensive hyperparameter tuning...

work page 2023
[4]

This data-dependent control flow in the construction of the tree violates DP

Patil and Singh [2014] deterministically stop splitting when all samples in a node belong to the same class (Section IV , subsection D, point 2). This data-dependent control flow in the construction of the tree violates DP. Additionally, they do not properly account for bootstrapping which results in an inaccurate privacy guarantee (Algorithm 1, step 3)

work page 2014
[5]

Fletcher and Islam [2015a] calculate the local sensitivity based on noisy values instead of using data-independent global sensitivity (Algorithm 1, line 17)

work page
[6]

This is heuristically better than using the data points themselves, but it still breaks DP

Li and Li [2017] use the midpoint between adjacent data points as candidates when splitting continuous features (Section III, Subsection D, Equation 5). This is heuristically better than using the data points themselves, but it still breaks DP. They also do not properly account for bootstrapping (Step 2a of the algorithm in Section III, Subsection B)

work page 2017
[7]

The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result

Hou et al. [2019] use a deterministic data-dependent stopping criterion in the splitting algorithm (Algorithm 1, Line 5). Quoting directly from the paper: "The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result"

work page 2019
[8]

[2019] have data-dependent stopping conditions in their tree-building process

Xin et al. [2019] have data-dependent stopping conditions in their tree-building process. They do not split nodes with 10 or fewer data points (Algorithm 2, Line 1), and they deterministically stop splitting pure nodes (Algorithm 2, line 5)

work page 2019
[9]

[2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10)

Guan et al. [2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10). The candidate set differs between neighboring datasets which breaks DP. Their implementation also deterministically does not split on pure nodes (Algorithm 1, line 4)

work page 2020
[10]

[2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

Zhang et al. [2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

work page 2021
[11]

They inspect all possible subsets when splitting categorical features

Consul and William [2021] use data-dependent sensitivity in both the splits (Equation 6) and the leaf nodes in the case of regression trees (Section 3, paragraphEstimating leaf node parameters).6 6Incidentally, we note that we could not use this algorithm as a baseline for computational reasons. They inspect all possible subsets when splitting categorical...

work page 2021
[12]

[2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4)

Liu et al. [2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4). In the second part of their algorithm they sample points with weights based on misclassification without accounting for it (Algorithm 3, Line 9). This leads to a higher privacy loss for outlier data points

work page 2023
[13]

[2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

V os et al. [2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

work page 2023
[14]

lies between

While we use Suihkonen [2023] as a baseline for greedy forests in our experiments, the code does contain some privacy violations including data-dependent stopping criteria for splitting (line 496 in the implementation), deterministically labeling pure leaf nodes (line 492 in the implementation) and sensitivity miscalibration (line 242 in the implementatio...

work page 2023

[1] [1]

URLhttps://doi.org/10.1613/jair.1.14649

doi: 10.1613/JAIR.1.14649. URLhttps://doi.org/10.1613/jair.1.14649. Ryan Rogers and Thomas Steinke. A better privacy analysis of the exponential mechanism. DifferentialPrivacy.org, 07 2021. https://differentialprivacy.org/ exponential-mechanism-bounded-range/. Sini Suihkonen. Differential privacy applied to random forest classification. Master’s thesis, U...

work page doi:10.1613/jair.1.14649 2021

[2] [2]

URL https://doi.org/10.1371/journal.pone

doi: 10.1371/journal.pone.0301541. URL https://doi.org/10.1371/journal.pone. 0301541. Daniël V os, Jelle V os, Tianyu Li, Zekeriya Erkin, and Sicco Verwer. Differentially-private decision trees and provable robustness to data poisoning.arXiv preprint arXiv:2305.15394, 2023. 13 Arjun Wilkins, Daniel Kifer, Danfeng Zhang, and Brian Karrer. Exact privacy ana...

work page doi:10.1371/journal.pone.0301541 2023

[3] [3]

one vs. all

The accuracy on Adult roughly matches the plot from [Suihkonen, 2023]. We did not change the hyperparameters for the Folktables tasks except for only considering "one vs. all" splits on categorical attributes. This was required for runtime purposes. The choice of this parameter affects the performance on these datasets, but extensive hyperparameter tuning...

work page 2023

[4] [4]

This data-dependent control flow in the construction of the tree violates DP

Patil and Singh [2014] deterministically stop splitting when all samples in a node belong to the same class (Section IV , subsection D, point 2). This data-dependent control flow in the construction of the tree violates DP. Additionally, they do not properly account for bootstrapping which results in an inaccurate privacy guarantee (Algorithm 1, step 3)

work page 2014

[5] [5]

Fletcher and Islam [2015a] calculate the local sensitivity based on noisy values instead of using data-independent global sensitivity (Algorithm 1, line 17)

work page

[6] [6]

This is heuristically better than using the data points themselves, but it still breaks DP

Li and Li [2017] use the midpoint between adjacent data points as candidates when splitting continuous features (Section III, Subsection D, Equation 5). This is heuristically better than using the data points themselves, but it still breaks DP. They also do not properly account for bootstrapping (Step 2a of the algorithm in Section III, Subsection B)

work page 2017

[7] [7]

The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result

Hou et al. [2019] use a deterministic data-dependent stopping criterion in the splitting algorithm (Algorithm 1, Line 5). Quoting directly from the paper: "The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result"

work page 2019

[8] [8]

[2019] have data-dependent stopping conditions in their tree-building process

Xin et al. [2019] have data-dependent stopping conditions in their tree-building process. They do not split nodes with 10 or fewer data points (Algorithm 2, Line 1), and they deterministically stop splitting pure nodes (Algorithm 2, line 5)

work page 2019

[9] [9]

[2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10)

Guan et al. [2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10). The candidate set differs between neighboring datasets which breaks DP. Their implementation also deterministically does not split on pure nodes (Algorithm 1, line 4)

work page 2020

[10] [10]

[2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

Zhang et al. [2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

work page 2021

[11] [11]

They inspect all possible subsets when splitting categorical features

Consul and William [2021] use data-dependent sensitivity in both the splits (Equation 6) and the leaf nodes in the case of regression trees (Section 3, paragraphEstimating leaf node parameters).6 6Incidentally, we note that we could not use this algorithm as a baseline for computational reasons. They inspect all possible subsets when splitting categorical...

work page 2021

[12] [12]

[2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4)

Liu et al. [2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4). In the second part of their algorithm they sample points with weights based on misclassification without accounting for it (Algorithm 3, Line 9). This leads to a higher privacy loss for outlier data points

work page 2023

[13] [13]

[2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

V os et al. [2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

work page 2023

[14] [14]

lies between

While we use Suihkonen [2023] as a baseline for greedy forests in our experiments, the code does contain some privacy violations including data-dependent stopping criteria for splitting (line 496 in the implementation), deterministically labeling pure leaf nodes (line 492 in the implementation) and sensitivity miscalibration (line 242 in the implementatio...

work page 2023