Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees
Pith reviewed 2026-05-22 06:46 UTC · model grok-4.3
The pith
Lumberjack improves differentially private random forests by growing deep trees and pruning them with a new heavy hitter detector for hierarchical data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lumberjack constructs large random decision trees and applies aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data with error O_{ε,δ}(√log h) enables significantly deeper trees than prior work, leading to improved expressiveness under privacy constraints and higher utility on benchmark datasets.
What carries the argument
A novel (ε,δ)-differentially private heavy hitter detection algorithm for hierarchical data that identifies populated nodes with error scaling as the square root of the logarithm of tree height.
If this is right
- Deeper trees become practical under differential privacy without eroding utility.
- The privacy-utility tradeoff improves noticeably for budgets used in real applications.
- Random forests can be applied more readily to sensitive tabular data tasks.
- The heavy hitter technique may extend to other tree-structured private learning problems.
Where Pith is reading between the lines
- The same detector could be tested on gradient boosted trees or other ensembles to check for similar gains.
- Trying the method on datasets with higher dimensionality or more missing values would reveal its limits.
- Combining the pruning step with other privacy tools like secure aggregation might yield further improvements.
Load-bearing premise
The error from the heavy hitter detector grows slowly enough with tree depth that deeper trees still produce higher accuracy after the private pruning step.
What would settle it
If Lumberjack shows no accuracy improvement over earlier differentially private random forest methods when both are tested on the same benchmark tabular datasets at a fixed privacy budget such as ε=1, the central performance claim would be falsified.
Figures
read the original abstract
Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Lumberjack, a differentially private random forest algorithm that constructs large decision trees and applies aggressive privacy-preserving pruning via a novel (ε,δ)-DP heavy-hitter detection algorithm for hierarchical data. The key technical contribution is an error bound of O_{ε,δ}(√log h) for trees of height h, which is claimed to enable significantly deeper trees than prior work while preserving privacy. Empirical evaluation on benchmark datasets is reported to show consistent outperformance over existing DP random forest methods, yielding improved privacy-utility trade-offs for practical ε values.
Significance. If the error bound holds with practical constants and the empirical gains are robust, the work would meaningfully advance differentially private learning on tabular data by reducing the utility gap with non-private random forests. The hierarchical heavy-hitter primitive could have independent value in other tree-structured or hierarchical DP settings.
major comments (2)
- [§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.
- [Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.
minor comments (2)
- [Abstract and §4] Notation for the heavy-hitter error bound uses O_{ε,δ} without defining the hidden factors; a brief remark on dependence on branching factor would improve clarity.
- [§4.3] The manuscript does not discuss how the pruning threshold interacts with the heavy-hitter error; a short paragraph relating these parameters would help readers reproduce the method.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our submission. We address each major comment below and describe the revisions we plan to incorporate.
read point-by-point responses
-
Referee: [§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.
Authors: We thank the referee for this observation. Section 4.1 derives the O_{ε,δ}(√log h) bound for a single invocation of the heavy-hitter detector, where the leading constants depend on ε, δ, the fixed branching factor, and domain size but are independent of height h. The full tree uses privacy composition (via the moments accountant) to allocate a fixed total budget across nodes; the per-node ε is scaled inversely with the number of invocations. We will add a clarifying paragraph in §4.1 that explicitly states the independence of constants from h, provides the composition scaling, and includes a short calculation showing the effective error remains practical for the deeper trees used in our experiments. revision: partial
-
Referee: [Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.
Authors: We agree that variability measures are needed to support the claim of consistent outperformance. In the revision we will rerun all experiments over 10 random seeds, report mean accuracy with standard deviations in Table 3, and add paired statistical significance tests (with p-values) in §5.2 to quantify the reliability of the improvements over baselines. revision: yes
Circularity Check
No significant circularity: novel heavy-hitter algorithm and error bound presented as independent contribution
full rationale
The paper introduces Lumberjack via a novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data whose error scales as O_{ε,δ}(√log h). This scaling is stated as a direct property of the new algorithm and is used to justify deeper trees and improved utility. No equations, derivations, or self-citations are shown that reduce the bound or the resulting privacy-utility gains to a fitted parameter, a prior result by the same authors, or an ansatz smuggled in by citation. The central claims rest on the algorithmic construction itself rather than on any self-referential renaming or load-bearing self-citation chain, rendering the derivation self-contained.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data, whose error is O_{ε,δ}(√log h) for trees of height h
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
URLhttps://doi.org/10.1613/jair.1.14649
doi: 10.1613/JAIR.1.14649. URLhttps://doi.org/10.1613/jair.1.14649. Ryan Rogers and Thomas Steinke. A better privacy analysis of the exponential mechanism. DifferentialPrivacy.org, 07 2021. https://differentialprivacy.org/ exponential-mechanism-bounded-range/. Sini Suihkonen. Differential privacy applied to random forest classification. Master’s thesis, U...
-
[2]
URL https://doi.org/10.1371/journal.pone
doi: 10.1371/journal.pone.0301541. URL https://doi.org/10.1371/journal.pone. 0301541. Daniël V os, Jelle V os, Tianyu Li, Zekeriya Erkin, and Sicco Verwer. Differentially-private decision trees and provable robustness to data poisoning.arXiv preprint arXiv:2305.15394, 2023. 13 Arjun Wilkins, Daniel Kifer, Danfeng Zhang, and Brian Karrer. Exact privacy ana...
-
[3]
The accuracy on Adult roughly matches the plot from [Suihkonen, 2023]. We did not change the hyperparameters for the Folktables tasks except for only considering "one vs. all" splits on categorical attributes. This was required for runtime purposes. The choice of this parameter affects the performance on these datasets, but extensive hyperparameter tuning...
work page 2023
-
[4]
This data-dependent control flow in the construction of the tree violates DP
Patil and Singh [2014] deterministically stop splitting when all samples in a node belong to the same class (Section IV , subsection D, point 2). This data-dependent control flow in the construction of the tree violates DP. Additionally, they do not properly account for bootstrapping which results in an inaccurate privacy guarantee (Algorithm 1, step 3)
work page 2014
-
[5]
Fletcher and Islam [2015a] calculate the local sensitivity based on noisy values instead of using data-independent global sensitivity (Algorithm 1, line 17)
-
[6]
This is heuristically better than using the data points themselves, but it still breaks DP
Li and Li [2017] use the midpoint between adjacent data points as candidates when splitting continuous features (Section III, Subsection D, Equation 5). This is heuristically better than using the data points themselves, but it still breaks DP. They also do not properly account for bootstrapping (Step 2a of the algorithm in Section III, Subsection B)
work page 2017
-
[7]
Hou et al. [2019] use a deterministic data-dependent stopping criterion in the splitting algorithm (Algorithm 1, Line 5). Quoting directly from the paper: "The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result"
work page 2019
-
[8]
[2019] have data-dependent stopping conditions in their tree-building process
Xin et al. [2019] have data-dependent stopping conditions in their tree-building process. They do not split nodes with 10 or fewer data points (Algorithm 2, Line 1), and they deterministically stop splitting pure nodes (Algorithm 2, line 5)
work page 2019
-
[9]
Guan et al. [2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10). The candidate set differs between neighboring datasets which breaks DP. Their implementation also deterministically does not split on pure nodes (Algorithm 1, line 4)
work page 2020
-
[10]
Zhang et al. [2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)
work page 2021
-
[11]
They inspect all possible subsets when splitting categorical features
Consul and William [2021] use data-dependent sensitivity in both the splits (Equation 6) and the leaf nodes in the case of regression trees (Section 3, paragraphEstimating leaf node parameters).6 6Incidentally, we note that we could not use this algorithm as a baseline for computational reasons. They inspect all possible subsets when splitting categorical...
work page 2021
-
[12]
[2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4)
Liu et al. [2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4). In the second part of their algorithm they sample points with weights based on misclassification without accounting for it (Algorithm 3, Line 9). This leads to a higher privacy loss for outlier data points
work page 2023
-
[13]
V os et al. [2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)
work page 2023
-
[14]
While we use Suihkonen [2023] as a baseline for greedy forests in our experiments, the code does contain some privacy violations including data-dependent stopping criteria for splitting (line 496 in the implementation), deterministically labeling pure leaf nodes (line 492 in the implementation) and sensitivity miscalibration (line 242 in the implementatio...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.