pith. sign in

arxiv: 2605.22756 · v1 · pith:NLKFZT33new · submitted 2026-05-21 · 💻 cs.LG · cs.DS

Lumberjack: Better Differentially Private Random Forests through Heavy Hitter Detection in Trees

Pith reviewed 2026-05-22 06:46 UTC · model grok-4.3

classification 💻 cs.LG cs.DS
keywords differentially private random forestsheavy hitter detectionhierarchical datadifferential privacydecision tree pruningprivacy-preserving machine learningtabular data
0
0 comments X

The pith

Lumberjack improves differentially private random forests by growing deep trees and pruning them with a new heavy hitter detector for hierarchical data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Lumberjack as a way to train random forests on sensitive data while adding differential privacy. It works by first building large trees and then using a new private algorithm to find and keep only the nodes that contain enough records. The key is a heavy hitter detector whose added error grows only with the square root of the log of tree height, which keeps noise manageable even for deeper trees. This leads to models that capture more patterns than earlier private random forest methods. Experiments show better accuracy at the same privacy levels on standard datasets.

Core claim

Lumberjack constructs large random decision trees and applies aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data with error O_{ε,δ}(√log h) enables significantly deeper trees than prior work, leading to improved expressiveness under privacy constraints and higher utility on benchmark datasets.

What carries the argument

A novel (ε,δ)-differentially private heavy hitter detection algorithm for hierarchical data that identifies populated nodes with error scaling as the square root of the logarithm of tree height.

If this is right

  • Deeper trees become practical under differential privacy without eroding utility.
  • The privacy-utility tradeoff improves noticeably for budgets used in real applications.
  • Random forests can be applied more readily to sensitive tabular data tasks.
  • The heavy hitter technique may extend to other tree-structured private learning problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same detector could be tested on gradient boosted trees or other ensembles to check for similar gains.
  • Trying the method on datasets with higher dimensionality or more missing values would reveal its limits.
  • Combining the pruning step with other privacy tools like secure aggregation might yield further improvements.

Load-bearing premise

The error from the heavy hitter detector grows slowly enough with tree depth that deeper trees still produce higher accuracy after the private pruning step.

What would settle it

If Lumberjack shows no accuracy improvement over earlier differentially private random forest methods when both are tested on the same benchmark tabular datasets at a fixed privacy budget such as ε=1, the central performance claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.22756 by Aur\'elien Bellet, Christian Janos Lebeda, David Erb, Tudor Cebere.

Figure 2
Figure 2. Figure 2: Example of a recursive iteration of our heavy hitter detector (Algorithm 1). [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Test accuracy for the Adult and Folktables ACSIncome and ACSEmployment datasets. We [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Experimental results for two additional Folktables classification tasks. [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗
read the original abstract

Random forests are widely used in fields involving sensitive tabular data, but existing approaches to enforcing differential privacy (DP) typically degrade performance to the point of impracticality. In this paper, we introduce Lumberjack, a differentially private random forest algorithm that achieves substantially higher utility by constructing large random decision trees and then applying aggressive, privacy-preserving pruning to retain only sufficiently populated nodes. A key component of our approach is a novel $(\varepsilon,\delta)$-DP heavy hitter detection algorithm for hierarchical data, whose error is $O_{\varepsilon,\delta}(\sqrt{\log h})$ for trees of height $h$ and may be of independent interest. This favorable scaling enables the use of significantly deeper trees than in prior work, leading to improved expressiveness under privacy constraints. Our empirical evaluation on benchmark datasets shows that Lumberjack consistently outperforms prior DP random forest methods, establishing a new state of the art. In particular, our approach yields substantial improvements in the privacy-utility trade-off for practical privacy budgets. Our findings suggest that carefully designed DP random forests can close much of the utility gap, highlighting a promising and underexplored direction for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Lumberjack, a differentially private random forest algorithm that constructs large decision trees and applies aggressive privacy-preserving pruning via a novel (ε,δ)-DP heavy-hitter detection algorithm for hierarchical data. The key technical contribution is an error bound of O_{ε,δ}(√log h) for trees of height h, which is claimed to enable significantly deeper trees than prior work while preserving privacy. Empirical evaluation on benchmark datasets is reported to show consistent outperformance over existing DP random forest methods, yielding improved privacy-utility trade-offs for practical ε values.

Significance. If the error bound holds with practical constants and the empirical gains are robust, the work would meaningfully advance differentially private learning on tabular data by reducing the utility gap with non-private random forests. The hierarchical heavy-hitter primitive could have independent value in other tree-structured or hierarchical DP settings.

major comments (2)
  1. [§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.
  2. [Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.
minor comments (2)
  1. [Abstract and §4] Notation for the heavy-hitter error bound uses O_{ε,δ} without defining the hidden factors; a brief remark on dependence on branching factor would improve clarity.
  2. [§4.3] The manuscript does not discuss how the pruning threshold interacts with the heavy-hitter error; a short paragraph relating these parameters would help readers reproduce the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our submission. We address each major comment below and describe the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [§4.1] §4.1, the derivation of the O_{ε,δ}(√log h) error bound: the analysis assumes a fixed branching factor and domain size, but does not quantify how the leading constants or the per-node invocation of the detector across a full tree scale with height h; without this, it is unclear whether the bound remains small enough to support the claimed deeper trees without eroding pruning quality under fixed privacy budgets.

    Authors: We thank the referee for this observation. Section 4.1 derives the O_{ε,δ}(√log h) bound for a single invocation of the heavy-hitter detector, where the leading constants depend on ε, δ, the fixed branching factor, and domain size but are independent of height h. The full tree uses privacy composition (via the moments accountant) to allocate a fixed total budget across nodes; the per-node ε is scaled inversely with the number of invocations. We will add a clarifying paragraph in §4.1 that explicitly states the independence of constants from h, provides the composition scaling, and includes a short calculation showing the effective error remains practical for the deeper trees used in our experiments. revision: partial

  2. Referee: [Table 3 and §5.2] Table 3 and §5.2: the reported accuracy improvements over baselines lack error bars, standard deviations across random seeds, or statistical significance tests; given that the central claim is 'consistent' outperformance establishing a new state of the art, the absence of these makes it impossible to assess whether the gains are reliable or could be explained by variance.

    Authors: We agree that variability measures are needed to support the claim of consistent outperformance. In the revision we will rerun all experiments over 10 random seeds, report mean accuracy with standard deviations in Table 3, and add paired statistical significance tests (with p-values) in §5.2 to quantify the reliability of the improvements over baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity: novel heavy-hitter algorithm and error bound presented as independent contribution

full rationale

The paper introduces Lumberjack via a novel (ε,δ)-DP heavy hitter detection algorithm for hierarchical data whose error scales as O_{ε,δ}(√log h). This scaling is stated as a direct property of the new algorithm and is used to justify deeper trees and improved utility. No equations, derivations, or self-citations are shown that reduce the bound or the resulting privacy-utility gains to a fitted parameter, a prior result by the same authors, or an ansatz smuggled in by citation. The central claims rest on the algorithmic construction itself rather than on any self-referential renaming or load-bearing self-citation chain, rendering the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5741 in / 1127 out tokens · 49576 ms · 2026-05-22T06:46:04.474848+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages

  1. [1]

    URLhttps://doi.org/10.1613/jair.1.14649

    doi: 10.1613/JAIR.1.14649. URLhttps://doi.org/10.1613/jair.1.14649. Ryan Rogers and Thomas Steinke. A better privacy analysis of the exponential mechanism. DifferentialPrivacy.org, 07 2021. https://differentialprivacy.org/ exponential-mechanism-bounded-range/. Sini Suihkonen. Differential privacy applied to random forest classification. Master’s thesis, U...

  2. [2]

    URL https://doi.org/10.1371/journal.pone

    doi: 10.1371/journal.pone.0301541. URL https://doi.org/10.1371/journal.pone. 0301541. Daniël V os, Jelle V os, Tianyu Li, Zekeriya Erkin, and Sicco Verwer. Differentially-private decision trees and provable robustness to data poisoning.arXiv preprint arXiv:2305.15394, 2023. 13 Arjun Wilkins, Daniel Kifer, Danfeng Zhang, and Brian Karrer. Exact privacy ana...

  3. [3]

    one vs. all

    The accuracy on Adult roughly matches the plot from [Suihkonen, 2023]. We did not change the hyperparameters for the Folktables tasks except for only considering "one vs. all" splits on categorical attributes. This was required for runtime purposes. The choice of this parameter affects the performance on these datasets, but extensive hyperparameter tuning...

  4. [4]

    This data-dependent control flow in the construction of the tree violates DP

    Patil and Singh [2014] deterministically stop splitting when all samples in a node belong to the same class (Section IV , subsection D, point 2). This data-dependent control flow in the construction of the tree violates DP. Additionally, they do not properly account for bootstrapping which results in an inaccurate privacy guarantee (Algorithm 1, step 3)

  5. [5]

    Fletcher and Islam [2015a] calculate the local sensitivity based on noisy values instead of using data-independent global sensitivity (Algorithm 1, line 17)

  6. [6]

    This is heuristically better than using the data points themselves, but it still breaks DP

    Li and Li [2017] use the midpoint between adjacent data points as candidates when splitting continuous features (Section III, Subsection D, Equation 5). This is heuristically better than using the data points themselves, but it still breaks DP. They also do not properly account for bootstrapping (Step 2a of the algorithm in Section III, Subsection B)

  7. [7]

    The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result

    Hou et al. [2019] use a deterministic data-dependent stopping criterion in the splitting algorithm (Algorithm 1, Line 5). Quoting directly from the paper: "The condition that the decision tree stops growing includes: 1) All samples in the node have the same classification result"

  8. [8]

    [2019] have data-dependent stopping conditions in their tree-building process

    Xin et al. [2019] have data-dependent stopping conditions in their tree-building process. They do not split nodes with 10 or fewer data points (Algorithm 2, Line 1), and they deterministically stop splitting pure nodes (Algorithm 2, line 5)

  9. [9]

    [2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10)

    Guan et al. [2020] sort values of continuous attributes and use the average of buckets with 5 data points as candidates for the splitpoint (Algorithm 1, Line 10). The candidate set differs between neighboring datasets which breaks DP. Their implementation also deterministically does not split on pure nodes (Algorithm 1, line 4)

  10. [10]

    [2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

    Zhang et al. [2021] deterministically stop splitting when all samples in a node belong to the same class (Algorithms 1, 2, and 3, see termination condition)

  11. [11]

    They inspect all possible subsets when splitting categorical features

    Consul and William [2021] use data-dependent sensitivity in both the splits (Equation 6) and the leaf nodes in the case of regression trees (Section 3, paragraphEstimating leaf node parameters).6 6Incidentally, we note that we could not use this algorithm as a baseline for computational reasons. They inspect all possible subsets when splitting categorical...

  12. [12]

    [2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4)

    Liu et al. [2023] deterministically stop splitting on pure nodes (Algorithm 2, Line 4). In the second part of their algorithm they sample points with weights based on misclassification without accounting for it (Algorithm 3, Line 9). This leads to a higher privacy loss for outlier data points

  13. [13]

    [2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

    V os et al. [2023] (who focus on the setting of a single decision tree rather than building a forest) deterministically stop splitting at pure nodes (Algorithm 1, line 7)

  14. [14]

    lies between

    While we use Suihkonen [2023] as a baseline for greedy forests in our experiments, the code does contain some privacy violations including data-dependent stopping criteria for splitting (line 496 in the implementation), deterministically labeling pure leaf nodes (line 492 in the implementation) and sensitivity miscalibration (line 242 in the implementatio...