Least Squares Estimation For Hierarchical Data

Pavel Zhuravlev; Ryan Cumings-Menon

arxiv: 2404.13164 · v3 · submitted 2024-04-19 · stat.CO · cs.CR

Pith reviewed 2026-05-24 02:16 UTC · model grok-4.3

keywords: least squares estimation, hierarchical data, generalized least squares, census disclosure avoidance, noisy measurements, confidence intervals, geographic hierarchy, variance estimation

The pith

An algorithm leveraging geographic hierarchy computes generalized least squares estimates for high-dimensional census data without the full covariance matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an algorithm that uses the hierarchical structure of geographic levels in census data to compute very high dimensional least squares estimates efficiently. It proves that the output of this algorithm equals the generalized least squares estimator. The work also shows how to obtain variances of linear functions of the estimator and demonstrates confidence interval computation in a numerical experiment. An experimental data product applies the estimator to the 2020 Census noisy measurements at multiple geographic levels.

Core claim

The algorithm leverages the hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, the paper shows that this algorithm's output is equal to the generalized least squares estimator, describes how to find the variance of linear functions of this estimator, and provides a numerical experiment in which confidence intervals of tabulations are computed based on this estimator.

What carries the argument

A recursive or block-wise algorithm that exploits the hierarchy of nation, states, counties, tracts, and blocks to compute the least squares solution without the full dense covariance matrix.

If this is right

The generalized least squares estimator becomes computable for very high dimensions using only the hierarchical structure.
Variances of arbitrary linear functions of the estimator can be obtained directly from the algorithm.
Confidence intervals for population tabulations can be derived from the noisy measurements.
An experimental data product supplies the necessary inputs for all tabulations in the 2020 Redistricting Data File at U.S., state, county, and tract levels.
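The variance and confidence-interval consequences listed above can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it forms the small dense matrices directly, which is only feasible at toy scale. The design matrix `X`, the noise variances, the measurements `y`, and the helper `linear_ci` are all hypothetical, and the interval assumes approximately normal noise with known variances.

```python
import numpy as np

# Hypothetical toy hierarchy: one parent total and three child counts.
# Each row of X marks which child cells a measurement adds up.
X = np.array([
    [1, 1, 1],   # parent: sum of all three children
    [1, 0, 0],   # child 1
    [0, 1, 0],   # child 2
    [0, 0, 1],   # child 3
], dtype=float)
noise_var = np.array([4.0, 1.0, 2.0, 1.0])   # assumed known noise variances
y = np.array([305.0, 99.0, 128.0, 71.0])     # made-up noisy measurements

W = np.diag(1.0 / noise_var)                 # inverse of the diagonal noise covariance
cov_beta = np.linalg.inv(X.T @ W @ X)        # covariance of the GLS estimator
beta_hat = cov_beta @ X.T @ W @ y            # GLS estimate of the child counts

def linear_ci(a, z=1.96):
    """Estimate, standard error and approximate 95% interval for a @ beta."""
    a = np.asarray(a, dtype=float)
    est = a @ beta_hat
    se = np.sqrt(a @ cov_beta @ a)           # Var(a' beta_hat) = a' Cov(beta_hat) a
    return est, se, (est - z * se, est + z * se)

# Linear function of interest: the total over all three children.
est, se, ci = linear_ci([1, 1, 1])
print(est, se, ci)
```

In this example the total is measured twice, once directly (variance 4) and once as the sum of the children (variance 1 + 2 + 1 = 4), so the combined estimate has variance 2, half that of either source alone.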

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structure-exploiting approach may apply to any estimation problem whose covariance exhibits a nested hierarchy.
Data users gain the ability to quantify uncertainty in census tabulations using only the publicly released noisy measurements.
The method could support repeated estimation as new noisy measurements become available without recomputing from scratch.

Load-bearing premise

The hierarchical geographic levels permit an efficient recursive or block-wise computation of the least-squares solution without requiring the full dense covariance matrix.

What would settle it

A side-by-side computation on a small hierarchical dataset where the algorithm output differs from the generalized least squares estimator obtained by direct matrix methods.
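A side-by-side check of this kind can be sketched as follows. The tree-based routine here is a generic bottom-up/top-down inverse-variance scheme for scalar counts with independent noise, used as a stand-in; it is not claimed to be the paper's algorithm. The tree, the variances, the true counts, and the function names are all hypothetical.

```python
import numpy as np

# Hypothetical three-level tree: node -> children (leaves have no children).
children = {
    "US": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1", "B2", "B3"],
    "A1": [], "A2": [], "B1": [], "B2": [], "B3": [],
}
nodes = list(children)
leaves = [n for n in nodes if not children[n]]

def leaves_under(n):
    """Leaves contained in the subtree rooted at n."""
    if not children[n]:
        return [n]
    return [leaf for c in children[n] for leaf in leaves_under(c)]

rng = np.random.default_rng(0)
truth = dict(zip(leaves, [120.0, 80.0, 40.0, 60.0, 100.0]))
var = dict(zip(nodes, [4.0, 2.0, 3.0, 1.0, 1.5, 2.5, 1.0, 2.0]))
# Noisy measurement at every node: true subtree total plus mean-zero noise.
y = {n: sum(truth[l] for l in leaves_under(n)) + rng.normal(0.0, np.sqrt(var[n]))
     for n in nodes}

def direct_gls():
    """GLS by dense matrix algebra, returning fitted totals for every node."""
    X = np.array([[1.0 if l in leaves_under(n) else 0.0 for l in leaves]
                  for n in nodes])
    W = np.diag([1.0 / var[n] for n in nodes])
    yv = np.array([y[n] for n in nodes])
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ yv)
    return dict(zip(nodes, X @ beta))

def two_pass(root="US"):
    """Tree-based estimate that never forms the full design or covariance matrix."""
    z, v = {}, {}                       # subtree-only estimate and its variance

    def up(n):
        if not children[n]:
            z[n], v[n] = y[n], var[n]
            return
        for c in children[n]:
            up(c)
        s = sum(z[c] for c in children[n])
        vs = sum(v[c] for c in children[n])
        # Inverse-variance average of the node's own measurement and the child sum.
        v[n] = 1.0 / (1.0 / var[n] + 1.0 / vs)
        z[n] = v[n] * (y[n] / var[n] + s / vs)

    est = {}

    def down(n):
        if not children[n]:
            return
        s = sum(z[c] for c in children[n])
        vs = sum(v[c] for c in children[n])
        for c in children[n]:
            # Spread the parent's discrepancy in proportion to each child's variance.
            est[c] = z[c] + v[c] / vs * (est[n] - s)
            down(c)

    up(root)
    est[root] = z[root]
    down(root)
    return est

direct, fast = direct_gls(), two_pass()
print(max(abs(direct[n] - fast[n]) for n in nodes))
```

Under these assumptions the two routes agree to floating-point precision. A mismatch on an equally small example would be the kind of evidence described above.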

Figures

Figures reproduced from arXiv: 2404.13164 by Pavel Zhuravlev, Ryan Cumings-Menon.

**Figure 1.** This is a graphical depiction of the 2020 tabulation US spine, as defined by the Geography Division of the Census Bureau. The right column provides the geolevel names and indices. The left side of the figure provides an example of a path from the US geounit to a block geounit. The census geographic codes for the geounits on this path are provided in parentheses.

**Figure 2.** In this section we derive Cov(β̃(u), β̃(v)) for the three types of adjacency relationships between u, v ∈ G represented in the subplots above: u and v are siblings (A); v is an ancestor vertex of u (B); and cases in which neither u nor v is a direct descendant of the other (C). Dotted edges are used to denote the portions of ω(u, v) that include an arbitrary number of vertices.

**Figure 3.** Q-Q Plot: The points in this plot can be viewed as points on a parametric curve, parameterized by the quantile q ∈ (0, 1). The horizontal axis provides the q quantile of the standard normal distribution and the vertical axis provides the sample quantile of the Z-scores of the total population estimates for tabulation block groups. The line provides the identity function, f(x) = x.

**Figure 4.** The sub-graph G_g considered in Example 2, for g ∈ Level(G, L − 2).

**Figure 5.** The rooted tree considered in Example 3.

Original abstract

The U.S. Census Bureau's 2020 Disclosure Avoidance System (DAS) bases its output on noisy measurements, which are population tabulations added to realizations of mean-zero random variables. These noisy measurements are observed for a set of hierarchical geographic levels, e.g., the U.S. as a whole, states, counties, census tracts, and census blocks. The Census Bureau released the noisy measurements generated in the DAS executions for the two primary 2020 Census data products, in part to allow data users to assess uncertainty in 2020 Census tabulations introduced by disclosure avoidance. This paper describes an algorithm that can leverage the hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, we show that this algorithm's output is equal to the generalized least squares estimator, describe how to find the variance of linear functions of this estimator, and provide a numerical experiment in which we compute confidence intervals of tabulations based on this estimator. We also describe an accompanying Census Bureau experimental data product that applies this estimator to the publicly available noisy measurements to provide data users with the inputs required to derive confidence intervals for all tabulations that were included in the 2020 Redistricting Data File, for the U.S., state, county, and census tract geographic levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable algorithm for GLS on the 2020 DAS hierarchical noisy measurements plus a released data product for CIs on tabulations.

This paper supplies an algorithm that computes the generalized least squares estimator for the 2020 Census noisy measurements by using the geographic hierarchy, and it comes with a released data product for confidence intervals on redistricting tabulations at nation, state, county, and tract levels. The core contribution is the concrete implementation that avoids building the full dense covariance matrix while still delivering the GLS solution, along with variance formulas for linear functions of the estimator and a numerical experiment on confidence intervals. The accompanying experimental data product applies the method to the public noisy measurements and supplies the inputs users need for their own tabulations. That combination of method plus released product is the part that stands out as new relative to prior work on multilevel least squares.

The paper does well on the practical side by making high-dimensional estimation feasible for census-scale data and by showing, after the algorithm is described, that its output equals the GLS estimator. The numerical check provides at least initial evidence that the variance formulas produce usable intervals.

The soft spots are limited. The abstract leaves the exact recursive steps and covariance assumptions implicit, so the full manuscript needs to be checked for whether the hierarchy really permits block-wise computation without missing cross-level terms. The experiment is only one case, which is fine for illustration but not a broad verification. Nothing in the description points to circularity or an internal contradiction in the normal equations.

This work is aimed at statisticians and data users who need uncertainty quantification on 2020 Census tabulations for redistricting or allocation purposes. A reader already working in statistical disclosure avoidance or official statistics will get direct value from both the algorithm and the data product.

It deserves a serious referee because the claims are externally verifiable through the released product and the equivalence to a standard estimator. I would send it to peer review.

Referee Report

0 major / 3 minor

Summary. The paper presents an algorithm that exploits the hierarchical structure of noisy measurements (nation, states, counties, tracts, blocks) from the 2020 Census DAS to compute high-dimensional least-squares estimates efficiently. It asserts that the algorithm output equals the generalized least squares (GLS) estimator, supplies formulas for the variance of linear functions of the estimator, reports a numerical experiment producing confidence intervals, and describes an accompanying experimental data product for the Redistricting Data File at U.S., state, county, and tract levels.

Significance. If the claimed equivalence to GLS holds and the variance formulas are correctly derived, the work supplies a practical, scalable route to uncertainty quantification for census tabulations that avoids explicit formation or inversion of the full dense covariance matrix. The release of an experimental data product that supplies the necessary inputs for users to form confidence intervals constitutes a direct, usable contribution to the statistical infrastructure around the 2020 Census releases.

minor comments (3)

[Abstract] The abstract states that equivalence to GLS is shown 'afterward,' but the manuscript would benefit from an explicit pointer (e.g., 'see §4, Theorem 1') immediately after the algorithm description so readers can locate the proof without searching.
[Introduction / §2] Notation for the hierarchical levels and the associated design matrices is introduced gradually; a single consolidated table or diagram early in the paper that lists the levels, their dimensions, and the corresponding blocks of the covariance structure would improve readability.
[Numerical experiment] The numerical experiment section reports confidence-interval coverage but does not state the number of Monte Carlo replications or the random seed; adding these details would make the experiment fully reproducible from the description alone.
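The kind of reproducibility detail requested in the last comment can be sketched with a small coverage simulation. This is a toy illustration rather than the paper's experiment: the design matrix, noise variances, true counts, seed, and replication count are all hypothetical, and the noise is drawn from a continuous normal distribution for simplicity.

```python
import numpy as np

SEED = 12345      # hypothetical seed, stated so the run can be repeated exactly
N_REPS = 2000     # hypothetical number of Monte Carlo replications

# Toy hierarchy: one parent total and three child counts.
X = np.array([[1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
noise_var = np.array([4.0, 1.0, 2.0, 1.0])   # assumed known noise variances
beta_true = np.array([100.0, 130.0, 70.0])   # made-up true child counts
a = np.array([1.0, 1.0, 1.0])                # target: total over the children

W = np.diag(1.0 / noise_var)
cov_beta = np.linalg.inv(X.T @ W @ X)        # covariance of the GLS estimator
gain = cov_beta @ X.T @ W                    # maps measurements to the GLS estimate
se = np.sqrt(a @ cov_beta @ a)               # standard error of a @ beta_hat

def coverage(seed=SEED, n_reps=N_REPS, z=1.96):
    """Share of replications whose nominal 95% interval contains the truth."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(noise_var), size=(n_reps, len(noise_var)))
    y = X @ beta_true + noise                # one row of measurements per replication
    est = y @ gain.T @ a                     # GLS estimate of the total, per replication
    return float(np.mean(np.abs(est - a @ beta_true) <= z * se))

print(coverage())
```

Reporting `SEED` and `N_REPS` alongside the result lets a reader rerun the simulation and obtain the same coverage figure.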

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its potential contribution to uncertainty quantification for 2020 Census data, and recommendation of minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; GLS equivalence is externally defined

The paper presents a hierarchical algorithm for least-squares estimation on noisy Census measurements, then derives that its output equals the generalized least squares estimator and provides variance formulas. This equivalence is shown after the algorithm is defined and is to an externally standard statistical target (GLS), not to any fitted parameter or self-referential quantity within the paper. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or reader's assessment; the derivation chain is self-contained against the standard GLS definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mean-zero noise model for the DAS measurements and the existence of a hierarchical nesting that permits efficient computation; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: Noisy measurements are population tabulations added to realizations of mean-zero random variables.
Stated in the first sentence of the abstract as the basis for the DAS output.
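In symbols, this assumption can be written as a standard linear model. The notation below (y for the stacked noisy measurements, X for the matrix mapping the underlying counts β to the tabulations, ε for the noise, Σ for its covariance, a for an arbitrary weight vector) is introduced here for illustration and may differ from the paper's. The estimator and variance expressions are the textbook generalized least squares formulas.

```latex
\begin{aligned}
y &= X\beta + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \operatorname{Cov}(\varepsilon) = \Sigma, \\
\hat{\beta}_{\mathrm{GLS}} &= \left(X^\top \Sigma^{-1} X\right)^{-1} X^\top \Sigma^{-1} y, \\
\operatorname{Var}\!\left(a^\top \hat{\beta}_{\mathrm{GLS}}\right) &= a^\top \left(X^\top \Sigma^{-1} X\right)^{-1} a .
\end{aligned}
```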

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The 2020 US Decennial Census is more private than you (might) think
cs.CR 2024-10 unverdicted novelty 6.0

Using f-differential privacy to track losses across eight geographic levels, the 2020 Census provides stronger privacy than its nominal guarantees, enabling 15.08-24.82% noise variance reduction.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

[1] Abowd, J. M., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., and Zhuravlev, P. (2022). The 2020 Census Disclosure Avoidance System TopDown Algorithm. Harvard Data Science Review, (Special Issue 2). https://hdsr.mitpress.mit.edu/pub/7evz361i

[2] Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of Royal Statistical Society, 55:42–48.

[3] Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., and Keller, S. A. (2024). An approximate Monte Carlo simulation method for estimating uncertainty and constructing confidence intervals for 2020 Census statistics. http://arxiv.org/abs/2503.19714

[4] Bun, M. and Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer.

[5] Canonne, C. L., Kamath, G., and Steinke, T. (2020). The discrete Gaussian for differential privacy. Advances in Neural Information Processing Systems, 33:15676–15688.

[6] Cumings-Menon, R., Ashmead, R., Kifer, D., Leclerc, P., Ocker, J., Ratcliffe, M., Zhuravlev, P., and Abowd, J. (2024). Geographic spines in the 2020 Census disclosure avoidance system. Journal of Privacy and Confidentiality, 14(3).

[7] Cumings-Menon, R., Ashmead, R., Kifer, D., Leclerc, P., Spence, M., Zhuravlev, P., and Abowd, J. M. (2023). Disclosure avoidance for the 2020 Census Demographic and Housing Characteristics File. arXiv preprint arXiv:2312.10863.

[8] Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.

[9] Greene, W. H. (2003). Econometric Analysis. Prentice Hall.

[10] Hay, M., Rastogi, V., Miklau, G., and Suciu, D. (2010). Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1).

[11] Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Review, 23(1):53–60.

[12] Honaker, J. (2015). Efficient use of differentially private binary trees. Theory and Practice of Differential Privacy (TPDP 2015), London, UK, 2:26–27.

[13] Li, C., Hay, M., Rastogi, V., Miklau, G., and McGregor, A. (2010). Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 123–134.

[14] U.S. Census Bureau (2023a). Decennial Census P.L. 94-171 Redistricting Data.

[15] U.S. Census Bureau (2023b). Developing the DAS: Demonstration Data and Progress Metrics.

[16] Willsky, A. S. (2002). Multiresolution Markov models for signal and image processing. Proceedings of the IEEE, 90(8):1396–1458.

[17] Xu, J., Zhang, Z., Xiao, X., Yang, Y., Yu, G., and Winslett, M. (2013). Differentially private histogram publication. The VLDB Journal, 22:797–822.
