
arxiv: 2404.13164 · v3 · submitted 2024-04-19 · 📊 stat.CO · cs.CR

Least Squares Estimation For Hierarchical Data

Pith reviewed 2026-05-24 02:16 UTC · model grok-4.3

classification 📊 stat.CO · cs.CR
keywords least squares estimation · hierarchical data · generalized least squares · census disclosure avoidance · noisy measurements · confidence intervals · geographic hierarchy · variance estimation

The pith

An algorithm leveraging geographic hierarchy computes generalized least squares estimates for high-dimensional census data without the full covariance matrix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an algorithm that uses the hierarchical structure of geographic levels in census data to compute very high dimensional least squares estimates efficiently. It proves that the output of this algorithm equals the generalized least squares estimator. The work also shows how to obtain variances of linear functions of the estimator and demonstrates confidence interval computation in a numerical experiment. An experimental data product applies the estimator to the 2020 Census noisy measurements at multiple geographic levels.

Core claim

The algorithm leverages the hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, the paper shows that this algorithm's output is equal to the generalized least squares estimator, describes how to find the variance of linear functions of this estimator, and provides a numerical experiment in which confidence intervals of tabulations are computed based on this estimator.

What carries the argument

A recursive or block-wise algorithm that exploits the hierarchy of nation, states, counties, tracts, and blocks to compute the least squares solution without the full dense covariance matrix.

If this is right

  • The generalized least squares estimator becomes computable for very high dimensions using only the hierarchical structure.
  • Variances of arbitrary linear functions of the estimator can be obtained directly from the algorithm.
  • Confidence intervals for population tabulations can be derived from the noisy measurements.
  • An experimental data product supplies the necessary inputs for all tabulations in the 2020 Redistricting Data File at U.S., state, county, and tract levels.
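As a hedged illustration of the variance and confidence-interval points above, the sketch below uses a toy two-level hierarchy (one parent, three children) with dense matrices, which is exactly what the paper's algorithm is designed to avoid at census scale. The design matrix `X`, the noise variances, the measurement vector `y`, and the helper `gls_with_ci` are assumptions made for this example. The identity it relies on, Var(aᵀβ̂) = aᵀ(XᵀWX)⁻¹a, is the standard GLS result when the noise variances are known.

```python
from statistics import NormalDist

import numpy as np


def gls_with_ci(X, y, variances, a, level=0.95):
    """Return (estimate, variance, (lo, hi)) for the linear function a @ beta."""
    W = np.diag(1.0 / np.asarray(variances, dtype=float))  # inverse noise variances
    cov_beta = np.linalg.inv(X.T @ W @ X)   # covariance of the GLS estimator
    beta_hat = cov_beta @ X.T @ W @ y       # GLS estimate of the child counts
    est = float(a @ beta_hat)
    var = float(a @ cov_beta @ a)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided normal quantile
    half = z * var ** 0.5
    return est, var, (est - half, est + half)


# Hypothetical toy inputs: three child measurements plus one parent total.
X = np.vstack([np.eye(3), np.ones((1, 3))])
variances = np.array([4.0, 4.0, 4.0, 1.0])
y = np.array([10.0, 20.0, 30.0, 63.0])
a = np.array([1.0, 1.0, 0.0])  # tabulation: first child + second child

est, var, (lo, hi) = gls_with_ci(X, y, variances, a)
```

In this toy case the variance of the two-child tabulation drops below 8, the value obtained by adding the two raw child measurements, because the parent measurement contributes information.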

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structure-exploiting approach may apply to any estimation problem whose covariance exhibits a nested hierarchy.
  • Data users gain the ability to quantify uncertainty in census tabulations using only the publicly released noisy measurements.
  • The method could support repeated estimation as new noisy measurements become available without recomputing from scratch.

Load-bearing premise

The hierarchical geographic levels permit an efficient recursive or block-wise computation of the least-squares solution without requiring the full dense covariance matrix.
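For intuition about this premise, consider the smallest possible hierarchy: one parent with k children, scalar counts, child noise variances v_1, ..., v_k, and parent noise variance v_0. This toy setting and its notation are assumptions made here, not the paper's general construction. With X stacking the identity over a row of ones and W holding the inverse variances, the normal-equations matrix is diagonal plus rank one, so the Sherman-Morrison identity gives the inverse and the estimator in closed form without any dense factorization.

```latex
\begin{align*}
X^\top W X &= D^{-1} + \tfrac{1}{v_0}\,\mathbf{1}\mathbf{1}^\top,
  \qquad D = \operatorname{diag}(v_1,\dots,v_k), \\
\left(X^\top W X\right)^{-1} &= D - \frac{D\,\mathbf{1}\mathbf{1}^\top D}{v_0 + V},
  \qquad V = \sum_{j=1}^{k} v_j, \\
\hat\beta_i &= y_i + \frac{v_i}{v_0 + V}\Bigl(y_0 - \sum_{j=1}^{k} y_j\Bigr).
\end{align*}
```

Each child estimate is its own measurement plus a share of the parent-versus-children discrepancy, with the share proportional to that child's noise variance.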

What would settle it

A side-by-side computation on a small hierarchical dataset where the algorithm output differs from the generalized least squares estimator obtained by direct matrix methods.
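A minimal sketch of such a side-by-side check is given below, assuming a single parent with four children and scalar counts. The function `two_pass` is a standard bottom-up and top-down inverse-variance scheme used here as a stand-in; it is not the paper's algorithm, and all names, variances, and counts are hypothetical.

```python
import numpy as np


def direct_gls(y_children, y_parent, v_children, v_parent):
    """GLS via the dense normal equations (feasible only for tiny problems)."""
    k = len(y_children)
    X = np.vstack([np.eye(k), np.ones((1, k))])
    y = np.append(y_children, y_parent)
    w = 1.0 / np.append(v_children, v_parent)  # diagonal of the weight matrix
    XtW = X.T * w                              # scales column j of X.T by w[j]
    return np.linalg.solve(XtW @ X, XtW @ y)


def two_pass(y_children, y_parent, v_children, v_parent):
    """Bottom-up then top-down pass that never forms a k-by-k matrix."""
    s = y_children.sum()   # parent total implied by the child measurements
    V = v_children.sum()   # variance of that implied total
    # Bottom-up: inverse-variance average of the two estimates of the total.
    total = (y_parent / v_parent + s / V) / (1.0 / v_parent + 1.0 / V)
    # Top-down: spread the adjustment across children in proportion to variance.
    return y_children + v_children / V * (total - s)


rng = np.random.default_rng(0)
truth = np.array([120.0, 45.0, 300.0, 8.0])   # hypothetical true counts
v_children = np.array([4.0, 4.0, 9.0, 1.0])   # hypothetical noise variances
v_parent = 2.0
y_children = truth + rng.normal(0.0, np.sqrt(v_children))
y_parent = truth.sum() + rng.normal(0.0, np.sqrt(v_parent))

beta_direct = direct_gls(y_children, y_parent, v_children, v_parent)
beta_tree = two_pass(y_children, y_parent, v_children, v_parent)
max_gap = float(np.max(np.abs(beta_direct - beta_tree)))
```

A `max_gap` at the level of floating-point round-off is what agreement looks like in this toy case; a gap well above that would be the kind of counterexample described above.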

Figures

Figures reproduced from arXiv: 2404.13164 by Pavel Zhuravlev, Ryan Cumings-Menon.

Figure 1. This is a graphical depiction of the 2020 tabulation US spine, as defined by the Geography Division of the Census Bureau. The right column provides the geolevel names and indices. The left side of the figure provides an example of a path from the US geounit to a block geounit. The census geographic codes for the geounits on this path are provided in parentheses.

Figure 2. In this section we derive Cov(β̃(u), β̃(v)) for the three types of adjacency relationships between u, v ∈ G represented in the subplots above: u and v are siblings (A); v is an ancestor vertex of u (B); and cases in which neither u or v are a direct descendant of the other (C). Dotted edges are used to denote the portions of ω(u, v) that include an arbitrary number of vertices.

Figure 3. Q-Q Plot: The points in this plot can be viewed as points on a parametric curve, parameterized by the quantile q ∈ (0, 1). The horizontal axis provides the q quantile of the standard normal distribution and the vertical axis provides the sample quantile of the Z-scores of the total population estimates for tabulation block groups. The line provides the identity function, f(x) = x.

Figure 4. The sub-graph G_g considered in Example 2.

Figure 5. The rooted tree considered in Example 3.
Original abstract

The U.S. Census Bureau's 2020 Disclosure Avoidance System (DAS) bases its output on noisy measurements, which are population tabulations added to realizations of mean-zero random variables. These noisy measurements are observed for a set of hierarchical geographic levels, e.g., the U.S. as a whole, states, counties, census tracts, and census blocks. The Census Bureau released the noisy measurements generated in the DAS executions for the two primary 2020 Census data products, in part to allow data users to assess uncertainty in 2020 Census tabulations introduced by disclosure avoidance. This paper describes an algorithm that can leverage the hierarchical structure of the input data in order to compute very high dimensional least squares estimates in a computationally efficient manner. Afterward, we show that this algorithm's output is equal to the generalized least squares estimator, describe how to find the variance of linear functions of this estimator, and provide a numerical experiment in which we compute confidence intervals of tabulations based on this estimator. We also describe an accompanying Census Bureau experimental data product that applies this estimator to the publicly available noisy measurements to provide data users with the inputs required to derive confidence intervals for all tabulations that were included in the 2020 Redistricting Data File, for the U.S., state, county, and census tract geographic levels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper presents an algorithm that exploits the hierarchical structure of noisy measurements (nation, states, counties, tracts, blocks) from the 2020 Census DAS to compute high-dimensional least-squares estimates efficiently. It asserts that the algorithm output equals the generalized least squares (GLS) estimator, supplies formulas for the variance of linear functions of the estimator, reports a numerical experiment producing confidence intervals, and describes an accompanying experimental data product for the Redistricting Data File at U.S., state, county, and tract levels.

Significance. If the claimed equivalence to GLS holds and the variance formulas are correctly derived, the work supplies a practical, scalable route to uncertainty quantification for census tabulations that avoids explicit formation or inversion of the full dense covariance matrix. The release of an experimental data product that supplies the necessary inputs for users to form confidence intervals constitutes a direct, usable contribution to the statistical infrastructure around the 2020 Census releases.

minor comments (3)
  1. [Abstract] The abstract states that equivalence to GLS is shown 'afterward,' but the manuscript would benefit from an explicit pointer (e.g., 'see §4, Theorem 1') immediately after the algorithm description so readers can locate the proof without searching.
  2. [Introduction / §2] Notation for the hierarchical levels and the associated design matrices is introduced gradually; a single consolidated table or diagram early in the paper that lists the levels, their dimensions, and the corresponding blocks of the covariance structure would improve readability.
  3. [Numerical experiment] The numerical experiment section reports confidence-interval coverage but does not state the number of Monte Carlo replications or the random seed; adding these details would make the experiment fully reproducible from the description alone.
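To illustrate the third comment, the sketch below shows a coverage experiment that reports both its replication count and its seed. It is a toy under stated assumptions: a two-level hierarchy with continuous Gaussian noise, dense GLS, and made-up values for `SEED`, `N_REPLICATIONS`, the true counts, and the variances. It does not reproduce the paper's experiment.

```python
from statistics import NormalDist

import numpy as np

SEED = 20240419          # hypothetical seed, stated so the run can be repeated
N_REPLICATIONS = 2000    # hypothetical replication count

rng = np.random.default_rng(SEED)
truth = np.array([50.0, 80.0, 20.0])           # hypothetical child counts
variances = np.array([4.0, 4.0, 4.0, 1.0])     # three children, then the parent
X = np.vstack([np.eye(3), np.ones((1, 3))])
W = np.diag(1.0 / variances)
cov_beta = np.linalg.inv(X.T @ W @ X)          # covariance of the GLS estimator
a = np.array([1.0, 0.0, 1.0])                  # tabulation of interest
target = float(a @ truth)
half_width = NormalDist().inv_cdf(0.975) * float(a @ cov_beta @ a) ** 0.5

hits = 0
for _ in range(N_REPLICATIONS):
    # Fresh noisy measurements for every replication.
    y = X @ truth + rng.normal(0.0, np.sqrt(variances))
    est = float(a @ (cov_beta @ X.T @ W @ y))
    hits += abs(est - target) <= half_width    # does the 95% interval cover?

coverage = hits / N_REPLICATIONS
```

With the seed and replication count fixed in the text, another reader can rerun the loop and obtain the same `coverage` value, which should sit near the nominal 0.95 in this Gaussian toy.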

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the manuscript, recognition of its potential contribution to uncertainty quantification for 2020 Census data, and recommendation of minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity; GLS equivalence is externally defined

full rationale

The paper presents a hierarchical algorithm for least-squares estimation on noisy Census measurements, then derives that its output equals the generalized least squares estimator and provides variance formulas. This equivalence is shown after the algorithm is defined and is to an externally standard statistical target (GLS), not to any fitted parameter or self-referential quantity within the paper. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or reader's assessment; the derivation chain is self-contained against the standard GLS definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mean-zero noise model for the DAS measurements and the existence of a hierarchical nesting that permits efficient computation; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Noisy measurements are population tabulations added to realizations of mean-zero random variables.
    Stated in the first sentence of the abstract as the basis for the DAS output.
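
A minimal sketch of this noise model follows, assuming continuous Gaussian noise with one shared scale; the abstract only requires the noise to be mean-zero. The tract names, block counts, `SIGMA`, and the helper `noisy` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical block-level counts grouped by made-up tract identifiers.
blocks = {"tract_A": np.array([12.0, 0.0, 7.0]), "tract_B": np.array([30.0, 5.0])}
SIGMA = 2.0  # assumed noise scale, identical at every level for simplicity


def noisy(value):
    """Tabulation plus one realization of mean-zero Gaussian noise."""
    return value + rng.normal(0.0, SIGMA)


# One noisy measurement per geounit at each level of the toy hierarchy.
measurements = {"county": noisy(sum(b.sum() for b in blocks.values()))}
for tract, counts in blocks.items():
    measurements[tract] = noisy(counts.sum())
    for i, c in enumerate(counts):
        measurements[f"{tract}/block_{i}"] = noisy(c)

# Averaging many independent draws of the county measurement recovers the
# true total of 54, which is the practical meaning of "mean-zero" here.
draws = np.array([noisy(54.0) for _ in range(20000)])
bias = float(draws.mean() - 54.0)
```

The county total is measured once directly and is also implied by the tract and block measurements; that redundancy across levels is what a least squares estimator can exploit.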

pith-pipeline@v0.9.0 · 5758 in / 1276 out tokens · 26573 ms · 2026-05-24T02:16:42.577193+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The 2020 US Decennial Census is more private than you (might) think

    cs.CR · 2024-10 · unverdicted · novelty 6.0

    Using f-differential privacy to track losses across eight geographic levels, the 2020 Census provides stronger privacy than its nominal guarantees, enabling 15.08-24.82% noise variance reduction.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 1 Pith paper

  1. [1]

    Abowd, J. M., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., and Zhuravlev, P. (2022). The 2020 Census Disclosure Avoidance System TopDown Algorithm. Harvard Data Science Review, (Special Issue 2). https://hdsr.mitpress.mit.edu/pub/7evz361i

  2. [2]

    Aitken, A. C. (1935). On least squares and linear combination of observations. Proceedings of Royal Statistical Society, 55:42–48

  3. [3]

    Ashmead, R., Hawes, M. B., Pritts, M., Zhuravlev, P., and Keller, S. A. (2024). An approximate Monte Carlo simulation method for estimating uncertainty and constructing confidence intervals for 2020 Census statistics. http://arxiv.org/abs/2503.19714

  4. [4]

    Bun, M. and Steinke, T. (2016). Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer

  5. [5]

    Canonne, C. L., Kamath, G., and Steinke, T. (2020). The discrete Gaussian for differential privacy. Advances in Neural Information Processing Systems, 33:15676–15688

  6. [6]

    Cumings-Menon, R., Ashmead, R., Kifer, D., Leclerc, P., Ocker, J., Ratcliffe, M., Zhuravlev, P., and Abowd, J. (2024). Geographic spines in the 2020 Census disclosure avoidance system. Journal of Privacy and Confidentiality, 14(3)

  7. [7]

    Cumings-Menon, R., Ashmead, R., Kifer, D., Leclerc, P., Spence, M., Zhuravlev, P., and Abowd, J. M. (2023). Disclosure avoidance for the 2020 Census Demographic and Housing Characteristics File. arXiv preprint arXiv:2312.10863

  8. [8]

    Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer

  9. [9]

    Greene, W. H. (2003). Econometric analysis. Prentice Hall

  10. [10]

    Hay, M., Rastogi, V., Miklau, G., and Suciu, D. (2010). Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment, 3(1)

  11. [11]

    Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Review, 23(1):53–60

  12. [12]

    Honaker, J. (2015). Efficient use of differentially private binary trees. Theory and Practice of Differential Privacy (TPDP 2015), London, UK, 2:26–27

  13. [13]

    Li, C., Hay, M., Rastogi, V., Miklau, G., and McGregor, A. (2010). Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 123–134

  14. [14]

    U.S. Census Bureau (2023a). Decennial Census P.L. 94-171 Redistricting Data

  15. [15]

    U.S. Census Bureau (2023b). Developing the DAS: Demonstration Data and Progress Metrics

  16. [16]

    Willsky, A. S. (2002). Multiresolution Markov models for signal and image processing. Proceedings of the IEEE, 90(8):1396–1458

  17. [17]

    Xu, J., Zhang, Z., Xiao, X., Yang, Y., Yu, G., and Winslett, M. (2013). Differentially private histogram publication. The VLDB Journal, 22:797–822