pith. sign in

arxiv: 2410.18915 · v4 · pith:S2RDGBXPnew · submitted 2024-10-24 · 💻 cs.DS · cs.LG

Testing Support Size More Efficiently Than Learning Histograms

Pith reviewed 2026-05-23 19:16 UTC · model grok-4.3

classification 💻 cs.DS cs.LG
keywords support size testingdistribution testingsample complexityChebyshev polynomialshistogram learningproperty testingtotal variation distance
0
0 comments X

The pith

Support size testing can be done with O(n/(ε log n) log(1/ε)) samples by analyzing Chebyshev approximations outside their standard range.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that testing whether an unknown distribution has support size at most n or is ε-far from any distribution with that support size requires fewer samples than learning the full histogram of the distribution. It achieves a sample complexity of O(n/(ε log n) log(1/ε)), which improves on the Θ(n/(ε² log n)) bound from histogram learning and comes close to the Ω(n/(ε log n)) lower bound. The same approach also produces stronger lower bounds on the support size given a fixed number of samples. The improvement rests on extending the analysis of Chebyshev polynomial approximations beyond the interval where they are known to work well. A reader would care because the result separates the sample needs of testing this property from those of learning the entire distribution.

Core claim

Testing whether p is supported on at most n elements or ε-far from any such distribution in total variation distance can be performed with O(n/(ε log n) log(1/ε)) samples from p; the algorithm also yields better lower bounds on support size than prior methods, and the proof proceeds by analyzing Chebyshev polynomial approximations outside the range where they were originally designed to approximate well.

What carries the argument

Chebyshev polynomial approximations analyzed outside their designed range to bound the tester's sample complexity.

If this is right

  • Support size testing requires only O(n/(ε log n) log(1/ε)) samples rather than the Θ(n/(ε² log n)) needed for histogram learning.
  • The achieved upper bound nearly matches the known Ω(n/(ε log n)) lower bound.
  • The same method produces larger lower bounds on support size than what follows from previous histogram-based approaches.
  • The Chebyshev polynomial technique supplies a self-contained proof for the improved tester.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The efficiency gain may indicate that testing other simple distribution properties can also avoid the full cost of learning.
  • Extensions of the Chebyshev analysis could tighten bounds for related testing tasks such as uniformity or monotonicity.
  • The separation between testing and learning sample complexities might be quantified more sharply for support-size-like properties.

Load-bearing premise

The analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations is valid and directly yields the improved sample complexity for support size testing.

What would settle it

A concrete distribution for which any tester using o(n/(ε log n) log(1/ε)) samples fails to distinguish support size at most n from being ε-far would falsify the claimed upper bound.

read the original abstract

Consider two problems about an unknown probability distribution $p$: 1. How many samples from $p$ are required to test if $p$ is supported on $n$ elements or not? Specifically, given samples from $p$, determine whether it is supported on at most $n$ elements, or it is "$\epsilon$-far" (in total variation distance) from being supported on $n$ elements. 2. Given $m$ samples from $p$, what is the largest lower bound on its support size that we can produce? The best known upper bound for problem (1) uses a general algorithm for learning the histogram of the distribution $p$, which requires $\Theta(\tfrac{n}{\epsilon^2 \log n})$ samples. We show that testing can be done more efficiently than learning the histogram, using only $O(\tfrac{n}{\epsilon \log n} \log(1/\epsilon))$ samples, nearly matching the best known lower bound of $\Omega(\tfrac{n}{\epsilon \log n})$. This algorithm also provides a better solution to problem (2), producing larger lower bounds on support size than what follows from previous work. The proof relies on an analysis of Chebyshev polynomial approximations outside the range where they are designed to be good approximations, and the paper is intended as an accessible self-contained exposition of the Chebyshev polynomial method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that support-size testing (distinguishing whether an unknown distribution p has support size ≤n or is ε-far in TV distance from any distribution with support ≤n) can be performed with O(n/(ε log n) log(1/ε)) samples. This improves on the Θ(n/(ε² log n)) sample bound obtained by reducing to histogram learning and nearly matches the Ω(n/(ε log n)) lower bound. The same technique yields improved lower bounds on support size. The proof is presented as a self-contained analysis of Chebyshev polynomial approximations extended outside their standard approximation interval.

Significance. If the central analysis holds, the result is significant for distribution property testing: it separates testing from learning for a natural property and demonstrates that Chebyshev-based polynomial methods can yield near-optimal testers without requiring full histogram recovery. The self-contained exposition of the polynomial technique is a positive contribution to the literature.

major comments (1)
  1. [§4] §4 (Chebyshev analysis outside the design interval): The central sample-complexity improvement from Θ(1/ε²) to O(1/ε) rests on the claim that the error incurred by the Chebyshev approximation when evaluated outside its nominal range remains small enough not to re-introduce an extra 1/ε factor. The manuscript states that the analysis is self-contained, but the provided derivation does not explicitly bound the out-of-range contribution (e.g., via a concrete lemma relating the minimax error on an enlarged interval to the final tester threshold). Without this bound, it is unclear whether the stated O(n/(ε log n) log(1/ε)) bound is achieved or whether hidden factors cancel the improvement.
minor comments (1)
  1. [§1] The abstract and introduction use “n” both for the support-size parameter and (implicitly) for the universe size; a single clarifying sentence in §1 would remove ambiguity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting the significance of separating testing from learning in this setting. We address the sole major comment below.

read point-by-point responses
  1. Referee: [§4] §4 (Chebyshev analysis outside the design interval): The central sample-complexity improvement from Θ(1/ε²) to O(1/ε) rests on the claim that the error incurred by the Chebyshev approximation when evaluated outside its nominal range remains small enough not to re-introduce an extra 1/ε factor. The manuscript states that the analysis is self-contained, but the provided derivation does not explicitly bound the out-of-range contribution (e.g., via a concrete lemma relating the minimax error on an enlarged interval to the final tester threshold). Without this bound, it is unclear whether the stated O(n/(ε log n) log(1/ε)) bound is achieved or whether hidden factors cancel the improvement.

    Authors: We agree that an explicit bound on the out-of-range error would make the argument clearer and easier to verify. In the revised version we will add a short lemma (new Lemma 4.3) that directly relates the minimax error of the degree-d Chebyshev approximant on the enlarged interval [0,1+δ] (with δ = Θ(ε log n / n) as used by the tester) to the acceptance threshold. The lemma uses the standard bound |T_k(x)| ≤ (x + sqrt(x²-1))^k for x>1 together with the concrete choice of d and the approximation interval in our construction to show that the extraneous error is at most O(ε / log(1/ε)). This is absorbed into the existing O(log(1/ε)) factor without restoring a 1/ε² dependence, confirming that the claimed O(n/(ε log n) log(1/ε)) sample bound holds. We thank the referee for prompting this clarification. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is self-contained polynomial analysis

full rationale

The paper derives an improved sample complexity for support size testing via analysis of Chebyshev polynomial approximations, presented as a self-contained exposition. No load-bearing steps reduce to self-citations, fitted inputs renamed as predictions, or self-definitional equivalences. The central upper bound O(n/(ε log n) log(1/ε)) is obtained from explicit approximation error bounds rather than by construction from the input histogram learner or prior author results. This matches the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard tools from approximation theory and probability; no free parameters, new entities, or ad-hoc axioms are indicated in the abstract.

axioms (1)
  • standard math Standard properties of Chebyshev polynomials and total variation distance in distribution testing
    The proof relies on extending Chebyshev approximations beyond their designed range.

pith-pipeline@v0.9.0 · 5777 in / 1273 out tokens · 36206 ms · 2026-05-23T19:16:21.376259+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. [1]

    A unified maximum likelihood approach for estimating symmetric properties of discrete distributions

    Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh. A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In , volume 70, pages 11--21, 2017

  2. [2]

    Refining the adaptivity notion in the huge object model

    Tomer Adar and Eldar Fischer. Refining the adaptivity notion in the huge object model. In , 2024

  3. [3]

    Support testing in the huge object model

    Tomer Adar, Eldar Fischer, and Amit Levi. Support testing in the huge object model. In , 2024

  4. [4]

    Distribution testing lower bounds via reductions from communication complexity

    Eric Blais, Clément L Canonne, and Tom Gur. Distribution testing lower bounds via reductions from communication complexity. , 11(2):1--37, 2019

  5. [5]

    Learnability and the vapnik-chervonenkis dimension

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM) , 36(4):929--965, 1989

  6. [6]

    Estimating the number of species: a review

    John Bunge and Michael Fitzpatrick. Estimating the number of species: a review. Journal of the American statistical Association , 88(421):364--373, 1993

  7. [7]

    VC dimension and distribution-free sample-based testing

    Eric Blais, Renato Ferreira Pinto Jr , and Nathaniel Harms. VC dimension and distribution-free sample-based testing. In , 2021

  8. [8]

    A survey on distribution testing: Your data is big

    Cl \'e ment Canonne. A survey on distribution testing: Your data is big. but is it blue? , 2020

  9. [9]

    Testing for families of distributions via the fourier transform

    Cl \'e ment L Canonne, Ilias Diakonikolas, and Alistair Stewart. Testing for families of distributions via the fourier transform. In , 2018

  10. [10]

    The relation between the number of species and the number of individuals in a random sample of an animal population

    Ronald A Fisher, A Steven Corbet, and Carrington B Williams. The relation between the number of species and the number of individuals in a random sample of an animal population. The Journal of Animal Ecology , pages 42--58, 1943

  11. [11]

    Distribution testing under the parity trace

    Renato Ferreira Pinto Jr. and Nathaniel Harms. Distribution testing under the parity trace. arXiv preprint arXiv:2304.01374 , 2023

  12. [12]

    Property testing and its connection to learning and approximation

    Oded Goldreich, Shafi Goldwasser, and Dana Ron. Property testing and its connection to learning and approximation. , 45(4):653--750, 1998

  13. [13]

    Introduction to property testing

    Oded Goldreich. Introduction to property testing . Cambridge University Press, 2017

  14. [14]

    On the complexity of estimating the effective support size

    Oded Goldreich. On the complexity of estimating the effective support size. ECCC, TR19-088 , 2019

  15. [15]

    Testing Bipartitness in an Augmented VDF Bounded-Degree Graph Model

    Oded Goldreich. Testing bipartitness in an augmented vdf bounded-degree graph model. arXiv preprint arXiv:1905.03070 , 2019

  16. [16]

    On the estimation of the number of classes in a population

    Leo A Goodman. On the estimation of the number of classes in a population. The Annals of Mathematical Statistics , 20(4):572--579, 1949

  17. [17]

    I.J. Good. The population frequencies of species and the estimation of population parameters. Biometrika , pages 237--264, 1953

  18. [18]

    On sample-based testers

    Oded Goldreich and Dana Ron. On sample-based testers. , 8(2):1--54, 2016

  19. [19]

    On the relation between the relative earth mover distance and the variation distance (an exposition)

    Oded Goldreich and Dana Ron. On the relation between the relative earth mover distance and the variation distance (an exposition). In Oded Goldreich, editor, Computational Complexity and Property Testing: On the Interplay Between Randomness and Computation , pages 141--151. Springer Cham, 2020

  20. [20]

    Testing distributions of huge objects

    Oded Goldreich and Dana Ron. Testing distributions of huge objects. TheoretiCS , 2, 2023

  21. [21]

    The optimal sample complexity of PAC learning

    Steve Hanneke. The optimal sample complexity of PAC learning. , 17(38):1--15, 2016

  22. [22]

    CSTheory stackexchange answer

    Steve Hanneke. CSTheory stackexchange answer. https://cstheory.stackexchange.com/questions/40161/proper-pac-learning-vc-dimension-bounds, 2019. Accessed 2024-09-10

  23. [23]

    Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance

    Yanjun Han, Jiantao Jiao, and Tsachy Weissman. Local moment matching: A unified methodology for symmetric functional estimation and distribution estimation under Wasserstein distance. In , pages 3189--3221. PMLR, 2018

  24. [24]

    The broad optimality of profile maximum likelihood

    Yi Hao and Alon Orlitsky. The broad optimality of profile maximum likelihood. , 2019

  25. [25]

    Unified sample-optimal property estimation in near-linear time

    Yi Hao and Alon Orlitsky. Unified sample-optimal property estimation in near-linear time. In , 2019

  26. [26]

    Online versus offline adversaries in property testing

    Esty Kelman, Ephraim Linder, and Sofya Raskhodnikova. Online versus offline adversaries in property testing. In , pages 65:1--65:18, 2025

  27. [27]

    Estimating the effective support size in constant query complexity

    Shyam Narayanan and Jakub T e tek. Estimating the effective support size in constant query complexity. In , pages 242--252, 2023

  28. [28]

    OEIS sequence A008310

    OEIS . OEIS sequence A008310 . https://oeis.org/A008310. Accessed 2024-08-13

  29. [29]

    Tolerant property testing and distance approximation

    Michal Parnas, Dana Ron, and Ronitt Rubinfeld. Tolerant property testing and distance approximation. Journal of Computer and System Sciences , 72(6):1012--1042, 2006

  30. [30]

    Almost optimal distribution-free sample-based testing of k-modality

    Dana Ron and Asaf Rosin. Almost optimal distribution-free sample-based testing of k-modality. In , 2020

  31. [31]

    Optimal distribution-free sample-based testing of subsequence-freeness with one-sided error

    Dana Ron and Asaf Rosin. Optimal distribution-free sample-based testing of subsequence-freeness with one-sided error. , 14(1):1--31, 2022

  32. [32]

    Strong lower bounds for approximating distribution support size and the distinct elements problem

    Sofya Raskhodnikova, Dana Ron, Amir Shpilka, and Adam Smith. Strong lower bounds for approximating distribution support size and the distinct elements problem. , 39(3):813--842, 2009

  33. [33]

    Estimating the unseen: an n/ (n) -sample estimator for entropy and support size, shown optimal via new CLTs

    Gregory Valiant and Paul Valiant. Estimating the unseen: an n/ (n) -sample estimator for entropy and support size, shown optimal via new CLTs . In , 2011

  34. [34]

    The power of linear estimators

    Gregory Valiant and Paul Valiant. The power of linear estimators. In , 2011

  35. [35]

    Instance optimal learning of discrete distributions

    Gregory Valiant and Paul Valiant. Instance optimal learning of discrete distributions. In , 2016

  36. [36]

    Estimating the unseen: improved estimators for entropy and other properties

    Gregory Valiant and Paul Valiant. Estimating the unseen: improved estimators for entropy and other properties. , 2017

  37. [37]

    Chebyshev polynomials, moment matching, and optimal estimation of the unseen

    Yihong Wu and Pengkun Yang. Chebyshev polynomials, moment matching, and optimal estimation of the unseen. The Annals of Statistics , 47(2):857--883, 2019

  38. [38]

    Polynomial methods in statistical inference: theory and practice

    Yihong Wu and Pengkun Yang. Polynomial methods in statistical inference: theory and practice . Foundations and Trends in Communications and Information Theory. now Publishers, 2020