On (not) learning the M\"obius function
Pith reviewed 2026-05-08 07:05 UTC · model grok-4.3
The pith
Standard learning methods fail to efficiently learn the Möbius or Liouville function.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We prove lower bounds on learning the Möbius or Liouville function with a variety of standard learning techniques, including kernel methods, noisy gradient methods, and correlational statistical query algorithms. These results follow from quantitative bounds on the correlation of Möbius with digital characters of various finite abelian groups, where the group is dictated by the type of input data the algorithm is given. Using residues mod p for many different primes corresponds to a cyclic group, and using the base p expansion for a fixed prime corresponds to an elementary abelian p-group. We also note that lower bounds of this form are closely related to certain types of digital prime umber
What carries the argument
Quantitative bounds on the correlation of the Möbius function with digital characters of finite abelian groups determined by the input representation.
If this is right
- Kernel methods cannot learn the Möbius function efficiently under the given input models.
- Noisy gradient methods and correlational statistical query algorithms also fail to learn it.
- The Liouville function is subject to the same lower bounds.
- Learning lower bounds of this type relate closely to digital prime number theorems.
Where Pith is reading between the lines
- Algorithms relying on features from these group characters will be ineffective for the Möbius function.
- Methods that capture multiplicative correlations might bypass these barriers.
- Similar hardness results may hold for other arithmetic functions with pseudorandom properties.
- Proving stronger correlation bounds could strengthen the learning lower bounds further.
Load-bearing premise
The Möbius function has correlations with digital characters of the relevant finite abelian groups that are small enough to yield the lower bounds for the learning algorithms.
What would settle it
A successful demonstration of any of the considered learning techniques achieving non-trivial accuracy on the Möbius function for the described input types would falsify the lower bounds.
read the original abstract
We prove lower bounds on learning the M\"obius or Liouville function with a variety of standard learning techniques, including kernel methods, noisy gradient methods, and correlational statistical query algorithms. These results follow from quantitative bounds on the correlation of M\"obius with digital characters of various finite abelian groups, where the group is dictated by the type of input data the algorithm is given. Using residues mod $p$ for many different primes corresponds to a cyclic group, and using the base $p$ expansion for a fixed prime corresponds to an elementary abelian $p$-group. We also note that lower bounds of this form are closely related to certain types of digital prime number theorems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves lower bounds on learning the Möbius and Liouville functions using kernel methods, noisy gradient methods, and correlational statistical query algorithms. These lower bounds are derived from quantitative upper bounds on the correlation of the Möbius function with digital characters of finite abelian groups (cyclic groups arising from residues modulo many primes, and elementary abelian p-groups from base-p digit expansions). The work also notes a connection between such correlation bounds and certain digital prime number theorems.
Significance. If the stated quantitative correlation bounds hold, the results establish a concrete bridge between analytic number theory and learning theory by showing that the pseudorandomness properties of the Möbius function (via its small correlations with digital characters) imply explicit hardness for several standard learning paradigms. The explicitness of the lower bounds and the link to digital prime number theorems are strengths that could stimulate further work at the intersection of the two fields.
major comments (2)
- [§3, Theorem 3.2] §3, Theorem 3.2 (cyclic case): the quantitative correlation bound is stated in a form that depends on the range of primes; the paper should verify that the implied constant is strong enough to produce a non-trivial learning lower bound (advantage o(1)) for the kernel and SQ settings when the input dimension is polynomial in the number of primes.
- [§5.1] §5.1, the reduction for noisy gradient methods: the argument maps small correlation to a lower bound on the excess risk, but the noise level and step-size regime are not explicitly matched to the correlation decay rate; a short calculation showing that the resulting lower bound remains non-trivial for the stated parameter ranges would strengthen the claim.
minor comments (3)
- [§2] The term 'digital character' is defined in §2 but would benefit from an explicit formula or small example (e.g., for p=3 and 2 digits) to make the group action immediately clear to readers outside number theory.
- [Introduction] Notation for the Liouville function λ(n) is introduced but the text occasionally reverts to μ(n) without comment; a single sentence clarifying that all stated bounds apply verbatim to both functions would remove ambiguity.
- [Related work] The related-work paragraph on correlational SQ algorithms cites the original definition but omits the year and full bibliographic details; adding the standard reference would improve traceability.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the specific suggestions for clarifying the quantitative aspects of our lower bounds. We address each major comment below.
read point-by-point responses
-
Referee: [§3, Theorem 3.2] §3, Theorem 3.2 (cyclic case): the quantitative correlation bound is stated in a form that depends on the range of primes; the paper should verify that the implied constant is strong enough to produce a non-trivial learning lower bound (advantage o(1)) for the kernel and SQ settings when the input dimension is polynomial in the number of primes.
Authors: We agree that an explicit verification of the constants is useful. The correlation bound in Theorem 3.2 is of the form |⟨μ, χ⟩| ≪ (log log X / log X)^{1/2} (with the implied constant absolute and independent of the number of primes k, for X in the stated range). In the kernel and correlational SQ settings the advantage is at most this correlation (up to a universal factor). When the input dimension d is polynomial in k, one may take the modulus range up to X = exp(k^ε) for small ε > 0; the resulting correlation is then o(1) as k → ∞. We have added a short remark immediately after the statement of Theorem 3.2 that records this calculation and confirms the advantage tends to zero for the polynomial-dimension regime. revision: yes
-
Referee: [§5.1] §5.1, the reduction for noisy gradient methods: the argument maps small correlation to a lower bound on the excess risk, but the noise level and step-size regime are not explicitly matched to the correlation decay rate; a short calculation showing that the resulting lower bound remains non-trivial for the stated parameter ranges would strengthen the claim.
Authors: We thank the referee for this observation. The reduction in §5.1 shows that excess risk is at least Ω(ρ² / (η L + σ² / η)), where ρ is the correlation, L the Lipschitz constant of the loss, η the step size, and σ² the noise variance. Choosing η = Θ(ρ / L) and σ² = o(ρ²) (which is permitted in the parameter ranges of the theorem) yields excess risk Ω(ρ). Since ρ = o(1) by the correlation bound, the lower bound remains non-trivial. We have inserted this short calculation as a new paragraph in §5.1 of the revised manuscript. revision: yes
Circularity Check
No significant circularity
full rationale
The derivation proceeds by first establishing quantitative upper bounds on the correlation of the Möbius function with digital characters (cyclic groups for residues mod p, elementary abelian p-groups for base-p digits) using number-theoretic techniques, then invoking the standard learning-theoretic implication that sufficiently small correlation with a hypothesis class yields explicit lower bounds for kernel methods, noisy gradient descent, and correlational SQ algorithms. These correlation bounds are proved directly in the paper rather than fitted or defined from the learning results, and the implication step is an external, non-circular reduction. No self-definitional equations, renamed predictions, or load-bearing self-citations appear in the chain; the argument remains self-contained against independent benchmarks in analytic number theory and statistical query learning.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Standard properties of the Möbius and Liouville functions and their behavior under group characters
Reference graph
Works this paper leans on
-
[1]
On the non-universality of deep learning: quantifying the cost of symmetry, 2022
[ABA22] Emmanuel Abbe and Enric Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry, 2022. [ABAM24] Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks, 2024. [AS23] Emmanu...
work page 2022
-
[2]
Revised and with a preface by Hugh L
Springer-Verlag, New York, 3 edition, 2000. Revised and with a preface by Hugh L. Montgomery. [DMR20] Michael Drmota, Christian Mauduit, and Jo¨ el Rivat. Prime numbers in two bases.Duke Math- ematical Journal, 169(10):1809–1876, 2020. [Gre12] Ben Green. On (not) computing the mobius function using bounded depth circuits, 2012. [GRS20] Noah Golowich, Alex...
work page 2000
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.