On (not) learning the M\"obius function

Alexey Pozdnyakov

arxiv: 2604.23427 · v1 · submitted 2026-04-25 · 🧮 math.NT · cs.LG

On (not) learning the M\"obius function

Alexey Pozdnyakov This is my paper

Pith reviewed 2026-05-08 07:05 UTC · model grok-4.3

classification 🧮 math.NT cs.LG

keywords Möbius functionLiouville functionlearning lower boundsstatistical query algorithmsdigital charactersfinite abelian groupsnumber theoretic functions

0 comments

The pith

Standard learning methods fail to efficiently learn the Möbius or Liouville function.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that kernel methods, noisy gradient methods, and correlational statistical query algorithms cannot learn the Möbius or Liouville function to high accuracy. This follows because the Möbius function has only weak correlations with digital characters of finite abelian groups that model the input data, whether as residues modulo many primes or as base-p digit expansions. A sympathetic reader would care because these results identify concrete computational barriers when standard machine learning techniques are applied to a central object in number theory. The work also connects the learning lower bounds to questions about digital versions of the prime number theorem.

Core claim

We prove lower bounds on learning the Möbius or Liouville function with a variety of standard learning techniques, including kernel methods, noisy gradient methods, and correlational statistical query algorithms. These results follow from quantitative bounds on the correlation of Möbius with digital characters of various finite abelian groups, where the group is dictated by the type of input data the algorithm is given. Using residues mod p for many different primes corresponds to a cyclic group, and using the base p expansion for a fixed prime corresponds to an elementary abelian p-group. We also note that lower bounds of this form are closely related to certain types of digital prime umber

What carries the argument

Quantitative bounds on the correlation of the Möbius function with digital characters of finite abelian groups determined by the input representation.

If this is right

Kernel methods cannot learn the Möbius function efficiently under the given input models.
Noisy gradient methods and correlational statistical query algorithms also fail to learn it.
The Liouville function is subject to the same lower bounds.
Learning lower bounds of this type relate closely to digital prime number theorems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Algorithms relying on features from these group characters will be ineffective for the Möbius function.
Methods that capture multiplicative correlations might bypass these barriers.
Similar hardness results may hold for other arithmetic functions with pseudorandom properties.
Proving stronger correlation bounds could strengthen the learning lower bounds further.

Load-bearing premise

The Möbius function has correlations with digital characters of the relevant finite abelian groups that are small enough to yield the lower bounds for the learning algorithms.

What would settle it

A successful demonstration of any of the considered learning techniques achieving non-trivial accuracy on the Möbius function for the described input types would falsify the lower bounds.

read the original abstract

We prove lower bounds on learning the M\"obius or Liouville function with a variety of standard learning techniques, including kernel methods, noisy gradient methods, and correlational statistical query algorithms. These results follow from quantitative bounds on the correlation of M\"obius with digital characters of various finite abelian groups, where the group is dictated by the type of input data the algorithm is given. Using residues mod $p$ for many different primes corresponds to a cyclic group, and using the base $p$ expansion for a fixed prime corresponds to an elementary abelian $p$-group. We also note that lower bounds of this form are closely related to certain types of digital prime number theorems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper derives learning lower bounds for the Möbius function from its weak correlations with digital characters, and the high-level argument holds up without gaps.

read the letter

The main thing to know is that this paper shows standard machine learning techniques cannot learn the Möbius or Liouville function effectively when given digital representations of numbers, because those functions have limited correlation with the characters of the corresponding groups. It does well by proving quantitative bounds on those correlations for cyclic groups from residues modulo many primes and for elementary abelian groups from base-p expansions. From there it derives explicit lower bounds for kernel methods, noisy gradient methods, and correlational statistical query algorithms. The connection to digital prime number theorems adds a nice number-theoretic flavor without overclaiming. The reasoning looks direct: the correlation bounds come from number theory and feed into the learning lower bounds in the usual way, with no obvious circularity. The abstract outlines a clear proof strategy. The soft spots are minor. The lower bounds apply only to the listed classes of algorithms, so they leave open whether other methods could succeed. Without the full details, it's unclear how tight the correlation bounds are or how large the implied sample sizes become, which affects how strong the negative result is in practice. The paper focuses on these input formats, which is fine but means it doesn't address arbitrary representations. This paper is for researchers interested in the overlap between computational number theory and theoretical machine learning. A reader who knows character sums and statistical query models will get the most out of it. It has enough structure and a clear new application to merit a serious referee. I would send it for peer review. The ideas are worth the time to verify.

Referee Report

2 major / 3 minor

Summary. The paper proves lower bounds on learning the Möbius and Liouville functions using kernel methods, noisy gradient methods, and correlational statistical query algorithms. These lower bounds are derived from quantitative upper bounds on the correlation of the Möbius function with digital characters of finite abelian groups (cyclic groups arising from residues modulo many primes, and elementary abelian p-groups from base-p digit expansions). The work also notes a connection between such correlation bounds and certain digital prime number theorems.

Significance. If the stated quantitative correlation bounds hold, the results establish a concrete bridge between analytic number theory and learning theory by showing that the pseudorandomness properties of the Möbius function (via its small correlations with digital characters) imply explicit hardness for several standard learning paradigms. The explicitness of the lower bounds and the link to digital prime number theorems are strengths that could stimulate further work at the intersection of the two fields.

major comments (2)

[§3, Theorem 3.2] §3, Theorem 3.2 (cyclic case): the quantitative correlation bound is stated in a form that depends on the range of primes; the paper should verify that the implied constant is strong enough to produce a non-trivial learning lower bound (advantage o(1)) for the kernel and SQ settings when the input dimension is polynomial in the number of primes.
[§5.1] §5.1, the reduction for noisy gradient methods: the argument maps small correlation to a lower bound on the excess risk, but the noise level and step-size regime are not explicitly matched to the correlation decay rate; a short calculation showing that the resulting lower bound remains non-trivial for the stated parameter ranges would strengthen the claim.

minor comments (3)

[§2] The term 'digital character' is defined in §2 but would benefit from an explicit formula or small example (e.g., for p=3 and 2 digits) to make the group action immediately clear to readers outside number theory.
[Introduction] Notation for the Liouville function λ(n) is introduced but the text occasionally reverts to μ(n) without comment; a single sentence clarifying that all stated bounds apply verbatim to both functions would remove ambiguity.
[Related work] The related-work paragraph on correlational SQ algorithms cites the original definition but omits the year and full bibliographic details; adding the standard reference would improve traceability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the specific suggestions for clarifying the quantitative aspects of our lower bounds. We address each major comment below.

read point-by-point responses

Referee: [§3, Theorem 3.2] §3, Theorem 3.2 (cyclic case): the quantitative correlation bound is stated in a form that depends on the range of primes; the paper should verify that the implied constant is strong enough to produce a non-trivial learning lower bound (advantage o(1)) for the kernel and SQ settings when the input dimension is polynomial in the number of primes.

Authors: We agree that an explicit verification of the constants is useful. The correlation bound in Theorem 3.2 is of the form |⟨μ, χ⟩| ≪ (log log X / log X)^{1/2} (with the implied constant absolute and independent of the number of primes k, for X in the stated range). In the kernel and correlational SQ settings the advantage is at most this correlation (up to a universal factor). When the input dimension d is polynomial in k, one may take the modulus range up to X = exp(k^ε) for small ε > 0; the resulting correlation is then o(1) as k → ∞. We have added a short remark immediately after the statement of Theorem 3.2 that records this calculation and confirms the advantage tends to zero for the polynomial-dimension regime. revision: yes
Referee: [§5.1] §5.1, the reduction for noisy gradient methods: the argument maps small correlation to a lower bound on the excess risk, but the noise level and step-size regime are not explicitly matched to the correlation decay rate; a short calculation showing that the resulting lower bound remains non-trivial for the stated parameter ranges would strengthen the claim.

Authors: We thank the referee for this observation. The reduction in §5.1 shows that excess risk is at least Ω(ρ² / (η L + σ² / η)), where ρ is the correlation, L the Lipschitz constant of the loss, η the step size, and σ² the noise variance. Choosing η = Θ(ρ / L) and σ² = o(ρ²) (which is permitted in the parameter ranges of the theorem) yields excess risk Ω(ρ). Since ρ = o(1) by the correlation bound, the lower bound remains non-trivial. We have inserted this short calculation as a new paragraph in §5.1 of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The derivation proceeds by first establishing quantitative upper bounds on the correlation of the Möbius function with digital characters (cyclic groups for residues mod p, elementary abelian p-groups for base-p digits) using number-theoretic techniques, then invoking the standard learning-theoretic implication that sufficiently small correlation with a hypothesis class yields explicit lower bounds for kernel methods, noisy gradient descent, and correlational SQ algorithms. These correlation bounds are proved directly in the paper rather than fitted or defined from the learning results, and the implication step is an external, non-circular reduction. No self-definitional equations, renamed predictions, or load-bearing self-citations appear in the chain; the argument remains self-contained against independent benchmarks in analytic number theory and statistical query learning.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on quantitative bounds on correlations of the Möbius function with digital characters, which are asserted but not detailed or derived in the abstract; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption Standard properties of the Möbius and Liouville functions and their behavior under group characters
The lower bounds rely on these number-theoretic properties to establish the correlation bounds.

pith-pipeline@v0.9.0 · 5404 in / 1130 out tokens · 45836 ms · 2026-05-08T07:05:12.244848+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

On the non-universality of deep learning: quantifying the cost of symmetry, 2022

[ABA22] Emmanuel Abbe and Enric Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry, 2022. [ABAM24] Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks, 2024. [AS23] Emmanu...

work page 2022
[2]

Revised and with a preface by Hugh L

Springer-Verlag, New York, 3 edition, 2000. Revised and with a preface by Hugh L. Montgomery. [DMR20] Michael Drmota, Christian Mauduit, and Jo¨ el Rivat. Prime numbers in two bases.Duke Math- ematical Journal, 169(10):1809–1876, 2020. [Gre12] Ben Green. On (not) computing the mobius function using bounded depth circuits, 2012. [GRS20] Noah Golowich, Alex...

work page 2000

[1] [1]

On the non-universality of deep learning: quantifying the cost of symmetry, 2022

[ABA22] Emmanuel Abbe and Enric Boix-Adsera. On the non-universality of deep learning: quantifying the cost of symmetry, 2022. [ABAM24] Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks, 2024. [AS23] Emmanu...

work page 2022

[2] [2]

Revised and with a preface by Hugh L

Springer-Verlag, New York, 3 edition, 2000. Revised and with a preface by Hugh L. Montgomery. [DMR20] Michael Drmota, Christian Mauduit, and Jo¨ el Rivat. Prime numbers in two bases.Duke Math- ematical Journal, 169(10):1809–1876, 2020. [Gre12] Ben Green. On (not) computing the mobius function using bounded depth circuits, 2012. [GRS20] Noah Golowich, Alex...

work page 2000