NOVA: Fundamental Limits of Knowledge Discovery Through AI

Ken Duffy; Muriel M\'edard; Salman Avestimehr

arxiv: 2605.15219 · v1 · pith:ZIF2GF2Snew · submitted 2026-05-12 · 💻 cs.AI · cs.IT· math.IT

NOVA: Fundamental Limits of Knowledge Discovery Through AI

Salman Avestimehr , Ken Duffy , Muriel M\'edard This is my paper

Pith reviewed 2026-05-19 17:13 UTC · model grok-4.3

classification 💻 cs.AI cs.ITmath.IT

keywords knowledge discoveryAI self-improvementscaling lawsZipf lawcontamination trapadaptive samplingdiminishing returns

0 comments

The pith

Under a Zipf-law assumption on discovery probabilities, the cost to gather D new AI discoveries grows as D to the power alpha.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the NOVA framework for modeling how AI systems improve themselves by generating candidates, verifying them, accumulating the valid ones, and retraining. It establishes conditions under which this loop can exhaust a finite set of genuine knowledge and describes the specific ways the process can fail when those conditions are not met. The main theorem shows that if the probabilities of hitting new valid items match the tail of a Zipf distribution with exponent alpha greater than one, then the total generation cost to reach D distinct discoveries is on the order of D to the alpha. This result highlights why progress slows dramatically after the initial phase. It further notes that small verification errors can cause the system to fill up with invalid items instead of true discoveries.

Core claim

We prove that the cumulative generation cost required to obtain D distinct genuine discoveries satisfies R_cum(D) = Theta(c_gen D^alpha) under a tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent alpha greater than 1. This scaling quantifies the asymptotic diminishing returns in the knowledge discovery process.

What carries the argument

The tail-equivalence assumption that connects the model's effective discovery distribution to a Zipf law with exponent alpha greater than 1, allowing derivation of the power-law cost scaling.

If this is right

Accumulated genuine knowledge will eventually cover the entire finite domain when sufficient conditions hold.
Violations of the conditions produce failure modes including contamination, forgetting, exploration failure, and acceptance failure.
Imperfect verification creates a contamination trap in which invalid artifacts enter the knowledge base faster than genuine discoveries as easy items are exhausted.
Good-Turing estimation acts only as a local batch-diversity diagnostic and does not estimate the historically undiscovered valid mass.
Expert human input provides the greatest amplification when guiding around autonomous exploration barriers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The scaling law implies that self-improving AI systems will face increasingly prohibitive costs without external intervention.
Similar cost analyses could apply to other iterative discovery processes in science or engineering.
Empirical tests could involve logging discovery rates in large language models to estimate the effective alpha value.

Load-bearing premise

The model's effective discovery distribution is tail-equivalent to a Zipf law with an exponent alpha strictly greater than 1.

What would settle it

Measuring the frequency of new valid discoveries over many iterations in an actual AI system and verifying whether the cumulative cost curve matches the predicted Theta(D^alpha) scaling for some alpha > 1.

Figures

Figures reproduced from arXiv: 2605.15219 by Ken Duffy, Muriel M\'edard, Salman Avestimehr.

**Figure 1.** Figure 1: Local contamination trap. The curves show the fraction of newly accepted artifacts that are invalid, f marg t ≈ δtUt/(rtMnew t + δtUt), as a function of the false-positive rate δt. As the new-valid mass Mnew t decreases, even small false-positive rates can make invalid artifacts dominate the accepted increments. 0.0 0.1 0.2 0.3 0.4 0.5 False-positive rate δt 0.0 0.2 0.4 0.6 0.8 1.0 Marginal contamination … view at source ↗

read the original abstract

Can AI systems discover genuinely new knowledge through iterative self improvement, and if so, at what cost? We introduce the NOVA framework, which models the common ``generate, verify, accumulate, retrain'' loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good--Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $\alpha>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=\Theta(c_{\mathrm{gen}}D^\alpha)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames AI self-improvement as an adaptive sampling loop, flags a contamination trap, and derives a power-law cost scaling under an added Zipf-tail assumption, but does not show that retraining preserves the tail.

read the letter

The core contribution is a model of the generate-verify-accumulate-retrain cycle as sampling over a knowledge space, plus explicit conditions that lead to coverage or to one of four failure modes. They also isolate a contamination trap where imperfect verification lets invalid items accumulate faster once easy discoveries are exhausted. The clarification that Good-Turing only measures local batch diversity, not the remaining valid mass, is a useful distinction from prior estimation work. Under the separate tail-equivalence assumption they then prove the cumulative generation cost scales as Theta(c_gen D^alpha) for alpha > 1. That scaling is the quantitative claim most readers will notice.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the NOVA framework modeling the generate-verify-accumulate-retrain loop as adaptive sampling over a knowledge space. It states sufficient conditions for eventual coverage of a finite domain and identifies four failure modes (contamination, forgetting, exploration failure, acceptance failure). It analyzes imperfect verification, deriving a contamination trap in which invalid artifacts accumulate faster than genuine discoveries once easy knowledge is exhausted. Good-Turing estimation is positioned as a local batch-diversity diagnostic rather than an estimator of undiscovered valid mass. Under a separate tail-equivalence assumption that the effective discovery distribution remains Zipf with fixed exponent alpha > 1, the paper proves that cumulative generation cost satisfies R_cum(D) = Theta(c_gen D^alpha). The work concludes by formalizing human amplification via guidance, generation, and verification.

Significance. If the scaling result holds under the stated assumptions, the paper supplies a useful asymptotic characterization of diminishing returns in iterative AI knowledge discovery, which could guide expectations for self-improving systems. The explicit catalog of failure modes and the contamination-trap analysis provide conceptual tools for loop design. Credit is given for the formal adaptive-sampling treatment and for clearly labeling the tail-equivalence assumption as external rather than derived from the NOVA definitions.

major comments (2)

[Scaling-law derivation] The derivation of R_cum(D) = Theta(c_gen D^alpha) (the paragraph stating the scaling result) invokes tail-equivalence to a fixed Zipf law with exponent alpha > 1 but supplies no argument that the retraining step on accumulated genuine discoveries leaves this exponent invariant. Because the NOVA loop explicitly updates the model, the effective discovery distribution could concentrate on already-covered regions or interact with contamination dynamics, allowing alpha to drift with D and rendering the fixed-exponent integral inapplicable. This invariance is load-bearing for the central asymptotic claim.
[Statement of the tail-equivalence assumption] Alpha appears as a free parameter inside the tail-equivalence assumption without any procedure for independent measurement or prediction from the model or data. Consequently the scaling R_cum(D) = Theta(c_gen D^alpha) is not forced by the framework's own definitions but is shaped by the external distributional choice, weakening the claim that the cost law is an intrinsic consequence of the NOVA process.

minor comments (2)

[Abstract] The abstract lists the four failure modes but does not briefly characterize each; adding one sentence per mode would improve immediate readability.
[Cost-model paragraph] Notation for the per-candidate generation cost c_gen is introduced without an explicit cross-reference to its definition in the cost model; a forward pointer would reduce ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential utility of the scaling characterization and failure mode analysis. We address the two major comments below. We agree that the presentation of the tail-equivalence assumption and its implications for the scaling law can be strengthened, and we will revise the manuscript to clarify these points without altering the core claims.

read point-by-point responses

Referee: [Scaling-law derivation] The derivation of R_cum(D) = Theta(c_gen D^alpha) (the paragraph stating the scaling result) invokes tail-equivalence to a fixed Zipf law with exponent alpha > 1 but supplies no argument that the retraining step on accumulated genuine discoveries leaves this exponent invariant. Because the NOVA loop explicitly updates the model, the effective discovery distribution could concentrate on already-covered regions or interact with contamination dynamics, allowing alpha to drift with D and rendering the fixed-exponent integral inapplicable. This invariance is load-bearing for the central asymptotic claim.

Authors: We acknowledge the validity of this observation. The manuscript presents the tail-equivalence as a separate assumption on the effective discovery distribution and does not derive or prove that retraining preserves a fixed exponent alpha. In the current analysis, we assume that the distribution remains Zipf-equivalent with constant alpha throughout the process to obtain the asymptotic result. However, as the referee notes, model updates could in principle alter the tail behavior. We will revise the relevant section to explicitly note that the fixed-alpha assumption is an idealization and to discuss potential mechanisms (such as sufficient diversity in genuine discoveries) that might approximately maintain it. We will also add a brief discussion of alpha drift as a direction for future investigation. This revision will make the scope of the result clearer. revision: partial
Referee: [Statement of the tail-equivalence assumption] Alpha appears as a free parameter inside the tail-equivalence assumption without any procedure for independent measurement or prediction from the model or data. Consequently the scaling R_cum(D) = Theta(c_gen D^alpha) is not forced by the framework's own definitions but is shaped by the external distributional choice, weakening the claim that the cost law is an intrinsic consequence of the NOVA process.

Authors: We agree that alpha is an external parameter in the assumption and that the scaling result is conditional upon it rather than being derived purely from the NOVA loop definitions. The intent of the section is to show the consequences for cumulative cost under a realistic heavy-tailed discovery model, not to claim that NOVA itself forces a particular alpha. We will revise the manuscript to more prominently state that the scaling law holds under this additional assumption and to avoid language that might suggest it is an intrinsic property of the framework alone. This should address the concern about the strength of the claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; scaling derived conditionally from separate external assumption

full rationale

The paper presents the NOVA framework as an adaptive sampling model over knowledge space, identifies sufficient conditions for covering a finite domain, analyzes failure modes including contamination, and clarifies Good-Turing usage. The scaling claim R_cum(D) = Theta(c_gen D^alpha) is explicitly introduced under a separate tail-equivalence assumption to a Zipf law with alpha > 1; the text states this assumption is invoked to derive the asymptotic cost rather than being obtained from the loop definitions, retraining dynamics, or any fitted parameter. No self-definitional reduction, fitted-input prediction, load-bearing self-citation, uniqueness theorem from prior author work, or ansatz smuggling is exhibited. The derivation chain remains self-contained once the external distributional assumption is granted, with no evidence that the result is equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central scaling claim rests on one key distributional assumption and treats the per-candidate generation cost as an exogenous constant.

free parameters (2)

alpha
Exponent of the Zipf law that governs the tail of the discovery distribution; required for the Theta(D^alpha) scaling.
c_gen
Per-candidate generation cost treated as a fixed multiplier in the cumulative cost expression.

axioms (1)

domain assumption Tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent alpha greater than 1
Invoked explicitly to obtain the asymptotic form of R_cum(D).

pith-pipeline@v0.9.0 · 5794 in / 1348 out tokens · 35476 ms · 2026-05-19T17:13:38.327509+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I

doi: 10.1109/ISIT.2018.8437620. Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I. Ohannessian. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications.Bernoulli, 23(1):249–287,

work page doi:10.1109/isit.2018.8437620 2018
[2]

Sharp thresholds for high-dimensional and noisy sparsity re- covery usingℓ 1-constrained quadratic programming (Lasso)

doi: 10.1109/TIT. 2024.3440661. Prafulla Chandra, Andrew Thangaraj, and Nived Rajaraman. How good is good-turing for markov samples?Transactions on Machine Learning Research,

work page doi:10.1109/tit 2024
[3]

Bradley Efron and Ronald Thisted

doi: 10.1109/TSP.2022.3186176. Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know?Biometrika, 63(3):435–447,

work page doi:10.1109/tsp.2022.3186176 2022
[4]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language model- ing.arXiv preprint arXiv:2308.08998,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

org/abs/2509.07355

URLhttps://arxiv. org/abs/2509.07355. Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žuži ´c, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Za- havy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercur...

work page arXiv
[6]

Nature , author =

doi: 10.1038/s41586-021-03819-2. Samuel Karlin. Central limit theorems for certain infinite urn schemes.Journal of Mathematics and Mechanics, 17(4):373–401,

work page doi:10.1038/s41586-021-03819-2
[7]

Mathematical methods and human thought in the age of AI.arXiv preprint arXiv:2603.26524, March

Tanya Klowden and Terence Tao. Mathematical methods and human thought in the age of AI.arXiv preprint arXiv:2603.26524, March

work page arXiv
[8]

Amichai Painsky

doi: 10.1109/ITW48936.2021.9611389. Amichai Painsky. Convergence guarantees for the Good-Turing estimator.Journal of Machine Learning Research, 23(279):1–37,

work page doi:10.1109/itw48936.2021.9611389 2021
[9]

URLhttps:// arxiv.org/abs/2604.05057. Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yux- uan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subg...

work page internal anchor Pith review Pith/arXiv arXiv
[10]

2021.9518169

doi: 10.1109/ISIT45174. 2021.9518169. 11 Geoffrey Wolfer and Aryeh Kontorovich. Statistical estimation of ergodic Markov chain kernel over discrete state space.Bernoulli, 27(1):532–553,

work page doi:10.1109/isit45174 2021
[11]

Huajian Xin, Z.Z

doi: 10.3150/20-BEJ1248. Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and Monte-Carlo tree search. InICLR,

work page doi:10.3150/20-bej1248
[12]

A language modelM t generates candidate proofs in a formal lan- guage (e.g., Lean 4)

12 A Motivating Examples Example 1: Discovering new mathematical proofs.ConsiderKas the set of valid formal proofs of mathematical conjectures. A language modelM t generates candidate proofs in a formal lan- guage (e.g., Lean 4). The verification step is performed by a proof assistant: the Lean type-checker mechanically checks whether each candidate is a ...

work page 2025
[13]

, XN be i.i.d

and McAllester and Schapire [2000]).LetX 1, . . . , XN be i.i.d. from a discrete distributionQ, and let MN = X x Q(x)1[x /∈ {X1, . . . , XN }] be the missing mass after the batch. The Good–Turing estimatorf 1/N, wheref 1 is the number of species observed exactly once, is the classical estimator forM N , the probability that the next draw belongs to a spec...

work page 2000

[1] [1]

Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I

doi: 10.1109/ISIT.2018.8437620. Anna Ben-Hamou, Stéphane Boucheron, and Mesrob I. Ohannessian. Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications.Bernoulli, 23(1):249–287,

work page doi:10.1109/isit.2018.8437620 2018

[2] [2]

Sharp thresholds for high-dimensional and noisy sparsity re- covery usingℓ 1-constrained quadratic programming (Lasso)

doi: 10.1109/TIT. 2024.3440661. Prafulla Chandra, Andrew Thangaraj, and Nived Rajaraman. How good is good-turing for markov samples?Transactions on Machine Learning Research,

work page doi:10.1109/tit 2024

[3] [3]

Bradley Efron and Ronald Thisted

doi: 10.1109/TSP.2022.3186176. Bradley Efron and Ronald Thisted. Estimating the number of unseen species: How many words did Shakespeare know?Biometrika, 63(3):435–447,

work page doi:10.1109/tsp.2022.3186176 2022

[4] [4]

Reinforced Self-Training (ReST) for Language Modeling

Caglar Gulcehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (ReST) for language model- ing.arXiv preprint arXiv:2308.08998,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

org/abs/2509.07355

URLhttps://arxiv. org/abs/2509.07355. Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z. Horváth, Goran Žuži ´c, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, Ottavia Bertolli, Tom Za- havy, Amol Mandhane, Jessica Yung, Iuliya Beloshapka, Borja Ibarz, Vivek Veeriah, Lei Yu, Oliver Nash, Paul Lezeau, Salvatore Mercur...

work page arXiv

[6] [6]

Nature , author =

doi: 10.1038/s41586-021-03819-2. Samuel Karlin. Central limit theorems for certain infinite urn schemes.Journal of Mathematics and Mechanics, 17(4):373–401,

work page doi:10.1038/s41586-021-03819-2

[7] [7]

Mathematical methods and human thought in the age of AI.arXiv preprint arXiv:2603.26524, March

Tanya Klowden and Terence Tao. Mathematical methods and human thought in the age of AI.arXiv preprint arXiv:2603.26524, March

work page arXiv

[8] [8]

Amichai Painsky

doi: 10.1109/ITW48936.2021.9611389. Amichai Painsky. Convergence guarantees for the Good-Turing estimator.Journal of Machine Learning Research, 23(279):1–37,

work page doi:10.1109/itw48936.2021.9611389 2021

[9] [9]

URLhttps:// arxiv.org/abs/2604.05057. Z.Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z.F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yux- uan Liu, Wenjun Gao, Daya Guo, and Chong Ruan. DeepSeek-Prover-V2: Advancing formal mathematical reasoning via reinforcement learning for subg...

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

2021.9518169

doi: 10.1109/ISIT45174. 2021.9518169. 11 Geoffrey Wolfer and Aryeh Kontorovich. Statistical estimation of ergodic Markov chain kernel over discrete state space.Bernoulli, 27(1):532–553,

work page doi:10.1109/isit45174 2021

[11] [11]

Huajian Xin, Z.Z

doi: 10.3150/20-BEJ1248. Huajian Xin, Z.Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Haowei Zhang, Qihao Zhu, Dejian Yang, Zhibin Gou, Z.F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing proof assistant feedback for reinforcement learning and Monte-Carlo tree search. InICLR,

work page doi:10.3150/20-bej1248

[12] [12]

A language modelM t generates candidate proofs in a formal lan- guage (e.g., Lean 4)

12 A Motivating Examples Example 1: Discovering new mathematical proofs.ConsiderKas the set of valid formal proofs of mathematical conjectures. A language modelM t generates candidate proofs in a formal lan- guage (e.g., Lean 4). The verification step is performed by a proof assistant: the Lean type-checker mechanically checks whether each candidate is a ...

work page 2025

[13] [13]

, XN be i.i.d

and McAllester and Schapire [2000]).LetX 1, . . . , XN be i.i.d. from a discrete distributionQ, and let MN = X x Q(x)1[x /∈ {X1, . . . , XN }] be the missing mass after the batch. The Good–Turing estimatorf 1/N, wheref 1 is the number of species observed exactly once, is the classical estimator forM N , the probability that the next draw belongs to a spec...

work page 2000