An Information-Theoretic Criterion for Efficient Data Synthesis
Pith reviewed 2026-05-20 22:13 UTC · model grok-4.3
The pith
Synthetic data improves models only when the generation-training loop is information-open, receiving external signals that add task-relevant information.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains.
What carries the argument
The distinction between information-open and information-closed generation-training loops: external signals determine whether task-relevant information can increase, and without them the data processing inequality says it can only decrease.
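The data processing inequality that carries this argument can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: the variable names, the table sizes, and the random joint distribution and channel are all assumptions. It builds a chain X → Y → Z and compares I(X;Y) with I(X;Z).

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    py = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0                        # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)

# Hypothetical joint distribution p(x, y): X is the task variable, Y the data.
p_xy = rng.random((4, 5))
p_xy /= p_xy.sum()

# Hypothetical channel p(z | y): each row is a distribution over Z.
p_z_given_y = rng.random((5, 3))
p_z_given_y /= p_z_given_y.sum(axis=1, keepdims=True)

# Z depends on X only through Y, so p(x, z) = sum_y p(x, y) p(z | y).
p_xz = p_xy @ p_z_given_y

i_xy = mutual_information(p_xy)
i_xz = mutual_information(p_xz)
print(f"I(X;Y) = {i_xy:.4f} bits, I(X;Z) = {i_xz:.4f} bits")
```

The inequality holds here because Z is generated from Y alone. Whether a real generation-training loop satisfies that Markov condition is exactly what the referee report questions.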
If this is right
- Closed-loop synthetic data pipelines must produce a net loss of task-relevant information and eventual performance collapse.
- Coarser signals such as binary correctness yield behaviors that generalize across tasks and domains because they do not specify particular surface forms.
- Learning converges to whichever signal component carries the highest information efficiency among those available.
- Reward hacking arises when a spurious pattern happens to be the simplest information-efficient component in the signal.
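The claim about coarse signals can be illustrated with a toy update rule. The sketch below is an assumption-laden illustration, not the paper's training procedure: `verifier_update`, the mixture step size, the five-output distribution, and the acceptance mask are all hypothetical. A binary verifier raises the total probability of acceptable outputs while leaving their relative proportions untouched, which is one way to read "treats all acceptable outputs as equivalent".

```python
import numpy as np

def verifier_update(p, accepted, step=0.5):
    """One toy open-loop update: move p toward p conditioned on acceptance."""
    p = np.asarray(p, dtype=float)
    filtered = p * accepted        # zero out outputs the verifier rejects
    filtered /= filtered.sum()     # renormalise over accepted outputs
    return (1.0 - step) * p + step * filtered

# Hypothetical model distribution over five candidate outputs.
p = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
# Hypothetical binary verifier: outputs 2 and 3 are acceptable, the rest are not.
accepted = np.array([0, 0, 1, 1, 0], dtype=float)

history = [p]
for _ in range(5):
    history.append(verifier_update(history[-1], accepted))

# Total probability on acceptable outputs after each update.
mass_on_accepted = [float(h @ accepted) for h in history]
# Ratio between the two acceptable outputs after each update.
ratio_within_accepted = [float(h[2] / h[3]) for h in history]
```

In this toy, `mass_on_accepted` rises toward 1 while `ratio_within_accepted` stays fixed, because the signal carries no information about which acceptable output to prefer.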
Where Pith is reading between the lines
- The open-versus-closed distinction offers a diagnostic for why many self-training methods degrade without independent feedback mechanisms.
- Pipeline designers could prioritize adding minimal external verifiers that supply just enough new information to keep loops open.
- The preference for efficient signal components may extend to explaining shortcut learning in supervised settings without synthetic data.
- Measuring mutual information between generated data and task objectives before and after training loops could directly test the account.
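The last item suggests comparing mutual information before and after a training loop. A minimal sketch of a plug-in estimator for paired discrete samples is given below. It assumes the task objective and the generated data can both be reduced to discrete labels, which is a strong simplification. The function name and the example data are hypothetical, and plug-in estimates are known to be biased for small samples.

```python
from collections import Counter
from math import log2

def plugin_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # empirical joint counts
    cx = Counter(xs)               # empirical marginal counts of X
    cy = Counter(ys)               # empirical marginal counts of Y
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (cx[x] * cy[y]).
    return sum(
        (c / n) * log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical task labels and two sets of generated labels.
task = [0, 1] * 50
generated_before = list(task)      # generated data tracks the task exactly
generated_after = [0] * 100        # generated data has collapsed to one value

mi_before = plugin_mutual_information(task, generated_before)
mi_after = plugin_mutual_information(task, generated_after)
print(mi_before, mi_after)
```

For real text outputs, a learned or binned representation would be needed before an estimator of this kind applies.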
Load-bearing premise
The classification of generation-training loops as information-open versus information-closed is sufficient to determine whether task-relevant information increases or decreases, with the data processing inequality applying directly to the overall loop.
What would settle it
Conduct a closed-loop experiment that generates and trains repeatedly on the model's own outputs with no external verifier or rubric, then check whether accuracy on a held-out test set measuring task-relevant information steadily declines over iterations.
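A toy version of such a closed-loop experiment can be simulated with a categorical model that is repeatedly refit on its own samples. This sketch is not the paper's experiment: the function `closed_loop`, the sample size, and the use of support size as a stand-in for held-out accuracy are all assumptions made for illustration.

```python
import numpy as np

def closed_loop(p0, n_samples, n_iters, seed=0):
    """Repeatedly sample from a categorical model and refit it on the samples."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p0, dtype=float)
    supports = [int(np.count_nonzero(p))]
    for _ in range(n_iters):
        counts = rng.multinomial(n_samples, p)   # generate from the current model
        p = counts / n_samples                   # maximum-likelihood refit, no external signal
        supports.append(int(np.count_nonzero(p)))
    return p, supports

# Hypothetical starting point: a uniform model over eight outputs.
final_p, supports = closed_loop(np.full(8, 0.125), n_samples=20, n_iters=50)
print(supports)
```

Once an output receives zero samples it can never return, so the support size cannot grow in this toy. An experiment on a real model would track held-out accuracy rather than support size.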
Original abstract
Synthetic data becomes crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that synthetic data improves LLM performance only in information-open generation-training loops, where external signals (verifiers, environments, rubrics) inject task-relevant information beyond the model's current distribution; in information-closed loops relying on the model's own outputs, the data processing inequality guarantees monotonic decrease in task-relevant mutual information, predicting collapse. Among open loops, coarser supervision (e.g., binary correctness) yields better generalization because it is not tied to specific surface forms. The guiding thesis is that learning converges to the most information-efficient signal component available.
Significance. If the result holds, the work supplies a principled information-theoretic lens for predicting and avoiding synthetic-data collapse, explaining empirical inconsistencies, and guiding pipeline design toward external signals and coarse supervision. It could unify observations across self-training, RLHF, and synthetic-data methods while highlighting the role of meta-level supervision efficiency.
major comments (2)
- [Abstract / theoretical argument on closed loops] The central claim applies the data processing inequality directly to the composite iterative closed loop (abstract and theoretical argument), treating it as a single channel that monotonically decreases task-relevant mutual information. However, each training step updates model parameters, so the next generation is performed by a different distribution; the overall map is not a fixed Markov chain X→Y→Z. A proof that the iterative operator still obeys the DPI bound for task-relevant information is required, as the standard inequality does not automatically extend to this setting.
- [Introduction and definitions of open/closed loops] The distinction between information-open and information-closed loops is load-bearing for the main thesis, yet the manuscript supplies no formal definitions, quantitative criteria, or measurable quantities (e.g., mutual information thresholds or external-signal injection rates) that would allow classification of concrete pipelines or falsification of the predictions.
minor comments (1)
- [Theoretical framework] Notation for mutual information and task-relevant quantities should be introduced explicitly with symbols and units early in the theoretical section to improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed comments on our manuscript. These points help clarify the presentation of our information-theoretic framework. We address each major comment below, indicating the revisions we will make to the manuscript.
Point-by-point responses
- Referee: [Abstract / theoretical argument on closed loops] The central claim applies the data processing inequality directly to the composite iterative closed loop (abstract and theoretical argument), treating it as a single channel that monotonically decreases task-relevant mutual information. However, each training step updates model parameters, so the next generation is performed by a different distribution; the overall map is not a fixed Markov chain X→Y→Z. A proof that the iterative operator still obeys the DPI bound for task-relevant information is required, as the standard inequality does not automatically extend to this setting.
Authors: We acknowledge that the iterative nature of the closed loop, with parameter updates after each training step, means the overall process is not a fixed Markov chain, and thus the standard DPI does not apply directly. Our argument relies on the intuition that without external signals, no new task-relevant information is introduced at any step. To make this rigorous, we will add a dedicated subsection in the revised manuscript that defines the iterative generation-training operator and provides a proof that task-relevant mutual information is non-increasing in information-closed loops. This will draw on concepts from adaptive information processing and show that the composition cannot increase relevant information. revision: yes
- Referee: [Introduction and definitions of open/closed loops] The distinction between information-open and information-closed loops is load-bearing for the main thesis, yet the manuscript supplies no formal definitions, quantitative criteria, or measurable quantities (e.g., mutual information thresholds or external-signal injection rates) that would allow classification of concrete pipelines or falsification of the predictions.
Authors: We agree that formal definitions and criteria are necessary to make our framework operational and falsifiable. In the revision, we will add a new section early in the paper that formally defines information-open and information-closed loops. A loop is information-closed if the external signal S satisfies I(task; S | current_model) = 0, meaning no additional task-relevant information is injected. We will also propose measurable proxies, such as the mutual information between the external signal and the task labels, to classify pipelines and enable empirical validation of our predictions. revision: yes
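The closedness criterion in this response can be related to a one-step bound by the chain rule for mutual information. The derivation below is a sketch under stated assumptions, not the proof the authors promise: X denotes the task variable, Z_t the full loop state (model and data) at iteration t, and S the external signal, and Z_{t+1} is assumed to be produced from (Z_t, S) and independent randomness only.

```latex
% Assumption: X -> (Z_t, S) -> Z_{t+1} is a Markov chain.
\begin{align}
I(X; Z_{t+1}) &\le I(X; Z_t, S) && \text{(data processing inequality)} \\
              &= I(X; Z_t) + I(X; S \mid Z_t) && \text{(chain rule)}
\end{align}
% Information-closed case: I(X; S \mid Z_t) = 0, hence I(X; Z_{t+1}) \le I(X; Z_t).
```

Under this reading, the term I(X; S | Z_t) is the only route by which task-relevant information can grow in one step. Whether the Markov assumption holds for a given training pipeline is a separate question that the sketch does not settle.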
Circularity Check
No circularity: derivation applies external DPI to loop classification
full rationale
The paper's central thesis applies the standard data processing inequality from information theory to information-closed generation-training loops, predicting monotonic decrease in task-relevant mutual information and model collapse. This rests on an external, independently established theorem rather than any quantities defined in terms of the paper's own fitted parameters, self-referential definitions, or load-bearing self-citations. The open/closed loop distinction is a conceptual framing that does not reduce the claimed outcome to the inputs by construction; the derivation chain remains self-contained and draws its force from the cited information-theoretic result.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Data processing inequality: processing a random variable cannot increase the mutual information it shares with another variable.
invented entities (2)
- information-open loop: no independent evidence
- information-closed loop: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "When the loop is information-closed ... the data processing inequality ensures that task-relevant information can only decrease"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "X → (Z_t, S) → Z_{t+1} and I(X; Z_{t+1}) ≤ I(X; Z_t, S)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.