The Ratchet Effect in Silico through Interaction-Driven Cumulative Intelligence in Large Language Models
Pith reviewed 2026-05-19 02:34 UTC · model grok-4.3
The pith
Small language models improve reasoning by verifying each other's solutions and updating parameters with the validated results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the POLIS framework, heterogeneous agents generate solutions to problems, verify one another's outputs, store only the validated artifacts in shared cultural memory, and internalize those artifacts through parameter updates. Populations of 1--4B-parameter models using this process record average gains of 8.8--18.9 points on mathematical reasoning benchmarks and reduce the performance gap to single 70B+ models. Mechanistic tests isolate peer verification as the primary operator that sustains the accumulation and show that internalization prevents loss of the retained knowledge between rounds.
What carries the argument
The POLIS multi-agent loop in which peer verification filters outputs before they enter shared memory and drive parameter internalization.
If this is right
- Models in the 1--4B range can reach performance levels closer to much larger single models solely through repeated cycles of generation, verification, retention, and internalization.
- Peer verification functions as the main mechanism that prevents loss of improvements and enables the ratchet-like accumulation.
- Internalization of validated artifacts into agent parameters sustains the gains from one round to the next rather than requiring constant external prompting.
- The overall approach treats structured interaction as an independent scaling dimension alongside increases in parameter count.
Where Pith is reading between the lines
- Similar interaction structures could be tested on non-mathematical tasks such as code generation or scientific hypothesis refinement to check whether the accumulation effect generalizes.
- Systems built this way might support continuous, open-ended refinement by maintaining an ongoing society of agents instead of periodic full retraining.
- The design raises the practical question of how verification quality must scale with task difficulty to avoid gradual contamination of the shared memory.
Load-bearing premise
Peer agents can reliably separate correct solutions from incorrect ones and keep only the good ones in the shared memory that later shapes updates.
What would settle it
Replace the peer verification step with random acceptance of solutions or with a process that systematically admits errors; if the reported performance gains then disappear or reverse across rounds, the central claim would be falsified.
Figures
read the original abstract
Human intelligence scales through cumulative cultural evolution (CCE), a ratchet process in which innovations are retained against entropic drift. Large language model training, by contrast, still depends primarily on static corpora and parameter growth, leaving little room for endogenous accumulation through interaction. We present POLIS (Population Orchestrated Learning and Inference Society), a framework in which heterogeneous agents generate solutions, verify one another's outputs, retain validated artifacts in shared cultural memory, and internalize them through parameter updates. On mathematical reasoning benchmarks, populations of 1--4B-parameter models achieved average gains of 8.8--18.9 points over base models and narrowed the gap to 70B+ monoliths. Mechanistic ablations identify peer verification as the main ratchet operator and show that internalization sustains accumulation across rounds, providing computational evidence that epistemic vigilance organizes durable knowledge growth. These results position structured social interaction as a scaling lever orthogonal to parameter count.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces POLIS, a multi-agent framework in which heterogeneous LLM agents generate solutions to mathematical reasoning tasks, perform peer verification, retain validated artifacts in a shared cultural memory, and internalize them via parameter updates. It reports that populations of 1--4B-parameter models achieve average benchmark gains of 8.8--18.9 points over base models and narrow the performance gap to 70B+ monolithic models, with mechanistic ablations identifying peer verification as the primary ratchet operator and internalization as the mechanism sustaining accumulation across rounds.
Significance. If the central mechanism is shown to operate as claimed, the work provides computational evidence that structured social interaction can serve as a scaling dimension orthogonal to parameter count, enabling cumulative knowledge growth in LLMs through an endogenous ratchet process analogous to cultural evolution. This would have implications for efficient training of smaller models and for understanding how epistemic vigilance organizes durable knowledge in artificial systems.
major comments (3)
- [§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8--18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.
- [§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.
- [§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.
minor comments (2)
- [§2] Notation for agent roles and memory structures is introduced without a consolidated table or diagram, making it difficult to track information flow across rounds.
- [Abstract and §4] The abstract and results section use the term 'average gains' without specifying whether this is mean across tasks, models, or runs, and without reporting variance or statistical significance.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and have revised the manuscript to incorporate additional analyses and controls where the original version was lacking.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Setup and Results): The central claim attributes the 8.8--18.9 point gains and gap-narrowing to peer verification acting as a reliable ratchet, yet no quantitative measurements of verification accuracy are reported (e.g., false-positive rate on incorrect solutions or false-negative rate on correct ones for the specific math reasoning tasks). Without these, it remains possible that observed gains arise from internalization of plausible but erroneous artifacts rather than cumulative improvement.
Authors: We agree that direct quantification of verification accuracy is necessary to support the ratchet interpretation. In the revised manuscript we add a new analysis in §4 reporting false-positive and false-negative rates of the peer-verification step on the GSM8K and MATH tasks. These measurements show acceptably low error rates (false-positive rate below 6 % on average), indicating that the observed gains are unlikely to result primarily from retention of erroneous artifacts. revision: yes
-
Referee: [§5] §5 (Ablation Studies): The ablations that remove verification are described, but they do not include a direct comparison of the error rate or cleanliness of the retained artifact set versus the raw generated set. This leaves open whether the retained memory is verifiably higher-quality or simply a filtered subset that could still propagate systematic errors.
Authors: We accept that a head-to-head error-rate comparison would strengthen the ablation results. The revised §5 now includes this comparison, showing that the retained artifact set has a substantially lower error rate than the raw generated set (approximately 35 % absolute reduction on average across tasks). This supports that peer verification produces a measurably cleaner memory rather than merely a filtered but still noisy subset. revision: yes
-
Referee: [§3] §3 (Mechanism Description): The internalization step is said to drive parameter updates from validated artifacts, but the manuscript provides no controls or measurements for data leakage or contamination between verification rounds and subsequent fine-tuning steps, which could inflate apparent cumulative gains.
Authors: We acknowledge the absence of explicit leakage controls in the original submission. The revised manuscript adds overlap tracking between verification artifacts and fine-tuning data across rounds together with a control condition that removes any overlapping items. Performance gains remain statistically significant under this stricter control, indicating that the cumulative improvement is not an artifact of data contamination. revision: yes
Circularity Check
No significant circularity; empirical gains measured on external benchmarks
full rationale
The paper describes an empirical multi-agent framework (POLIS) in which agents generate, verify, retain, and internalize solutions, then reports measured accuracy gains on mathematical reasoning benchmarks. No equations, fitted parameters, or self-citations are shown to reduce the reported 8.8–18.9 point gains to the input labeling process by construction. The verification-reliability assumption is a potential experimental-validity issue rather than a definitional or self-referential reduction in the derivation chain. The central claim therefore remains an independent empirical observation rather than a tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
, " * write output.state after.block = add.period write newline
ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...
-
[2]
" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...
-
[3]
Abouelenin, A.; Ashfaq, A.; Atkinson, A.; Awadalla, H.; Bach, N.; Bao, J.; Benhaim, A.; Cai, M.; Chaudhary, V.; Chen, C.; et al. 2025. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-loras. arXiv preprint arXiv:2503.01743
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[5]
Anderson, P. W. 1972. More Is Different: Broken symmetry and the nature of the hierarchical structure of science. Science, 177(4047): 393--396
work page 1972
-
[6]
Anil, R.; Pereyra, G.; Passos, A.; Ormandi, R.; Dahl, G. E.; and Hinton, G. E. 2018. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235
-
[7]
Anthropic, A. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 1: 1
work page 2024
-
[8]
D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901
work page 2020
-
[9]
Chen, A.; Li, A.; Gong, B.; Jiang, B.; Fei, B.; Yang, B.; Shan, B.; Yu, C.; Wang, C.; Zhu, C.; et al. 2025. MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention. arXiv preprint arXiv:2506.13585
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[10]
Chen, P.; Liu, S.; Zhao, H.; and Jia, J. 2021. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 5008--5017
work page 2021
-
[11]
Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[12]
Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Yang, A.; Fan, A.; et al. 2024. The llama 3 herd of models. arXiv e-prints, arXiv--2407
work page 2024
-
[13]
Frosst, N.; and Hinton, G. 2017. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784
work page internal anchor Pith review Pith/arXiv arXiv 2017
- [14]
-
[15]
Gou, J.; Yu, B.; Maybank, S. J.; and Tao, D. 2021. Knowledge distillation: A survey. International Journal of Computer Vision, 129(6): 1789--1819
work page 2021
-
[16]
Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
Hendrycks, D.; Burns, C.; Kadavath, S.; Arora, A.; Basart, S.; Tang, E.; Song, D.; and Steinhardt, J. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[18]
Herbrich, R.; Minka, T.; and Graepel, T. 2006. TrueSkill™: a Bayesian skill rating system. Advances in neural information processing systems, 19
work page 2006
-
[19]
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[20]
Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.; Cai, T.; Rutherford, E.; Casas, D. d. L.; Hendricks, L. A.; Welbl, J.; Clark, A.; et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[21]
Holland, J. H. 1992. Complex adaptive systems. Daedalus, 121(1): 17--30
work page 1992
-
[22]
Holland, J. H. 1995. Hidden order. Business Week-Domestic Edition, 21: 93--136
work page 1995
-
[23]
J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. 2022. Lora: Low-rank adaptation of large language models. ICLR, 1(2): 3
work page 2022
-
[24]
Hurst, A.; Lerer, A.; Goucher, A. P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.; Welihinda, A.; Hayes, A.; Radford, A.; et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Scaling Laws for Neural Language Models
Kaplan, J.; McCandlish, S.; Henighan, T.; Brown, T. B.; Chess, B.; Child, R.; Gray, S.; Radford, A.; Wu, J.; and Amodei, D. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[26]
Langton, C. G. 1997. Artificial life: An overview
work page 1997
-
[27]
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. nature, 521(7553): 436--444
work page 2015
- [28]
- [29]
-
[30]
Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[31]
Luo, H.; Sun, Q.; Xu, C.; Zhao, P.; Lou, J.-G.; Tao, C.; Geng, X.; Lin, Q.; Chen, S.; Tang, Y.; and Zhang, D. 2025. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct. In The Thirteenth International Conference on Learning Representations
work page 2025
-
[32]
MAA. 2024. American Invitational Mathematics Examination - AIME. In American Invitational Mathematics Examination - AIME
work page 2024
-
[33]
MAA. 2025. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
work page 2025
-
[34]
Minsky, M. 1986. Society of mind. Simon and Schuster
work page 1986
-
[35]
Mitchell, M. 2009. Complexity: A guided tour. Oxford university press
work page 2009
-
[36]
Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35: 27730--27744
work page 2022
-
[37]
Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36: 53728--53741
work page 2023
-
[38]
Rein, D.; Hou, B. L.; Stickland, A. C.; Petty, J.; Pang, R. Y.; Dirani, J.; Michael, J.; and Bowman, S. R. 2024. Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling
work page 2024
- [39]
-
[40]
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[41]
Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Team, G.; Kamath, A.; Ferret, J.; Pathak, S.; Vieillard, N.; Merhej, R.; Perrin, S.; Matejovicova, T.; Ram \'e , A.; Rivi \`e re, M.; et al. 2025 a . Gemma 3 technical report. arXiv preprint arXiv:2503.19786
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [43]
-
[44]
Team, Q. 2024. Qwen2 technical report. arXiv preprint arXiv:2412.15115
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[45]
Tong, Y.; Zhang, X.; Wang, R.; Wu, R.; and He, J. 2024. Dart-math: Difficulty-aware rejection tuning for mathematical problem-solving. Advances in Neural Information Processing Systems, 37: 7821--7846
work page 2024
-
[46]
Van der Hoek, W.; and Wooldridge, M. 2008. Multi-agent systems. Foundations of Artificial Intelligence, 3: 887--928
work page 2008
-
[47]
H.; Eickhout, B.; and Van Meijl, H
Verburg, P. H.; Eickhout, B.; and Van Meijl, H. 2008. A multi-scale, multi-model approach for analyzing the future dynamics of European land use. The annals of regional science, 42: 57--77
work page 2008
- [48]
-
[49]
Wei, J.; Tay, Y.; Bommasani, R.; Raffel, C.; Zoph, B.; Borgeaud, S.; Yogatama, D.; Bosma, M.; Zhou, D.; Metzler, D.; et al. 2022 a . Emergent abilities of large language models. arXiv preprint arXiv:2206.07682
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[50]
Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q. V.; Zhou, D.; et al. 2022 b . Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824--24837
work page 2022
-
[51]
Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[52]
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[53]
Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.; Cao, Y.; and Narasimhan, K. 2023. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36: 11809--11822
work page 2023
-
[54]
Ye, Y.; Huang, Z.; Xiao, Y.; Chern, E.; Xia, S.; and Liu, P. 2025. LIMO: Less is More for Reasoning. arXiv preprint arXiv:2502.03387
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
Zelikman, E.; Wu, Y.; Mu, J.; and Goodman, N. 2022. Star: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35: 15476--15488
work page 2022
-
[56]
Zhang, L.; Bao, C.; and Ma, K. 2021. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8): 4388--4403
work page 2021
-
[57]
A Survey of Large Language Models
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223, 1(2)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[58]
Zhuang, R.; Wang, B.; and Sun, S. 2025. Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping. arXiv preprint arXiv:2505.08392
work page internal anchor Pith review Pith/arXiv arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.