Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

Aneesh Pappu; Federico Bianchi; James Zou; Yongchan Kwon

arXiv: 2606.10402 · v1 · pith:KGGEIJSQ · submitted 2026-06-09 · cs.CL · cs.AI

Pith reviewed 2026-06-27 13:08 UTC · model grok-4.3

keywords: collective AI agents · decentralized discovery · mathematical problems · open research platform · kissing number · verifiers · state-of-the-art results · agent interaction

The pith

AI agents on an open platform have produced 12 new state-of-the-art mathematical results through collective interaction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EinsteinArena as a platform where language-model agents access live open mathematical problems, each with a verifier, leaderboard, and discussion forum. Agents have used it to reach 12 results better than any prior human or AI solution, including raising the lower bound on the kissing number in dimension 11 from 593 to 604. These gains emerged from sequences of submissions, public discussions, verifier updates, and agents borrowing ideas from one another rather than from isolated runs. A sympathetic reader would care because the setup shows how open, decentralized interaction among autonomous agents can support scientific progress on problems that reward cumulative effort over long horizons.

Core claim

EinsteinArena is an agent-native platform for open distributed research and discovery that supplies agents with a live set of open problems, each equipped with a solid verifier, public leaderboard, and problem-specific discussion forum. As of May 2026, agents on the platform have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One example is the kissing number problem in dimension 11, where the best known lower bound improved from 593 to 604. These advances did not arise from single agents or isolated runs but from sequences of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas, providing evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents.

What carries the argument

EinsteinArena, the agent-native platform that gives agents live open problems together with solid verifiers, public leaderboards, and discussion forums for sharing insights.
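
To make these ingredients concrete, here is a minimal Python sketch of how one problem's verifier and leaderboard could fit together, assuming a verifier that returns a score for a valid solution and `None` otherwise. The `Problem` and `Submission` classes and the toy task are hypothetical illustrations, not EinsteinArena's actual interface, and the discussion forum is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Submission:
    agent: str
    solution: object
    score: float


@dataclass
class Problem:
    """Hypothetical model of one arena problem: a verifier gating a public leaderboard."""

    name: str
    verifier: Callable[[object], Optional[float]]  # score if valid, None if rejected
    maximize: bool = True
    leaderboard: List[Submission] = field(default_factory=list)

    def submit(self, agent: str, solution: object) -> Optional[Submission]:
        score = self.verifier(solution)
        if score is None:
            return None  # rejected submissions never reach the leaderboard
        entry = Submission(agent, solution, score)
        self.leaderboard.append(entry)
        # Keep the board sorted best-first in the problem's optimization direction.
        self.leaderboard.sort(key=lambda s: s.score, reverse=self.maximize)
        return entry

    def best(self) -> Optional[Submission]:
        return self.leaderboard[0] if self.leaderboard else None


# Hypothetical toy task: find the largest integer x with x * x <= 50.
toy = Problem("toy", verifier=lambda x: x if x * x <= 50 else None)
toy.submit("agent_a", 5)
toy.submit("agent_b", 8)  # 64 > 50, so the verifier rejects it
toy.submit("agent_c", 7)
print(toy.best().agent)  # agent_c
```

Because ranking happens only after the verifier accepts a submission, a leaderboard built this way can be no more trustworthy than its verifier.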

If this is right

Collective sequences of agent submissions and discussions can produce results beyond what any single agent achieves on the same problems.
Agent-to-agent borrowing of ideas across public forums accelerates progress on mathematical tasks with long time horizons.
Community use of verifiers can lead to their refinement and more precise tracking of advances.
Decentralized open interaction among agents supports sustained collaboration on unsolved problems.
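
The verifier-refinement point can be illustrated with a small sketch: when a verifier is tightened, every stored submission is re-checked and the ranking rebuilt. The toy task (maximize x subject to x² ≤ 2), the tolerance values, and the submissions are all hypothetical; they only show how a loose tolerance can let an infeasible entry top the board until the verifier is refined.

```python
def rescore(submissions, verifier, maximize=True):
    """Re-run `verifier` on every stored (agent, solution) pair and rebuild the ranking.

    Submissions the verifier rejects (returns None) are dropped from the board.
    """
    ranked = []
    for agent, solution in submissions:
        score = verifier(solution)
        if score is not None:
            ranked.append((agent, score))
    ranked.sort(key=lambda pair: pair[1], reverse=maximize)
    return ranked


def make_verifier(tol):
    """Hypothetical toy task: maximize x subject to x * x <= 2, checked with tolerance `tol`."""
    def verify(x):
        return x if x * x <= 2 + tol else None
    return verify


submissions = [("A", 1.4142), ("B", 1.41421356), ("C", 1.4142139)]  # C violates x * x <= 2
loose = rescore(submissions, make_verifier(1e-6))
strict = rescore(submissions, make_verifier(0.0))
print([agent for agent, _ in loose])   # ['C', 'B', 'A']
print([agent for agent, _ in strict])  # ['B', 'A']
```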

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same platform structure could be tested on other domains if equivalent unambiguous verifiers can be defined.
Human researchers could monitor the public discussion forums to extract useful ideas or add their own contributions.
Increasing the number of participating agents might produce faster cumulative gains on the same set of problems.

Load-bearing premise

The platform verifiers supply unambiguous and reliable measurements that confirm the reported results are genuine improvements over all earlier human and AI solutions.

What would settle it

An independent audit that re-runs the verifiers on the 12 claimed results and finds that at least one does not exceed the previous best known bound or solution.

Figures

Figures reproduced from arXiv: 2606.10402 by Aneesh Pappu, Federico Bianchi, James Zou, Yongchan Kwon.

**Figure 2.** Best known lower bounds for the kissing number in dimension 11. The record stood …

**Figure 3.** Discussions on the EinsteinArena platform exemplify how agents ask questions and build …

**Figure 4.** Conversation topic distribution in the kissing number problem. Agents discuss a wide …

**Figure 5.** Solution lineage of the kissing number problem. Arrows denote parental lineage, which …

**Figure 6.** Solution lineage for the second autocorrelation inequality problem. Arrows denote …

**Figure 3.** A few recent works have moved further in this direction. CORAL [21] deploys multiple agents through shared persistent memory, but within a single orchestrated run rather than a shared public platform, in addition to using homogeneous agent teams. AgentRxiv [22] allows independent agents to share research reports through a centralized preprint server, showing possible improvements on benchmarks. EinsteinAre…
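
Figures 5 and 6 describe solution lineage, with arrows denoting parental lineage. Assuming each submission records the IDs of the submissions it built on (a hypothetical schema; the paper's data format is not shown here), the ancestors of a record and the set of agents who contributed to it can be recovered with a breadth-first traversal. The IDs and agent names below are invented for illustration.

```python
from collections import deque


def ancestors(parents, node):
    """Every submission reachable from `node` by following parent edges (breadth-first)."""
    seen = set()
    queue = deque(parents.get(node, ()))
    while queue:
        current = queue.popleft()
        if current not in seen:  # the seen-set also guards against accidental cycles
            seen.add(current)
            queue.extend(parents.get(current, ()))
    return seen


def contributing_agents(parents, author, node):
    """Agents who authored `node` or any of its ancestors."""
    return {author[s] for s in ancestors(parents, node) | {node}}


# Hypothetical toy lineage, not data from the paper: s4 borrows from both s2 and s3.
parents = {"s2": ["s1"], "s3": ["s1"], "s4": ["s2", "s3"]}
author = {"s1": "agent_a", "s2": "agent_b", "s3": "agent_c", "s4": "agent_a"}
print(sorted(ancestors(parents, "s4")))                    # ['s1', 's2', 's3']
print(sorted(contributing_agents(parents, author, "s4")))  # ['agent_a', 'agent_b', 'agent_c']
```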

Original abstract

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather, it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this section is the friction.

Desk Editor's Note (private letter to a colleague)

EinsteinArena lets agents collaborate on math problems through shared forums and leaderboards and claims 12 new records, including a higher kissing-number lower bound in dimension 11, but it does not show that the verifiers behind those records are sound.

The paper's core contribution is EinsteinArena, a platform that gives language-model agents live math problems, automatic verifiers, public leaderboards, and discussion forums so they can submit partial results, comment on each other's attempts, and borrow ideas. It reports that this setup produced 12 new state-of-the-art results, with the clearest example being an improvement on the kissing-number lower bound in dimension 11 from 593 to 604. The advance is presented as the product of multiple agents iterating through submissions, forum discussion, and verifier updates rather than any single run.

The platform idea itself is the genuinely new piece. Earlier agent work has mostly been isolated; here the authors supply an explicit shared environment and show that interaction can produce measurable progress on problems where success is unambiguous.

The soft spot is verification. The abstract does not describe the verifier for the kissing-number task: it says nothing about exact arithmetic, contact enumeration, or floating-point tolerances, and it does not release the 604-sphere configuration for independent checking. The concern therefore stands: if the verifier accepts an invalid packing, or if a configuration at least as good was already known, the collective-discovery story does not hold. Without that evidence, the reported records remain unconfirmed.
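
The checks the letter asks about can be done without any floating-point tolerance. A minimal sketch, assuming the configuration is supplied as nonzero vectors with rational (for example, integer) coordinates, each giving the direction from the central sphere to an outer sphere: two outer unit spheres do not overlap exactly when their directions are at least 60° apart, i.e. ⟨u, v⟩ ≤ |u||v|/2, and when ⟨u, v⟩ > 0 both sides can be squared, so no square roots are needed. This is an illustration, not the platform's verifier; `verify_kissing` and `d3_roots` are hypothetical names.

```python
from fractions import Fraction
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def at_least_60_degrees(u, v):
    """Exact test of <u, v> <= |u| |v| / 2, i.e. the angle between u and v is >= 60 degrees."""
    d = dot(u, v)
    if d <= 0:
        return True  # angle is at least 90 degrees
    # Both sides are positive here, so squaring preserves the inequality.
    return 4 * d * d <= dot(u, u) * dot(v, v)


def verify_kissing(vectors, dim):
    """Return the sphere count if `vectors` is a valid kissing configuration in R^dim, else None."""
    vecs = [tuple(Fraction(x) for x in v) for v in vectors]
    for v in vecs:
        if len(v) != dim or all(x == 0 for x in v):
            return None  # wrong dimension, or a zero vector with no direction
    for u, v in combinations(vecs, 2):
        if not at_least_60_degrees(u, v):
            return None  # these two outer spheres would overlap (duplicates included)
    return len(vecs)


def d3_roots():
    """The 12 vectors obtained from (+-1, +-1, 0) by permuting coordinates."""
    roots = []
    for i, j in combinations(range(3), 2):
        for si in (1, -1):
            for sj in (1, -1):
                v = [0, 0, 0]
                v[i], v[j] = si, sj
                roots.append(v)
    return roots


print(verify_kissing(d3_roots(), 3))                # 12
print(verify_kissing(d3_roots() + [[1, 0, 0]], 3))  # None
```

The D3 roots give the classical 12-sphere configuration in dimension 3. For 604 vectors the pairwise loop performs 182,106 exact checks, which is cheap, so exactness costs little here.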

The paper is aimed at researchers building multi-agent systems for open-ended scientific tasks. Anyone working on that intersection will find the setup and the reported interaction pattern useful to think about, even if they treat the numerical claims as provisional.

It deserves peer review. The claims are concrete enough that referees can ask for the verifier code and the actual configurations; if those are supplied, the work becomes checkable.
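
One way such configurations could be supplied is as a machine-readable certificate. The sketch below assumes integer coordinates and a made-up JSON layout (`dimension`, `vectors`, `body`, and `sha256` are hypothetical field names): the configuration is serialized in a canonical order, and a SHA-256 digest lets a reader confirm that the file matches what was published. The digest only guards against alteration if it is also published separately, for example beside the leaderboard entry; validity itself must still be checked by an independent verifier.

```python
import hashlib
import json


def write_certificate(dim, vectors):
    """Serialize an integer configuration canonically and attach its SHA-256 digest."""
    if any(len(v) != dim or not all(isinstance(x, int) for x in v) for v in vectors):
        raise ValueError("expected integer vectors of length dim")
    rows = sorted(tuple(v) for v in vectors)  # canonical order, independent of submission order
    body = json.dumps({"dimension": dim, "vectors": rows}, separators=(",", ":"))
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return json.dumps({"body": body, "sha256": digest})


def read_certificate(text):
    """Check the digest and shape, then return (dim, vectors) for independent re-verification."""
    cert = json.loads(text)
    body = cert["body"]
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != cert["sha256"]:
        raise ValueError("digest mismatch: certificate body was altered")
    data = json.loads(body)
    dim, vectors = data["dimension"], [tuple(v) for v in data["vectors"]]
    if any(len(v) != dim or not all(isinstance(x, int) for x in v) for v in vectors):
        raise ValueError("malformed vector in certificate")
    return dim, vectors


text = write_certificate(3, [[1, 1, 0], [0, 1, 1]])
print(read_certificate(text))  # (3, [(0, 1, 1), (1, 1, 0)])
```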

Referee Report

1 major / 2 minor

Summary. The paper introduces EinsteinArena, an agent-native platform providing AI agents with open mathematical problems, solid verifiers, public leaderboards, and discussion forums to enable collective discovery. It claims that as of May 2026 agents have produced 12 new state-of-the-art results superior to all prior human or AI solutions, with the kissing-number problem in dimension 11 serving as the central example: the lower bound was raised from 593 to 604 via a sequence of submissions, public discussion, verifier refinement, and agent-to-agent idea transfer rather than any single isolated run.

Significance. If the reported improvements are verifiably correct and strictly superior to all prior bounds, the work would supply concrete empirical support for decentralized collective intelligence emerging among autonomous language-model agents in an open setting, extending beyond isolated-agent benchmarks and illustrating a scalable paradigm for AI-driven mathematical research.

major comments (1)

[Abstract] The central claim that 12 new SOTA results have been discovered, including the kissing-number lower bound of 604 in dimension 11, rests on the soundness of the platform's problem-specific verifiers, yet the manuscript supplies no description of the verifier implementation (exact vs. floating-point arithmetic, contact enumeration, tolerance settings, or formal certificates). This is load-bearing because any undetected false positive, or any overlooked prior result at least as good, would invalidate the narrative of genuine collective discovery.
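
The distinction between exact and floating-point arithmetic matters in practice. In the hypothetical example below, two directions sit at an angle just below 60°, so the corresponding outer spheres overlap slightly; a floating-point test with an absolute tolerance of 10⁻⁹ accepts the pair, while an exact rational test rejects it. The coordinates and the tolerance are illustrative assumptions, not values from the paper.

```python
import math
from fractions import Fraction


def float_check(u, v, tol):
    """Floating-point angle test: accept if <u, v> <= |u| |v| / 2 + tol."""
    d = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return d <= norm_u * norm_v / 2 + tol


def exact_check(u, v):
    """Same test in exact rational arithmetic, squaring to avoid square roots."""
    d = sum(a * b for a, b in zip(u, v))
    if d <= 0:
        return True
    return 4 * d * d <= sum(a * a for a in u) * sum(b * b for b in v)


# Hypothetical pair: y is slightly below sqrt(3), so the angle between (2, 0) and (1, y)
# is slightly below 60 degrees and the two outer spheres overlap.
y = Fraction("1.732050807")
u_exact, v_exact = (Fraction(2), Fraction(0)), (Fraction(1), y)
u_float, v_float = (2.0, 0.0), (1.0, float(y))

print(float_check(u_float, v_float, tol=1e-9))  # True: the overlap is hidden by the tolerance
print(exact_check(u_exact, v_exact))            # False: the exact test catches it
```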

minor comments (2)

The manuscript would be strengthened by releasing at least one winning configuration (or a machine-readable certificate) for the kissing-number result so that independent verification is possible.
Clarify the precise criteria used to confirm that each of the 12 results is strictly better than every previously published human or AI solution.
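
A minimal sketch of one possible comparison rule, assuming each problem declares an optimization direction and, for continuous objectives, a minimum margin a new value must clear. The 604 and 593 values come from the paper; the continuous example is hypothetical. The hard part, assembling the correct prior best from the literature, cannot be automated this way.

```python
from fractions import Fraction


def is_strict_improvement(claimed, prior_best, maximize=True, margin=0):
    """True iff `claimed` beats `prior_best` by more than `margin` in the objective's direction."""
    gap = claimed - prior_best if maximize else prior_best - claimed
    return gap > margin


# Integer objective (kissing number in dimension 11): any increase counts, ties do not.
print(is_strict_improvement(604, 593))  # True
print(is_strict_improvement(593, 593))  # False
# Hypothetical continuous objective to be minimized, with a required margin of 1e-6.
print(is_strict_improvement(Fraction("0.4999999"), Fraction(1, 2),
                            maximize=False, margin=Fraction(1, 10**6)))  # False
```

Using `Fraction` for the continuous case keeps the margin comparison itself free of rounding error.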

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The concern regarding the lack of verifier implementation details is well-taken and directly impacts the credibility of the reported results. We address this point below and will revise the manuscript accordingly.

Point-by-point responses

Referee: [Abstract] The central claim that 12 new SOTA results have been discovered, including the kissing-number lower bound of 604 in dimension 11, rests on the soundness of the platform's problem-specific verifiers, yet the manuscript supplies no description of the verifier implementation (exact vs. floating-point arithmetic, contact enumeration, tolerance settings, or formal certificates). This is load-bearing because any undetected false positive, or any overlooked prior result at least as good, would invalidate the narrative of genuine collective discovery.

Authors: We agree that the current manuscript does not provide the technical detail on the problem-specific verifiers that independent verification of the claimed improvements requires. The abstract and main text emphasize the platform architecture and the collective discovery process but omit implementation specifics such as arithmetic precision, contact enumeration procedures, tolerance thresholds, and formal certificates. In the revised version we will add a dedicated subsection (likely under Methods or a new "Verifier Implementation" section) that describes these aspects for the primary problems, with particular attention to the 11-dimensional kissing number verifier. This addition will include the exact methods used to confirm that each submitted configuration is valid and that each claimed record strictly exceeds the best previously known result. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical platform outcomes, not derivations

Full rationale

The paper reports observed results from the EinsteinArena platform (12 new SOTA improvements, including the kissing-number bound in dimension 11) as empirical outcomes of agent submissions, discussions, and verifier checks. It presents no derivation chain, equations, fitted parameters, or predictions that reduce by construction to their inputs. The central claims rest on external verification of configurations rather than on self-definitional steps, load-bearing self-citations, or ansatzes smuggled in via prior work. The paper is therefore self-contained with respect to circularity: its reported performance can be checked against external benchmarks, and no load-bearing step collapses to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim relies on the existence and functionality of the described platform and the accuracy of the reported agent discoveries, which are not detailed beyond the abstract.

axioms (1)

Domain assumption: the mathematical problems have solid verifiers that measure progress unambiguously.
The abstract states 'each with a solid verifier' and 'progress can be measured unambiguously'.

invented entities (1)

EinsteinArena platform (no independent evidence)
Purpose: to enable collective AI-agent research on open problems.
The platform is introduced in the paper as a new system.

pith-pipeline@v0.9.1-grok · 5763 in / 1329 out tokens · 25037 ms · 2026-06-27T13:08:46.507016+00:00

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Socratic agents for autonomous scientific discovery in high-dimensional physical systems
cs.AI 2026-06 unverdicted novelty 6.0

AHOIS is a Socratic multi-agent AI that autonomously discovers and validates a random-interference encoding strategy for multimode fiber optics, achieving 76.97% MNIST and 83.17% Fashion-MNIST accuracy with 16x16 meas...
Structure of kissing arrangements in ${\mathbb R}^{12}$ and a place for the $841$st sphere
cs.IT 2026-06 unverdicted novelty 6.0

Kissing arrangements of 840 spheres in R^12 admit positive-dimensional families of non-isometric realizations via flexible 48-systems in each 60-point block with fixed bridges, enabling a numerical 841-sphere configur...

Reference graph

Works this paper leans on

39 extracted references · 4 canonical work pages · cited by 2 Pith papers

[1] Marc R. Best. Binary codes with a minimum distance of four (corresp.). IEEE Trans. Inf. Theory, 26:738–742, 1980. URL: https://api.semanticscholar.org/CorpusID:40030299

[2] Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality. Experimental Mathematics, pages 1–7, 2026.

[3] J. H. Conway and N. J. A. Sloane. Sphere Packings and Kissing Numbers, pages 1–30. Springer New York, New York, NY, 1988. doi:10.1007/978-1-4757-2016-7_1

[4] Werner Dinkelbach. On nonlinear fractional programming. Management Science, 13(7):492–498, 1967.

[5] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024.

[6] Mikhail Ganzhinov. Highly symmetric lines. Linear Algebra and its Applications, 722:12–37, 2025.

[7] Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. Metagpt: Meta programming for a multi-agent collaborative framework, 2024. URL: https://arxiv.org/abs/2308.00352, arXiv:2308.00352

[8] Thomas Hubert, Rishi Mehta, Laurent Sartran, Miklós Z Horváth, Goran Žužić, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning. Nature, 2025.

[9] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.

[10] Justin Kang and ClaudeExplorer. State-of-the-art solutions for the second autocorrelation inequality. Einstein Arena, 2026. URL: https://github.com/justinkang221/second-autocorrelation-inequality

[11] Yitzhak Katznelson. An introduction to harmonic analysis. Cambridge University Press, 2004.

[12] Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample-efficient program evolution. arXiv preprint arXiv:2509.19349, 2025.

[13] Woosang Lim. Alpha omega agents, 2026. URL: https://github.com/quasar17/Alpha_Omega_Agents

[14] Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024.

[15] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023.

[16] Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An ai scientist for autonomous discovery. arXiv preprint arXiv:2511.02824, 2025.

[17] Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

[18] Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. APOLLO: Automated LLM and lean collaboration for advanced formal reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[19] Christopher C Paige and Michael A Saunders. Lsqr: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43–71, 1982.

[20] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational ... doi:10.18653/v1/2024.acl-long.810

[21] Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658, 2026.

[22] Samuel Schmidgall and Michael Moor. Agentrxiv: Towards collaborative autonomous research. arXiv preprint arXiv:2503.18102, 2025.

[23] Jongmin Sung. Jsagent: An ai agent for hard mathematical optimization, 2026. URL: https://github.com/jmsung/einstein

[24] Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Y. Zou. The virtual lab of AI agents designs new SARS-CoV-2 nanobodies. Nature, 646:716–723, 2025. doi:10.1038/s41586-025-09442-9

[25] Terence Tao and Van H Vu. Additive combinatorics, volume 105. Cambridge University Press, 2006.

[26] Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Tatiana Shavrina, Kelvin Niu, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H Miller, Abhishek Charnalia, Derek Dunfield, ... AI research agents for machine learning: Search, exploration, and generalization in MLE-bench, 2026.

[27] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322, 2025.

[28] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Y Zou. Mixture-of-agents enhances large language model capabilities. In International Conference on Learning Representations, 2025.

[29] Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for gpu kernel performance optimization. arXiv preprint arXiv:2509.07506, 2025.

[30] Lukas Weidener, Marko Brkić, Phillip Lee, Martin Karlsson, Kevin Noessler, and Paul Kohlhaas. From agent-only social networks to autonomous scientific research: Lessons from openclaw and moltbook, and the architecture of clawdlab and beach.science. arXiv preprint arXiv:2602.19810, 2026.

[31] Sheng Xu, Qiantai Feng, Lifeng Qiao, Hao Wu, Tao Shen, Yu Cheng, Shuangjia Zheng, and Siqi Sun. Benchmarking all-atom biomolecular structure prediction with FoldBench. Nature Communications, December 2025. doi:10.1038/s41467-025-67127-3

[32] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.

[33] Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. Agentnet: Decentralized evolutionary coordination for llm-based multi-agent systems. URL: https://arxiv.org/abs/2504.00587, arXiv:2504.00587

[35] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, et al. Evaluation-driven scaling for scientific discovery. arXiv preprint arXiv:2604.19341, 2026.

[36] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time. ICML, 2026.

[37] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. Aflow: Automating agentic workflow generation, 2025. URL: https://arxiv.org/abs/2410.10762, arXiv:2410.10762

[38] Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, and James Zou. Sirius: Self-improving multi-agent systems via bootstrapped reasoning, 2025. URL: https://arxiv.org/abs/2502.04780, arXiv:2502.04780

[39] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs, 2024. URL: https://arxiv.org/abs/2402.16823, arXiv:2402.16823

A Detailed Descriptions of Problems

A.1 Kissing Number (d = 11)

The kissing number in dimension d ∈ ℕ asks for the maximum number of non-overlapping unit spher...

[1] [1]

Marc R. Best. Binary codes with a minimum distance of four (corresp.).IEEE Trans. Inf. Theory, 26:738–742, 1980. URL:https://api.semanticscholar.org/CorpusID:40030299

1980

[2] [2]

An improved example for an autoconvolution inequality

Christopher Boyer and Zane Kun Li. An improved example for an autoconvolution inequality. Experimental Mathematics, pages 1–7, 2026

2026

[3] [3]

J. H. Conway and N. J. A. Sloane.Sphere Packings and Kissing Numbers, pages 1–30. Springer New York, New York, NY, 1988.doi:10.1007/978-1-4757-2016-7_1

work page doi:10.1007/978-1-4757-2016-7_1 1988

[4] [4]

On nonlinear fractional programming.Management science, 13(7):492– 498, 1967

Werner Dinkelbach. On nonlinear fractional programming.Management science, 13(7):492– 498, 1967

1967

[5] [5]

Improv- ing factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv- ing factuality and reasoning in language models through multiagent debate. InForty-first international conference on machine learning, 2024

2024

[6] [6]

Highly symmetric lines.Linear Algebra and its Applications, 722:12–37, 2025

Mikhail Ganzhinov. Highly symmetric lines.Linear Algebra and its Applications, 722:12–37, 2025

2025

[7] [7]

Metagpt: Meta programming for a multi- agent collaborative framework, 2024

Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and J¨ urgen Schmidhuber. Metagpt: Meta programming for a multi- agent collaborative framework, 2024. URL:https://arxiv.org/abs/2308.00352,arXiv: 2308.00352

Pith/arXiv arXiv 2024

[8] [8]

Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025

Thomas Hubert, Rishi Mehta, Laurent Sartran, Mikl´ os Z Horv´ ath, GoranˇZuˇ zi´ c, Eric Wieser, Aja Huang, Julian Schrittwieser, Yannick Schroecker, Hussain Masoom, et al. Olympiad-level formal mathematical reasoning with reinforcement learning.Nature, 2025

2025

[9] [9]

Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

Pith/arXiv arXiv 2023

[10] [10]

State-of-the-art solutions for the second autocorre- lation inequality

Justin Kang and ClaudeExplorer. State-of-the-art solutions for the second autocorre- lation inequality. Einstein Arena, 2026. URL:https://github.com/justinkang221/ second-autocorrelation-inequality

2026

[11] [11]

Cambridge University Press, 2004

Yitzhak Katznelson.An introduction to harmonic analysis. Cambridge University Press, 2004

2004

[12] [12]

Shinkaevolve: Towards open-ended and sample-efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

Robert Tjarko Lange, Yuki Imajuku, and Edoardo Cetin. Shinkaevolve: Towards open-ended and sample-efficient program evolution.arXiv preprint arXiv:2509.19349, 2025

Pith/arXiv arXiv 2025

[13] [13]

Alpha omega agents, 2026

Woosang Lim. Alpha omega agents, 2026. URL:https://github.com/quasar17/Alpha_ Omega_Agents. 13

2026

[14] [14]

The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

Pith/arXiv arXiv 2024

[15] [15]

Self-refine: Iterative refine- ment with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refine- ment with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023

[16] [16]

Kosmos: An ai scientist for autonomous discovery.arXiv preprint arXiv:2511.02824, 2025

Ludovico Mitchener, Angela Yiu, Benjamin Chang, Mathieu Bourdenx, Tyler Nadolski, Arvis Sulovari, Eric C Landsness, Daniel L Barabasi, Siddharth Narayanan, Nicky Evans, et al. Kosmos: An ai scientist for autonomous discovery.arXiv preprint arXiv:2511.02824, 2025

Pith/arXiv arXiv 2025

[17] [17]

Alexander Novikov, Ngˆ an V˜ u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

[18] Azim Ospanov, Farzan Farnia, and Roozbeh Yousefzadeh. APOLLO: Automated LLM and Lean collaboration for advanced formal reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.

[19] Christopher C Paige and Michael A Saunders. LSQR: An algorithm for sparse linear equations and sparse least squares. ACM Transactions on Mathematical Software (TOMS), 8(1):43–71, 1982.

[20] Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for Computational ... 2024. doi:10.18653/v1/2024.acl-long.810.

[21] Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, et al. Coral: Towards autonomous multi-agent evolution for open-ended discovery. arXiv preprint arXiv:2604.01658, 2026.

[22] Samuel Schmidgall and Michael Moor. AgentRxiv: Towards collaborative autonomous research. arXiv preprint arXiv:2503.18102, 2025.

[23] Jongmin Sung. Jsagent: An AI agent for hard mathematical optimization, 2026. URL: https://github.com/jmsung/einstein.

[24] Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Y. Zou. The virtual lab of AI agents designs new SARS-CoV-2 nanobodies. Nature, 646:716–723, 2025. doi:10.1038/s41586-025-09442-9.

[25] Terence Tao and Van H Vu. Additive combinatorics, volume 105. Cambridge University Press, 2006.

[26] Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Tatiana Shavrina, Kelvin Niu, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H Miller, Abhishek Charnalia, Derek Dunfield, ... AI research agents for machine learning: Search, exploration, and generalization in MLE-bench, 2026.

[27] Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi-agent collaboration mechanisms: A survey of LLMs. arXiv preprint arXiv:2501.06322, 2025.

[28] Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Y Zou. Mixture-of-agents enhances large language model capabilities. In International Conference on Learning Representations, 2025.

[29] Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. Astra: A multi-agent system for GPU kernel performance optimization. arXiv preprint arXiv:2509.07506, 2025.

[30] Lukas Weidener, Marko Brkić, Phillip Lee, Martin Karlsson, Kevin Noessler, and Paul Kohlhaas. From agent-only social networks to autonomous scientific research: Lessons from openclaw and moltbook, and the architecture of clawdlab and beach.science. arXiv preprint arXiv:2602.19810, 2026.

[31] Sheng Xu, Qiantai Feng, Lifeng Qiao, Hao Wu, Tao Shen, Yu Cheng, Shuangjia Zheng, and Siqi Sun. Benchmarking all-atom biomolecular structure prediction with FoldBench. Nature Communications, December 2025. doi:10.1038/s41467-025-67127-3.

[32] Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint arXiv:2504.08066, 2025.

[33] Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang. AgentNet: Decentralized evolutionary coordination for LLM-based multi-agent systems. URL: https://arxiv.org/abs/2504.00587, arXiv:2504.00587.

[35] Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, et al. Evaluation-driven scaling for scientific discovery. arXiv preprint arXiv:2604.19341, 2026.

[36] Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, and Yu Sun. Learning to discover at test time. ICML, 2026.

[37] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu. AFlow: Automating agentic workflow generation, 2025. URL: https://arxiv.org/abs/2410.10762, arXiv:2410.10762.

[38] Wanjia Zhao, Mert Yuksekgonul, Shirley Wu, and James Zou. Sirius: Self-improving multi-agent systems via bootstrapped reasoning, 2025. URL: https://arxiv.org/abs/2502.04780, arXiv:2502.04780.

[39] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. Language agents as optimizable graphs, 2024. URL: https://arxiv.org/abs/2402.16823, arXiv:2402.16823.

A Detailed Descriptions of Problems

A.1 Kissing Number (d = 11)

The kissing number in dimension d ∈ ℕ asks the maximum number of non-overlapping unit spher...
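To make the idea of a problem verifier concrete for the kissing number, here is a minimal Python sketch of a floating-point check for a candidate configuration. It is an illustration under stated assumptions, not the platform's actual verifier: the function names `is_kissing_configuration` and `d_n_roots` and the tolerance `tol` are hypothetical. Each candidate vector is read as the direction of a unit sphere touching a central unit sphere, so its center sits at distance 2 from the origin. Two such spheres do not overlap exactly when their centers are at least 2 apart, which for centers at radius 2 is equivalent to the cosine of the angle between the directions being at most 1/2.

```python
import itertools
import math


def is_kissing_configuration(points, tol=1e-9):
    """Hypothetical floating-point check of a kissing configuration.

    Each nonzero vector gives the direction of a unit sphere whose center is
    placed at distance 2 from the origin (touching a central unit sphere).
    Centers 2u and 2v are at least 2 apart iff u . v <= 1/2 for unit u, v.
    """
    if not points:
        return True
    dim = len(points[0])
    units = []
    for p in points:
        if len(p) != dim:
            return False  # all vectors must live in the same dimension
        norm = math.sqrt(sum(x * x for x in p))
        if norm <= tol:
            return False  # a zero vector does not define a direction
        units.append([x / norm for x in p])
    for u, v in itertools.combinations(units, 2):
        cos_angle = sum(a * b for a, b in zip(u, v))
        if cos_angle > 0.5 + tol:
            return False  # centers closer than 2: the two spheres overlap
    return True


def d_n_roots(d):
    """All vectors with exactly two nonzero entries, each +1 or -1 (D_d roots)."""
    roots = []
    for i, j in itertools.combinations(range(d), 2):
        for si, sj in itertools.product((1, -1), repeat=2):
            v = [0] * d
            v[i], v[j] = si, sj
            roots.append(v)
    return roots


if __name__ == "__main__":
    for d in (3, 11):
        roots = d_n_roots(d)
        print(d, len(roots), is_kissing_configuration(roots))
```

As a sanity check, the D_d root construction yields 2d(d-1) mutually non-overlapping touching spheres: 12 in dimension 3 and 220 in dimension 11, far below the previous bound of 593 for d = 11. A verifier meant to certify record-setting configurations would more plausibly use exact (integer or rational) arithmetic than a floating-point tolerance.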
