pith. machine review for the scientific record.

arxiv: 2604.20622 · v1 · submitted 2026-04-22 · 💻 cs.AI · cs.LG · cs.MA

Recognition: unknown

pAI/MSc: ML Theory Research with Humans on the Loop

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:57 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA
keywords multi-agent systems · machine learning theory · human-in-the-loop AI · research workflow automation · manuscript drafting · open source tools · academic research

The pith

A modular multi-agent system reduces the human steering needed to produce ML theory manuscripts by orders of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents pAI/MSc as an open-source, modular multi-agent system for academic research in machine learning theory and related fields. Its central goal is to minimize the human involvement needed to steer the process from a stated hypothesis to a full manuscript that incorporates relevant literature, mathematical foundations, and experimental results and is ready for submission. Rather than pursuing fully autonomous research or idea generation, the system emphasizes a practical reduction in effort while retaining human oversight at key points. If effective, this could let researchers allocate more time to the creative aspects of inquiry instead of routine tasks like searching papers or setting up experiments.

Core claim

pAI/MSc is a customizable, modular multi-agent system that, given a hypothesis, produces a literature-grounded, mathematically established, experimentally supported, and submission-oriented manuscript draft with orders of magnitude less human steering than traditional workflows.

What carries the argument

The modular multi-agent architecture in pAI/MSc that distributes tasks across specialized agents for literature retrieval, mathematical reasoning, code execution for experiments, and text generation, all under human supervision.
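
To make that division of labor concrete, here is a minimal sketch of such a pipeline, assuming a shared-state design with a human sign-off between stages. The Agent class, stage names, and approval prompt are illustrative inventions, not the paper's code.

```python
# Hypothetical sketch of a modular multi-agent research pipeline with a
# human-on-the-loop checkpoint after each specialized agent. None of these
# names come from pAI/MSc itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[dict], dict]  # reads shared state, returns updates

def human_approves(stage: str, state: dict) -> bool:
    # Human oversight point: a person signs off before the next stage runs.
    return input(f"[{stage}] accept output? (y/n) ").strip().lower() == "y"

def run_pipeline(hypothesis: str, agents: list[Agent]) -> dict:
    state = {"hypothesis": hypothesis}
    for agent in agents:
        while True:
            state.update(agent.run(state))
            if human_approves(agent.name, state):
                break  # otherwise the same agent reruns on the updated state
    return state

# Example wiring; in practice each lambda would wrap an LLM or tool call.
agents = [
    Agent("literature", lambda s: {"related_work": ["..."]}),
    Agent("theory",     lambda s: {"proof_sketch": "..."}),
    Agent("experiment", lambda s: {"results": {}}),
    Agent("writing",    lambda s: {"draft": "manuscript.tex"}),
]
```

The point of the shared-state design is that each specialized agent only reads and writes the common record, so modules can be swapped or re-ordered without rewiring the others.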

Load-bearing premise

That the current capabilities of large language models and agent coordination can accurately and reliably execute the steps of literature review, mathematical proof construction, and experimental validation with only minimal human corrections.

What would settle it

Running the system on a well-known ML theory hypothesis and having domain experts review the output draft for accuracy in citations, mathematical correctness, and experimental validity to determine if it meets submission standards without extensive revisions.
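
To make "less human steering" measurable rather than impressionistic, one hedged sketch of per-stage intervention accounting follows; the logging function and the baseline figure are assumptions for illustration, not numbers from the paper.

```python
# Hypothetical instrumentation for the proposed test: count every human
# correction per pipeline stage and compare against an assumed baseline
# for a fully manual workflow.
import math
from collections import Counter

interventions = Counter()

def log_intervention(stage: str) -> None:
    # Called whenever a human corrects, redirects, or overrides an agent.
    interventions[stage] += 1

# Simulated run: a handful of human touches across stages.
log_intervention("literature")
log_intervention("theory")
log_intervention("theory")

BASELINE = 500  # assumed intervention count for a traditional workflow
observed = sum(interventions.values())
reduction = math.log10(BASELINE / max(observed, 1))
print(f"{observed} interventions vs ~{BASELINE}: about 10^{reduction:.1f} fewer")
```

A log like this, gathered alongside the expert review of citations, proofs, and experiments, is what would turn the "orders of magnitude" phrase into a checkable number.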

Figures

Figures reproduced from arXiv: 2604.20622 by Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1
Figure 1. The pAI/MSc execution graph. Dashed violet borders mark counsel-eligible agents. Red dashed arrows are loopbacks triggered by gate failures. Theory and experiment tracks run in parallel when both are selected.
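
A minimal sketch of the control flow the caption describes, with hypothetical stage, gate, and feedback callables (none taken from the paper): a failed gate loops back with feedback, and the theory and experiment tracks run concurrently.

```python
# Sketch of a gated execution graph: each stage ends in a gate check, a
# failed gate triggers a loopback with feedback, and two tracks run in
# parallel. All names and gate logic here are assumptions.
from concurrent.futures import ThreadPoolExecutor

def run_with_gate(stage, gate, make_feedback, max_loops=3):
    # Loopback behavior (the red dashed arrows): rerun the stage with
    # feedback until the gate passes or the human is pulled in.
    feedback = None
    for _ in range(max_loops):
        out = stage(feedback)
        if gate(out):
            return out
        feedback = make_feedback(out)
    raise RuntimeError("gate kept failing; escalate to the human supervisor")

# Stub stages standing in for the two tracks; real ones would call agents.
theory_stage = lambda fb: {"proof": "sketch", "revised_from": fb}
exp_stage    = lambda fb: {"metrics": {"loss": 0.1}, "revised_from": fb}
gate_passes  = lambda out: True             # e.g. a verifier or counsel agent
feedback_for = lambda out: "tighten step 3"

# Theory and experiment tracks run in parallel when both are selected.
with ThreadPoolExecutor() as pool:
    theory = pool.submit(run_with_gate, theory_stage, gate_passes, feedback_for)
    experiment = pool.submit(run_with_gate, exp_stage, gate_passes, feedback_for)
    draft_inputs = {"theory": theory.result(), "experiment": experiment.result()}
```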
read the original abstract

We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows with emphasis on machine learning theory. The stated goal is to reduce by orders of magnitude the human steering required to convert a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft while keeping humans in the loop; the manuscript describes the system architecture but supplies no implementation details, experiments, or metrics.

Significance. If the claimed reduction in human intervention were demonstrated while preserving output quality, the system could meaningfully increase research throughput in quantitative fields. The open-source and modular design is a strength that would support reproducibility and extension by the community. In its current form, however, the manuscript offers only a high-level system description without evidence, so any significance assessment remains prospective.

major comments (1)
  1. Abstract: The central claim of an 'orders of magnitude' reduction in human steering for literature grounding, mathematical establishment, experiment design, and manuscript assembly is unsupported by any quantitative data, logged intervention counts, user studies, baseline comparisons, or worked examples. This renders the claim an untested design goal rather than a demonstrated property of the system.
minor comments (1)
  1. The manuscript would benefit from a dedicated section detailing the agent roles, communication protocols, and customization interfaces, as these are referenced only at a high level in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript on pAI/MSc. We appreciate your assessment of its potential significance and agree that additional clarification is needed regarding the system's claimed capabilities. We provide a point-by-point response to the major comment below.

read point-by-point responses
  1. Referee: Abstract: The central claim of an 'orders of magnitude' reduction in human steering for literature grounding, mathematical establishment, experiment design, and manuscript assembly is unsupported by any quantitative data, logged intervention counts, user studies, baseline comparisons, or worked examples. This renders the claim an untested design goal rather than a demonstrated property of the system.

    Authors: We thank the referee for this observation. The manuscript indeed presents pAI/MSc primarily as a system architecture and design, with the reduction in human steering stated as the core objective enabled by its customizable multi-agent framework. No quantitative evaluations, such as intervention counts or user studies, are included because the current work focuses on describing the system rather than evaluating its performance metrics. We will revise the abstract and relevant sections to explicitly characterize the 'orders of magnitude' reduction as a design goal and intended benefit, rather than a demonstrated result. Additionally, we will expand on implementation details, provide worked examples of the workflow where possible, and outline plans for future empirical validation to address this concern. revision: yes

Circularity Check

0 steps flagged

No derivations, predictions, or equations; a system-description paper presents no circularity to audit

full rationale

The manuscript is a descriptive account of an open-source multi-agent architecture for research assistance. It states design goals (reducing human steering by orders of magnitude) but supplies no equations, fitted parameters, uniqueness theorems, self-citations used as load-bearing premises, or renamings of empirical patterns. The central claim is an untested assertion about future performance rather than a derivation that reduces to its own inputs. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The preprint describes a software system for research assistance rather than a theoretical model with fitted parameters, axioms, or new scientific entities. No mathematical derivations or empirical claims requiring such elements are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5377 in / 1211 out tokens · 49867 ms · 2026-05-09T23:57:06.983235+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

103 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024. doi: 10.1038/s41586-023-06924...

  2. [2]

    Funsearch

    Google DeepMind. Funsearch. GitHub repository, 2023. URL https://github.com/google-deepmind/funsearch. Repository accompanying the FunSearch Nature paper; accessed 2026-03-21

  3. [3]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  4. [4]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale. arXiv preprint arXiv:2511.02864, 2025. doi: 10.48550/arXiv.2511.02864. URL https://arxiv.org/abs/2511.02864

  5. [5]

    Mathematical problem repository for alphaevolve

    Google DeepMind. Mathematical problem repository for alphaevolve. GitHub repository, 2025. URL https://github.com/google-deepmind/alphaevolve_repository_of_problems. Repository accompanying the Mathematical exploration and discovery at scale preprint; accessed 2026-03-21

  6. [6]

    Reinforced Generation of Combinatorial Structures: Ramsey Numbers

    Ansh Nagda, Prabhakar Raghavan, and Abhradeep Thakurta. Reinforced generation of combinatorial structures: Ramsey numbers. arXiv preprint arXiv:2603.09172, 2026. doi: 10.48550/arXiv.2603.09172. URL https://arxiv.org/abs/2603.09172

  7. [7]

    Donald E. Knuth. Claude’s cycles. Informal note / PDF on Knuth’s preprints page, February 2026. URL https://cs.stanford.edu/~knuth/papers/claude-cycles.pdf. Dated 2026-02-28; revised 2026-03-16

  8. [8]

    The story of Erdős problem #1026

    Terence Tao. The story of Erdős problem #1026. Blog post on What’s New, December 2025. URL https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/. Published 2025-12-08

  9. [9]

    First proof

    Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First proof. arXiv preprint arXiv:2602.05192, 2026. doi: 10.48550/arXiv.2602.05192. URL https://arxiv.org/abs/2602.05192

  10. [10]

    First batch

    First Proof Project. First batch. Project website, February 2026. URL https://1stproof.org/first-batch.html. First-batch page; site lists February 2026 release context; accessed 2026-03-21

  11. [11]

    Our first proof submissions

    OpenAI. Our first proof submissions. OpenAI research page, February 2026. URL https://openai.com/index/first-proof-submissions/. Published 2026-02-20

  12. [12]

    Lean 4 formal verification of 8/10 #1stproof problems: Complete proofs with ai–human pipeline, partial qed for q4 & q6

    Wenlin Zhang and Haobo Ma. Lean 4 formal verification of 8/10 #1stproof problems: Complete proofs with ai–human pipeline, partial qed for q4 & q6. Zenodo preprint, February 2026. URL https://zenodo.org/records/18635744. Created 2026-02-13. Zenodo also lists a second record with the same title and metadata at DOI 10.5281/zenodo.18635110

  13. [13]

    Advancing science and math with gpt-5.2

    OpenAI. Advancing science and math with gpt-5.2. OpenAI publication, December 2025. URL https://openai.com/index/gpt-5-2-for-science-and-math. Published 2025-12-11

  14. [14]

    On learning-curve monotonicity for maximum likelihood estimators

    Mark Sellke and Steven Yin. On learning-curve monotonicity for maximum likelihood estimators. arXiv preprint arXiv:2512.10220, 2025. doi: 10.48550/arXiv.2512.10220. URL https://arxiv.org/abs/2512.10220

  15. [15]

    Introducing gauss, an agent for autoformalization

    Math, Inc. Introducing gauss, an agent for autoformalization. Company blog post, n.d. URL https://www.math.inc/gauss. Undated page; accessed 2026-03-21

  16. [16]

    Strong pnt

    Math, Inc. Strong pnt. Project page, n.d. URL https://math-inc.github.io/strongpnt/. Undated page; accessed 2026-03-21

  17. [17]

    strongpnt

    Math, Inc. strongpnt. GitHub repository, n.d. URL https://github.com/math-inc/strongpnt. Repository for the Strong PNT formalization; accessed 2026-03-21

  18. [18]

    Gauss – an agentic formalization of the prime number theorem

    Jared Duker Lichtman. Gauss – an agentic formalization of the prime number theorem. Fields Institute talk page, October 2025. URL https://www.fields.utoronto.ca/talks/Gauss-agentic-formalization-Prime-Number-Theorem. Talk date: 2025-10-28

  19. [19]

    Resolution of Erdős problem #728: a writeup of Aristotle’s Lean proof

    Nat Sothanaphan. Resolution of Erdős problem #728: a writeup of Aristotle’s Lean proof. arXiv preprint arXiv:2601.07421, 2026. doi: 10.48550/arXiv.2601.07421. URL https://arxiv.org/abs/2601.07421

  20. [20]

    Today marks a momentous milestone for ai and mathematics

    Harmonic. Today marks a momentous milestone for ai and mathematics. X post, January 2026. URL https://x.com/HarmonicMath/status/2008693723413225814. Posted 2026-01-06; dynamic-source metadata should be rechecked before camera-ready copy if cited in the main text

  21. [21]

    Thomas F. Bloom. Erdős problem #728. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/728. Page last edited 2026-01-06; accessed 2026-03-21

  22. [22]

    Thomas F. Bloom. Erdős problem #729. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/729. Page last edited 2026-01-11; accessed 2026-03-21

  23. [23]

    Thomas F. Bloom. Erdős problem #397. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/397. Page last edited 2026-01-12; accessed 2026-03-21

  24. [24]

    Erdős problem database

    teorth. Erdős problem database. GitHub repository, n.d. URL https://github.com/teorth/erdosproblems. Repository README accessed 2026-03-21

  25. [25]

    gpt-5 has solved an unsolved mathematical problem,

    GIGAZINE. An openai researcher posted that “gpt-5 has solved an unsolved mathematical problem,” but it turned out that the problem had already been solved, leading to ridicule from rival developers, including google deepmind ceo demis hassabis. News article, October 2025. URL https://gigazine.net/gsc_news/en/20251020-openai-researcher-announced-gpt-5-ma...

  26. [26]

    Gênant: Openai beweert dat chatgpt wiskundeproblemen oplost, maar dat klopt niet

    Erwin Vogelaar. Gênant: Openai beweert dat chatgpt wiskundeproblemen oplost, maar dat klopt niet [Embarrassing: OpenAI claims that ChatGPT solves math problems, but that is not true]. Bright.nl news article, October 2025. URL https://www.bright.nl/nieuws/1703437/g-nant-openai-beweert-dat-chatgpt-wiskundeproblemen-oplost-maar-dat-klopt-niet.html. Published 2025-10-20; accessed 2026-03-21

  27. [27]

    Vibe physics: The AI grad student

    Matthew D. Schwartz. Vibe physics: The AI grad student. Anthropic Science Blog, March 2026. URL https://www.anthropic.com/research/vibe-physics. Accessed: 2026-03-24

  28. [28]

    Resummation of the c-parameter Sudakov shoulder using effective field theory

    Matthew D. Schwartz. Resummation of the c-parameter Sudakov shoulder using effective field theory. arXiv preprint arXiv:2601.02484, 2026. doi: 10.48550/arXiv.2601.02484. URL https://arxiv.org/abs/2601.02484

  29. [29]

    Troubling trends in machine learning scholarship

    Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Queue, 17(1), 2019. doi: 10.1145/3317287.3328534. URL https://doi.org/10.1145/3317287.3328534. ACM Queue article; multiple secondary indexes report pages 45–77, but page/article-number formatting varies across services, so pages are omitted here deliberately

  30. [30]

    troubling trends in machine learning scholarship

    Andrew Gelman. “troubling trends in machine learning scholarship”. Statistical Modeling, Causal Inference, and Social Science blog, September 2019. URL https://statmodeling.stat.columbia.edu/2019/09/30/troubling-trends-in-machine-learning-scholarship/. Blog commentary pointing to Lipton and Steinhardt and discussing hype, “provably” language, and advert...

  31. [31]

    Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program)

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021. URL https://www.jmlr.org/papers/v22/20...

  32. [32]

    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili...

  33. [33]

    Study identifies weaknesses in how AI systems are evaluated

    Oxford Internet Institute. Study identifies weaknesses in how AI systems are evaluated. Press release, November 2025. URL https://www.oii.ox.ac.uk/news-events/study-identifies-weaknesses-in-how-ai-systems-are-evaluated/. Press release accompanying the benchmark-validity study; includes quoted claims about unclear definitions, weak methods, and misleadin...

  34. [34]

    Hidden technical debt in machine learning systems

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28, pages 2503–2511, 2015. URL https://papers.nips.cc/paper/5656-hidden-technical-debt-in-mac...

  35. [35]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pages 3207–3214, 2018. doi: 10.1609/AAAI.V32I1.11694. URL https://doi.org/10.1609/AAAI.V32I1.11694

  36. [36]

    Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations

    Nick McGreivy and Ammar Hakim. Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations. Nature Machine Intelligence, 6(10):1256–1269, 2024. doi: 10.1038/s42256-024-00897-5. URL https://doi.org/10.1038/s42256-024-00897-5

  37. [37]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020. doi: 10.1038/s42256-020-00257-z. URL https://doi.org/10.1038/s42256-020-00257-z

  38. [38]

    deep-significance — easy and meaningful statistical significance testing in the age of neural networks

    Dennis Ulmer, Christian Hardmeier, and Jes Frellsen. deep-significance — easy and meaningful statistical significance testing in the age of neural networks, 2022. URL https://doi.org/10.48550/arXiv.2204.06815. arXiv preprint; also listed as a contribution to the ML Evaluation Standards Workshop at ICLR 2022 in institutional repositories

  39. [39]

    A causal perspective on dataset bias in machine learning for medical imaging

    Charles Jones, Daniel C. Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, and Ben Glocker. A causal perspective on dataset bias in machine learning for medical imaging. Nature Machine Intelligence, 6:138–146, 2024. doi: 10.1038/s42256-024-00797-8. URL https://doi.org/10.1038/s42256-024-00797-8

  40. [40]

    Can robots do epidemiology? Machine learning, causal inference, and predicting the outcomes of public health interventions

    Alex Broadbent and Thomas Grote. Can robots do epidemiology? Machine learning, causal inference, and predicting the outcomes of public health interventions. Philosophy & Technology, 35:14, 2022. doi: 10.1007/s13347-022-00509-3. URL https://doi.org/10.1007/s13347-022-00509-3. Springer presents this as volume 35, article number 14; issue and expanded page-ran...

  41. [41]

    Reporting of artificial intelligence prediction models

    Gary S. Collins and Karel G. M. Moons. Reporting of artificial intelligence prediction models. The Lancet, 393(10181):1577–1579, 2019. doi: 10.1016/S0140-6736(19)30037-6. URL https://doi.org/10.1016/S0140-6736(19)30037-6

  42. [42]

    AI Snake Oil

    Liz Fuller-Wright. “AI Snake Oil”: A Conversation with Princeton AI Experts Arvind Narayanan and Sayash Kapoor. Princeton University News, December 2024. URL https://www.princeton.edu/news/2024/12/18/ai-snake-oil-conversation-princeton-ai-experts-arvind-narayanan-and-sayash-kapoor. Interview/article quoting Narayanan and Kapoor on AI that does not work a...

  43. [43]

    Autoresearch

    Andrej Karpathy. Autoresearch. GitHub repository, 2026. URL https://github.com/karpathy/autoresearch/blob/master/program.md. Repository documentation in program.md; accessed 2026-03-26

  44. [44]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=1Fs1LvjYQW

  45. [45]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024. doi: 10.48550/arXiv.2408.06292. URL https://arxiv.org/abs/2408.06292

  46. [46]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pag...

  47. [47]

    Mlr-bench: Evaluating AI agents on open-ended machine learning research

    Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. Mlr-bench: Evaluating AI agents on open-ended machine learning research. arXiv preprint arXiv:2505.19955, 2025. doi: 10.48550/arXiv.2505.19955. URL https://arxiv.org/abs/2505.19955

  48. [48]

    Defining and identifying sleeping beauties in science

    Qing Ke, Emilio Ferrara, Filippo Radicchi, and Alessandro Flammini. Defining and identifying sleeping beauties in science. Proceedings of the National Academy of Sciences, 112(24):7426–7431, 2015. doi: 10.1073/pnas.1424329112. URL https://www.pnas.org/doi/10.1073/pnas.1424329112

  49. [49]

    Bibliometrics: The Leiden manifesto for research metrics

    Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The leiden manifesto for research metrics. Nature, 520(7548):429–431, 2015. doi: 10.1038/520429a. URL https://www.nature.com/articles/520429a

  50. [50]

    The metric tide: Report of the independent review of the role of metrics in research assessment and management

    James Wilsdon, Liz Allen, Eleonora Belfiore, Philip Campbell, Stephen Curry, Steven Hill, Richard Jones, Jude Hill, Roger Kain, Ben Johnson, Simon Kerridge, Jane Tinkler, Mike Thelwall, Paul Wouters, and Ian Viney. The metric tide: Report of the independent review of the role of metrics in research assessment and management. Technical report, Higher Ed...

  51. [51]

    URL https://hdl.handle.net/10779/uos.23418680

  52. [52]

    Over-optimization of academic publishing metrics: Observing goodhart’s law in action.GigaScience, 8(6):giz053, 2019

    Michael Fire and Carlos Guestrin. Over-optimization of academic publishing metrics: Observing goodhart’s law in action. GigaScience, 8(6):giz053, 2019. doi: 10.1093/gigascience/giz053. URL https://doi.org/10.1093/gigascience/giz053

  53. [53]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024

  54. [54]

    Citebench: A benchmark for scientific citation text generation, 2022

    Martin Funkquist, Ilia Kuznetsov, Yufang Hou, and Iryna Gurevych. Citebench: A benchmark for scientific citation text generation, 2022. URL https://arxiv.org/abs/2212.09577. Using the arXiv submission year; later bibliographic records may surface under 2023 metadata updates

  55. [55]

    Chatcite: LLM agent with human workflow guidance for comparative literature summary, 2024

    Yutong Li, Lu Chen, Aiwei Liu, Kai Yu, and Lijie Wen. Chatcite: LLM agent with human workflow guidance for comparative literature summary, 2024. URL https://arxiv.org/abs/2403.02574

  56. [56]

    Scholarcopilot: Training large language models for academic writing with accurate citations, 2025

    Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, and Wenhu Chen. Scholarcopilot: Training large language models for academic writing with accurate citations, 2025. URL https://arxiv.org/abs/2504.00824

  57. [57]

    Surveygen: Quality-aware scientific survey generation with large language models, 2025

    Tong Bao, Mir Tafseer Nayeem, Davood Rafiei, and Chengzhi Zhang. Surveygen: Quality-aware scientific survey generation with large language models, 2025. URL https://arxiv.org/abs/2508.17647

  58. [58]

    Overleafcopilot: Empowering academic writing in Overleaf with large language models, 2024

    Haomin Wen, Zhenjie Wei, Yan Lin, Jiyuan Wang, Yuxuan Liang, and Huaiyu Wan. Overleafcopilot: Empowering academic writing in Overleaf with large language models, 2024. URL https://arxiv.org/abs/2403.09733

  59. [59]

    Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing

    Junyi Hou, Huikai Andre Lin, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing, 2025. URL https://arxiv.org/abs/2512.02589

  60. [60]

    Autonomous LLM-driven research – from data to human-verifiable research papers

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous LLM-driven research — from data to human-verifiable research papers. NEJM AI, 2(1), 2025. doi: 10.1056/AIoa2400555. URL https://ai.nejm.org/doi/10.1056/AIoa2400555

  61. [61]

    The AI scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

  62. [62]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

  63. [63]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.fi...

  64. [64]

    Cycleresearcher: Improving automated research via automated review

    Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review, 2024. URL https://arxiv.org/abs/2411.00816. First submitted in 2024; later revised in 2025

  65. [65]

    AI-researcher: Autonomous scientific innovation

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: Autonomous scientific innovation, 2025. URL https://arxiv.org/abs/2505.18705

  66. [66]

    Agentrxiv: Towards collaborative autonomous research

    Samuel Schmidgall and Michael Moor. Agentrxiv: Towards collaborative autonomous research, 2025. URL https://arxiv.org/abs/2503.18102

  67. [67]

    Build your personalized research group: A multiagent framework for continual and interactive science automation

    Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, and Zhuoran Yang. Build your personalized research group: A multiagent framework for continual and interactive science automation,

  68. [68]

    URL https://arxiv.org/abs/2510.15624

  69. [69]

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  70. [70]

    Internagent: When agent becomes the scientist–building closed-loop system from hypothesis to verification

    InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, and Lei Bai. Internagent: When a...

  71. [71]

    Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery

    Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lilong Wang, Zi...

  72. [72]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  73. [73]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, 2025. URL https://arxiv.org/abs/2511.02864

  74. [74]

    Deepinnovator: Triggering the innovative capabilities of LLMs

    Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, and Chao Huang. Deepinnovator: Triggering the innovative capabilities of LLMs, 2026. URL https://arxiv.org/abs/2602.18920

  75. [75]

    From agent-only social networks to autonomous scientific research: Lessons from OpenClaw and Moltbook, and the architecture of ClawdLab and Beach

    Lukas Weidener, Marko Brkić, Phillip Lee, Martin Karlsson, Kevin Noessler, and Paul Kohlhaas. From agent-only social networks to autonomous scientific research: Lessons from OpenClaw and Moltbook, and the architecture of ClawdLab and Beach. Science, 2026. URL https://arxiv.org/abs/2602.19810

  76. [76]

    S2orc: The Semantic Scholar open research corpus

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2orc: The Semantic Scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.447. URL https://aclanthology.org/2020.acl-main.447/

  77. [77]

    Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

    Jason Priem, Heather Piwowar, and Richard Orr. Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, 2022. URL https://arxiv.org/abs/2205.01833

  78. [78]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.398. URL https://aclanthology.org/2023.emnlp-main.398/

  79. [79]

    Language agents achieve superhuman synthesis of scientific knowledge

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740

  80. [80]

    Elicit: AI for scientific research, n.d

    Elicit. Elicit: AI for scientific research, n.d. URL https://orion.elicit.com/. Undated product site; accessed 2026-03-21

Showing first 80 references.