pith. machine review for the scientific record.

arxiv: 2604.20622 · v1 · submitted 2026-04-22 · 💻 cs.AI · cs.LG · cs.MA

Recognition: unknown

pAI/MSc: ML Theory Research with Humans on the Loop

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:57 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA
keywords multi-agent systems · machine learning theory · human-in-the-loop AI · research workflow automation · manuscript drafting · open source tools · academic research

The pith

A modular multi-agent system reduces the human steering needed to produce ML theory manuscripts by orders of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents pAI/MSc as an open-source, modular multi-agent system for academic research in machine learning theory and related fields. Its central goal is to minimize the human involvement needed to steer the process from a stated hypothesis to a full manuscript that incorporates relevant literature, mathematical foundations, and experimental results and is ready for submission. Rather than pursuing fully autonomous research or idea generation, the system emphasizes a practical reduction in effort while retaining human oversight at key points. If effective, this could let researchers allocate more time to the creative aspects of inquiry instead of routine tasks like searching papers or setting up experiments.

Core claim

pAI/MSc is a customizable, modular multi-agent system that, given a hypothesis, produces a literature-grounded, mathematically established, experimentally supported, and submission-oriented manuscript draft with orders of magnitude less human steering than traditional workflows.

What carries the argument

The modular multi-agent architecture in pAI/MSc that distributes tasks across specialized agents for literature retrieval, mathematical reasoning, code execution for experiments, and text generation, all under human supervision.
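
To make that division of labor concrete, here is a minimal sketch of such a pipeline, assuming a shared-state design with a human sign-off between stages. The Agent class, stage names, and approval prompt are illustrative inventions, not the paper's code.

```python
# Hypothetical sketch of a modular multi-agent research pipeline with a
# human-on-the-loop checkpoint after each specialized agent. None of these
# names come from pAI/MSc itself.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    run: Callable[[dict], dict]  # reads shared state, returns updates

def human_approves(stage: str, state: dict) -> bool:
    # Human oversight point: a person signs off before the next stage runs.
    return input(f"[{stage}] accept output? (y/n) ").strip().lower() == "y"

def run_pipeline(hypothesis: str, agents: list[Agent]) -> dict:
    state = {"hypothesis": hypothesis}
    for agent in agents:
        while True:
            state.update(agent.run(state))
            if human_approves(agent.name, state):
                break  # otherwise the same agent reruns on the updated state
    return state

# Example wiring; in practice each lambda would wrap an LLM or tool call.
agents = [
    Agent("literature", lambda s: {"related_work": ["..."]}),
    Agent("theory",     lambda s: {"proof_sketch": "..."}),
    Agent("experiment", lambda s: {"results": {}}),
    Agent("writing",    lambda s: {"draft": "manuscript.tex"}),
]
```

The point of the shared-state design is that each specialized agent only reads and writes the common record, so modules can be swapped or re-ordered without rewiring the others.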

Load-bearing premise

That the current capabilities of large language models and agent coordination can accurately and reliably execute the steps of literature review, mathematical proof construction, and experimental validation with only minimal human corrections.

What would settle it

Running the system on a well-known ML theory hypothesis and having domain experts review the output draft for accuracy in citations, mathematical correctness, and experimental validity to determine if it meets submission standards without extensive revisions.
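
To make "less human steering" measurable rather than impressionistic, one hedged sketch of per-stage intervention accounting follows; the logging function and the baseline figure are assumptions for illustration, not numbers from the paper.

```python
# Hypothetical instrumentation for the proposed test: count every human
# correction per pipeline stage and compare against an assumed baseline
# for a fully manual workflow.
import math
from collections import Counter

interventions = Counter()

def log_intervention(stage: str) -> None:
    # Called whenever a human corrects, redirects, or overrides an agent.
    interventions[stage] += 1

# Simulated run: a handful of human touches across stages.
log_intervention("literature")
log_intervention("theory")
log_intervention("theory")

BASELINE = 500  # assumed intervention count for a traditional workflow
observed = sum(interventions.values())
reduction = math.log10(BASELINE / max(observed, 1))
print(f"{observed} interventions vs ~{BASELINE}: about 10^{reduction:.1f} fewer")
```

A log like this, gathered alongside the expert review of citations, proofs, and experiments, is what would turn the "orders of magnitude" phrase into a checkable number.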

Figures

Figures reproduced from arXiv: 2604.20622 by Mahmoud Abdelmoneum, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1
Figure 1. The pAI/MSc execution graph. Dashed violet borders mark counsel-eligible agents. Red dashed arrows are loopbacks triggered by gate failures. Theory and experiment tracks run in parallel when both are selected.
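
A minimal sketch of the control flow the caption describes, with hypothetical stage, gate, and feedback callables (none taken from the paper): a failed gate loops back with feedback, and the theory and experiment tracks run concurrently.

```python
# Sketch of a gated execution graph: each stage ends in a gate check, a
# failed gate triggers a loopback with feedback, and two tracks run in
# parallel. All names and gate logic here are assumptions.
from concurrent.futures import ThreadPoolExecutor

def run_with_gate(stage, gate, make_feedback, max_loops=3):
    # Loopback behavior (the red dashed arrows): rerun the stage with
    # feedback until the gate passes or the human is pulled in.
    feedback = None
    for _ in range(max_loops):
        out = stage(feedback)
        if gate(out):
            return out
        feedback = make_feedback(out)
    raise RuntimeError("gate kept failing; escalate to the human supervisor")

# Stub stages standing in for the two tracks; real ones would call agents.
theory_stage = lambda fb: {"proof": "sketch", "revised_from": fb}
exp_stage    = lambda fb: {"metrics": {"loss": 0.1}, "revised_from": fb}
gate_passes  = lambda out: True             # e.g. a verifier or counsel agent
feedback_for = lambda out: "tighten step 3"

# Theory and experiment tracks run in parallel when both are selected.
with ThreadPoolExecutor() as pool:
    theory = pool.submit(run_with_gate, theory_stage, gate_passes, feedback_for)
    experiment = pool.submit(run_with_gate, exp_stage, gate_passes, feedback_for)
    draft_inputs = {"theory": theory.result(), "experiment": experiment.result()}
```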
read the original abstract

We present pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows. Our goal is not autonomous scientific ideation, nor fully automated research. It is narrower and more practical: to reduce by orders of magnitude the human steering required to turn a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft. pAI/MSc is built with a current emphasis on machine learning theory and adjacent quantitative fields.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents pAI/MSc, an open-source, customizable, modular multi-agent system for academic research workflows with emphasis on machine learning theory. The stated goal is to reduce by orders of magnitude the human steering required to convert a specified hypothesis into a literature-grounded, mathematically established, experimentally supported, submission-oriented manuscript draft while keeping humans in the loop; the manuscript describes the system architecture but supplies no implementation details, experiments, or metrics.

Significance. If the claimed reduction in human intervention were demonstrated while preserving output quality, the system could meaningfully increase research throughput in quantitative fields. The open-source and modular design is a strength that would support reproducibility and extension by the community. In its current form, however, the manuscript offers only a high-level system description without evidence, so any significance assessment remains prospective.

major comments (1)
  1. Abstract: The central claim of an 'orders of magnitude' reduction in human steering for literature grounding, mathematical establishment, experiment design, and manuscript assembly is unsupported by any quantitative data, logged intervention counts, user studies, baseline comparisons, or worked examples. This renders the claim an untested design goal rather than a demonstrated property of the system.
minor comments (1)
  1. The manuscript would benefit from a dedicated section detailing the agent roles, communication protocols, and customization interfaces, as these are referenced only at a high level in the abstract.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for reviewing our manuscript on pAI/MSc. We appreciate your assessment of its potential significance and agree that additional clarification is needed regarding the system's claimed capabilities. We provide a point-by-point response to the major comment below.

read point-by-point responses
  1. Referee: Abstract: The central claim of an 'orders of magnitude' reduction in human steering for literature grounding, mathematical establishment, experiment design, and manuscript assembly is unsupported by any quantitative data, logged intervention counts, user studies, baseline comparisons, or worked examples. This renders the claim an untested design goal rather than a demonstrated property of the system.

    Authors: We thank the referee for this observation. The manuscript indeed presents pAI/MSc primarily as a system architecture and design, with the reduction in human steering stated as the core objective enabled by its customizable multi-agent framework. No quantitative evaluations, such as intervention counts or user studies, are included because the current work focuses on describing the system rather than evaluating its performance metrics. We will revise the abstract and relevant sections to explicitly characterize the 'orders of magnitude' reduction as a design goal and intended benefit, rather than a demonstrated result. Additionally, we will expand on implementation details, provide worked examples of the workflow where possible, and outline plans for future empirical validation to address this concern. revision: yes

Circularity Check

0 steps flagged

No derivations, predictions, or equations; a system-description paper presents no circularity to audit

full rationale

The manuscript is a descriptive account of an open-source multi-agent architecture for research assistance. It states design goals (reducing human steering by orders of magnitude) but supplies no equations, fitted parameters, uniqueness theorems, self-citations used as load-bearing premises, or renamings of empirical patterns. The central claim is an untested assertion about future performance rather than a derivation that reduces to its own inputs. No load-bearing step matches any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The preprint describes a software system for research assistance rather than a theoretical model with fitted parameters, axioms, or new scientific entities. No mathematical derivations or empirical claims requiring such elements are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5377 in / 1211 out tokens · 49867 ms · 2026-05-09T23:57:06.983235+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

103 extracted references · 55 canonical work pages · 9 internal anchors

  1. [1]

    Mathematical discoveries from program search with large language models

    Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, 2024. doi: 10.1038/s41586-023-06924...

  2. [2]

    Funsearch

    Google DeepMind. Funsearch. GitHub repository, 2023. URL https://github.com/google-deepmind/funsearch. Repository accompanying the FunSearch Nature paper; accessed 2026-03-21

  3. [3]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  4. [4]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale. arXiv preprint arXiv:2511.02864, 2025. doi: 10.48550/arXiv.2511.02864. URL https://arxiv.org/abs/2511.02864

  5. [5]

    Mathematical problem repository for alphaevolve

    Google DeepMind. Mathematical problem repository for alphaevolve. GitHub repository, 2025. URL https://github.com/google-deepmind/alphaevolve_repository_of_problems. Repository accompanying the Mathematical exploration and discovery at scale preprint; accessed 2026-03-21

  6. [6]

    Reinforced Generation of Combinatorial Structures: Ramsey Numbers

    Ansh Nagda, Prabhakar Raghavan, and Abhradeep Thakurta. Reinforced generation of combinatorial structures: Ramsey numbers. arXiv preprint arXiv:2603.09172, 2026. doi: 10.48550/arXiv.2603.09172. URL https://arxiv.org/abs/2603.09172

  7. [7]

    Donald E. Knuth. Claude’s cycles. Informal note / PDF on Knuth’s preprints page, February 2026. URL https://cs.stanford.edu/~knuth/papers/claude-cycles.pdf. Dated 2026-02-28; revised 2026-03-16

  8. [8]

    The story of Erdős problem #1026

    Terence Tao. The story of Erdős problem #1026. Blog post on What’s New, December 2025. URL https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/. Published 2025-12-08

  9. [9]

    First proof

    Mohammed Abouzaid, Andrew J. Blumberg, Martin Hairer, Joe Kileel, Tamara G. Kolda, Paul D. Nelson, Daniel Spielman, Nikhil Srivastava, Rachel Ward, Shmuel Weinberger, and Lauren Williams. First proof. arXiv preprint arXiv:2602.05192, 2026. doi: 10.48550/arXiv.2602.05192. URL https://arxiv.org/abs/2602.05192

  10. [10]

    First batch

    First Proof Project. First batch. Project website, February 2026. URL https://1stproof.org/first-batch.html. First-batch page; site lists February 2026 release context; accessed 2026-03-21

  11. [11]

    Our first proof submissions

    OpenAI. Our first proof submissions. OpenAI research page, February 2026. URL https://openai.com/index/first-proof-submissions/. Published 2026-02-20

  12. [12]

    Lean 4 formal verification of 8/10 #1stproof problems: Complete proofs with ai–human pipeline, partial qed for q4 & q6

    Wenlin Zhang and Haobo Ma. Lean 4 formal verification of 8/10 #1stproof problems: Complete proofs with ai–human pipeline, partial qed for q4 & q6. Zenodo preprint, February 2026. URL https://zenodo.org/records/18635744. Created 2026-02-13. Zenodo also lists a second record with the same title and metadata at DOI 10.5281/zenodo.18635110

  13. [13]

    Advancing science and math with gpt-5.2

    OpenAI. Advancing science and math with gpt-5.2. OpenAI publication, December 2025. URL https://openai.com/index/gpt-5-2-for-science-and-math. Published 2025-12-11

  14. [14]

    On learning-curve monotonicity for maximum likelihood estimators

    Mark Sellke and Steven Yin. On learning-curve monotonicity for maximum likelihood estimators. arXiv preprint arXiv:2512.10220, 2025. doi: 10.48550/arXiv.2512.10220. URL https://arxiv.org/abs/2512.10220

  15. [15]

    Introducing gauss, an agent for autoformalization

    Math, Inc. Introducing gauss, an agent for autoformalization. Company blog post, n.d. URL https://www.math.inc/gauss. Undated page; accessed 2026-03-21

  16. [16]

    Strong pnt

    Math, Inc. Strong pnt. Project page, n.d. URL https://math-inc.github.io/strongpnt/. Undated page; accessed 2026-03-21

  17. [17]

    strongpnt

    Math, Inc. strongpnt. GitHub repository, n.d. URL https://github.com/math-inc/strongpnt. Repository for the Strong PNT formalization; accessed 2026-03-21

  18. [18]

    Gauss – an agentic formalization of the prime number theorem

    Jared Duker Lichtman. Gauss – an agentic formalization of the prime number theorem. Fields Institute talk page, October 2025. URL https://www.fields.utoronto.ca/talks/Gauss-agentic-formalization-Prime-Number-Theorem. Talk date: 2025-10-28

  19. [19]

    Resolution of Erdős problem #728: a writeup of Aristotle’s Lean proof

    Nat Sothanaphan. Resolution of Erdős problem #728: a writeup of Aristotle’s Lean proof. arXiv preprint arXiv:2601.07421, 2026. doi: 10.48550/arXiv.2601.07421. URL https://arxiv.org/abs/2601.07421

  20. [20]

    Today marks a momentous milestone for ai and mathematics

    Harmonic. Today marks a momentous milestone for ai and mathematics. X post, January 2026. URL https://x.com/HarmonicMath/status/2008693723413225814. Posted 2026-01-06; dynamic-source metadata should be rechecked before camera-ready copy if cited in the main text

  21. [21]

    Thomas F. Bloom. Erdős problem #728. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/728. Page last edited 2026-01-06; accessed 2026-03-21

  22. [22]

    Thomas F. Bloom. Erdős problem #729. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/729. Page last edited 2026-01-11; accessed 2026-03-21

  23. [23]

    Thomas F. Bloom. Erdős problem #397. ErdosProblems.com entry, January 2026. URL https://www.erdosproblems.com/397. Page last edited 2026-01-12; accessed 2026-03-21

  24. [24]

    Erdős problem database

    teorth. Erdős problem database. GitHub repository, n.d. URL https://github.com/teorth/erdosproblems. Repository README accessed 2026-03-21

  25. [25]

    gpt-5 has solved an unsolved mathematical problem,

    GIGAZINE. An openai researcher posted that “gpt-5 has solved an unsolved mathematical problem,” but it turned out that the problem had already been solved, leading to ridicule from rival developers, including google deepmind ceo demis hassabis. News article, October 2025. URL https://gigazine.net/gsc_news/en/20251020-openai-researcher-announced-gpt-5-ma...

  26. [26]

    Gênant: Openai beweert dat chatgpt wiskundeproblemen oplost, maar dat klopt niet

    Erwin Vogelaar. Gênant: Openai beweert dat chatgpt wiskundeproblemen oplost, maar dat klopt niet [Embarrassing: OpenAI claims that ChatGPT solves math problems, but that is not true]. Bright.nl news article, October 2025. URL https://www.bright.nl/nieuws/1703437/g-nant-openai-beweert-dat-chatgpt-wiskundeproblemen-oplost-maar-dat-klopt-niet.html. Published 2025-10-20; accessed 2026-03-21

  27. [27]

    Vibe physics: The AI grad student

    Matthew D. Schwartz. Vibe physics: The AI grad student. Anthropic Science Blog, March 2026. URL https://www.anthropic.com/research/vibe-physics. Accessed: 2026-03-24

  28. [28]

    Resummation of the c-parameter Sudakov shoulder using effective field theory

    Matthew D. Schwartz. Resummation of the c-parameter Sudakov shoulder using effective field theory. arXiv preprint arXiv:2601.02484, 2026. doi: 10.48550/arXiv.2601.02484. URL https://arxiv.org/abs/2601.02484

  29. [29]

    Troubling trends in machine learning scholarship

    Zachary C. Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship. Queue, 17(1), 2019. doi: 10.1145/3317287.3328534. URL https://doi.org/10.1145/3317287.3328534. ACM Queue article; multiple secondary indexes report pages 45–77, but page/article-number formatting varies across services, so pages are omitted here deliberately

  30. [30]

    troubling trends in machine learning scholarship

    Andrew Gelman. “troubling trends in machine learning scholarship”. Statistical Modeling, Causal Inference, and Social Science blog, September 2019. URL https://statmodeling.stat.columbia.edu/2019/09/30/troubling-trends-in-machine-learning-scholarship/. Blog commentary pointing to Lipton and Steinhardt and discussing hype, “provably” language, and advert...

  31. [31]

    Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program)

    Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d’Alché Buc, Emily Fox, and Hugo Larochelle. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research, 22(164):1–20, 2021. URL https://www.jmlr.org/papers/v22/20...

  32. [32]

    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili...

  33. [33]

    Study identifies weaknesses in how AI systems are evaluated

    Oxford Internet Institute. Study identifies weaknesses in how AI systems are evaluated. Press release, November 2025. URL https://www.oii.ox.ac.uk/news-events/study-identifies-weaknesses-in-how-ai-systems-are-evaluated/. Press release accompanying the benchmark-validity study; includes quoted claims about unclear definitions, weak methods, and misleadin...

  34. [34]

    Hidden technical debt in machine learning systems

    D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems 28, pages 2503–2511, 2015. URL https://papers.nips.cc/paper/5656-hidden-technical-debt-in-mac...

  35. [35]

    Deep reinforcement learning that matters

    Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, pages 3207–3214, 2018. doi: 10.1609/AAAI.V32I1.11694. URL https://doi.org/10.1609/AAAI.V32I1.11694

  36. [36]

    Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations

    Nick McGreivy and Ammar Hakim. Weak baselines and reporting biases lead to overoptimism in machine learning for fluid-related partial differential equations. Nature Machine Intelligence, 6(10):1256–1269, 2024. doi: 10.1038/s42256-024-00897-5. URL https://doi.org/10.1038/s42256-024-00897-5

  37. [37]

    Shortcut learning in deep neural networks

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard S. Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020. doi: 10.1038/s42256-020-00257-z. URL https://doi.org/10.1038/s42256-020-00257-z

  38. [38]

    deep-significance — easy and meaningful statistical significance testing in the age of neural networks

    Dennis Ulmer, Christian Hardmeier, and Jes Frellsen. deep-significance — easy and meaningful statistical significance testing in the age of neural networks, 2022. URL https://doi.org/10.48550/arXiv.2204.06815. arXiv preprint; also listed as a contribution to the ML Evaluation Standards Workshop at ICLR 2022 in institutional repositories

  39. [39]

    A causal perspective on dataset bias in machine learning for medical imaging

    Charles Jones, Daniel C. Castro, Fabio De Sousa Ribeiro, Ozan Oktay, Melissa McCradden, and Ben Glocker. A causal perspective on dataset bias in machine learning for medical imaging. Nature Machine Intelligence, 6:138–146, 2024. doi: 10.1038/s42256-024-00797-8. URL https://doi.org/10.1038/s42256-024-00797-8

  40. [40]

    Can robots do epidemiology? Machine learning, causal inference, and predicting the outcomes of public health interventions

    Alex Broadbent and Thomas Grote. Can robots do epidemiology? Machine learning, causal inference, and predicting the outcomes of public health interventions. Philosophy & Technology, 35:14, 2022. doi: 10.1007/s13347-022-00509-3. URL https://doi.org/10.1007/s13347-022-00509-3. Springer presents this as volume 35, article number 14; issue and expanded page-ran...

  41. [41]

    Reporting of artificial intelligence prediction models

    Gary S. Collins and Karel G. M. Moons. Reporting of artificial intelligence prediction models. The Lancet, 393(10181):1577–1579, 2019. doi: 10.1016/S0140-6736(19)30037-6. URL https://doi.org/10.1016/S0140-6736(19)30037-6

  42. [42]

    AI Snake Oil

    Liz Fuller-Wright. “AI Snake Oil”: A Conversation with Princeton AI Experts Arvind Narayanan and Sayash Kapoor. Princeton University News, December 2024. URL https://www.princeton.edu/news/2024/12/18/ai-snake-oil-conversation-princeton-ai-experts-arvind-narayanan-and-sayash-kapoor. Interview/article quoting Narayanan and Kapoor on AI that does not work a...

  43. [43]

    Autoresearch

    Andrej Karpathy. Autoresearch. GitHub repository, 2026. URL https://github.com/karpathy/autoresearch/blob/master/program.md. Repository documentation in program.md; accessed 2026-03-26

  44. [44]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=1Fs1LvjYQW

  45. [45]

    The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292, 2024. doi: 10.48550/arXiv.2408.06292. URL https://arxiv.org/abs/2408.06292

  46. [46]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pag...

  47. [47]

    Mlr-bench: Evaluating AI agents on open-ended machine learning research

    Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. Mlr-bench: Evaluating AI agents on open-ended machine learning research. arXiv preprint arXiv:2505.19955, 2025. doi: 10.48550/arXiv.2505.19955. URL https://arxiv.org/abs/2505.19955

  48. [48]

    Defining and identifying sleeping beauties in science

    Qing Ke, Emilio Ferrara, Filippo Radicchi, and Alessandro Flammini. Defining and identifying sleeping beauties in science. Proceedings of the National Academy of Sciences, 112(24):7426–7431, 2015. doi: 10.1073/pnas.1424329112. URL https://www.pnas.org/doi/10.1073/pnas.1424329112

  49. [49]

    Bibliometrics: The Leiden manifesto for research metrics

    Diana Hicks, Paul Wouters, Ludo Waltman, Sarah de Rijcke, and Ismael Rafols. Bibliometrics: The leiden manifesto for research metrics. Nature, 520(7548):429–431, 2015. doi: 10.1038/520429a. URL https://www.nature.com/articles/520429a

  50. [50]

    The metric tide: Report of the independent review of the role of metrics in research assessment and management

    James Wilsdon, Liz Allen, Eleonora Belfiore, Philip Campbell, Stephen Curry, Steven Hill, Richard Jones, Jude Hill, Roger Kain, Ben Johnson, Simon Kerridge, Jane Tinkler, Mike Thelwall, Paul Wouters, and Ian Viney. The metric tide: Report of the independent review of the role of metrics in research assessment and management. Technical report, Higher Ed...

  51. [51]

    URL https://hdl.handle.net/10779/uos.23418680

  52. [52]

    Over-optimization of academic publishing metrics: Observing goodhart’s law in action.GigaScience, 8(6):giz053, 2019

    Michael Fire and Carlos Guestrin. Over-optimization of academic publishing metrics: Observing goodhart’s law in action. GigaScience, 8(6):giz053, 2019. doi: 10.1093/gigascience/giz053. URL https://doi.org/10.1093/gigascience/giz053

  53. [53]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2024

  54. [54]

    Citebench: A benchmark for scientific citation text generation, 2022

    Martin Funkquist, Ilia Kuznetsov, Yufang Hou, and Iryna Gurevych. Citebench: A benchmark for scientific citation text generation, 2022. URL https://arxiv.org/abs/2212.09577. Using the arXiv submission year; later bibliographic records may surface under 2023 metadata updates

  55. [55]

    Chatcite: LLM agent with human workflow guidance for comparative literature summary, 2024

    Yutong Li, Lu Chen, Aiwei Liu, Kai Yu, and Lijie Wen. Chatcite: LLM agent with human workflow guidance for comparative literature summary, 2024. URL https://arxiv.org/abs/2403.02574

  56. [56]

    Scholarcopilot: Training large language models for academic writing with accurate citations, 2025

    Yubo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, and Wenhu Chen. Scholarcopilot: Training large language models for academic writing with accurate citations, 2025. URL https://arxiv.org/abs/2504.00824

  57. [57]

    Surveygen: Quality-aware scientific survey generation with large language models, 2025

    Tong Bao, Mir Tafseer Nayeem, Davood Rafiei, and Chengzhi Zhang. Surveygen: Quality-aware scientific survey generation with large language models, 2025. URL https://arxiv.org/abs/2508.17647

  58. [58]

    Overleafcopilot: Empowering academic writing in Overleaf with large language models, 2024

    Haomin Wen, Zhenjie Wei, Yan Lin, Jiyuan Wang, Yuxuan Liang, and Huaiyu Wan. Overleafcopilot: Empowering academic writing in Overleaf with large language models, 2024. URL https://arxiv.org/abs/2403.09733

  59. [59]

    Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing

    Junyi Hou, Huikai Andre Lin, Nuo Chen, Yiwei Gong, and Bingsheng He. Paperdebugger: A plugin-based multi-agent system for in-editor academic writing, review, and editing, 2025. URL https://arxiv.org/abs/2512.02589

  60. [60]

    Autonomous LLM-driven research – from data to human-verifiable research papers

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous LLM-driven research — from data to human-verifiable research papers. NEJM AI, 2(1), 2025. doi: 10.1056/AIoa2400555. URL https://ai.nejm.org/doi/10.1056/AIoa2400555

  61. [61]

    The AI scientist: Towards fully automated open-ended scientific discovery, 2024

    Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery, 2024. URL https://arxiv.org/abs/2408.06292

  62. [62]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist-v2: Workshop-level automated scientific discovery via agentic tree search, 2025. URL https://arxiv.org/abs/2504.08066

  63. [63]

    Agent laboratory: Using LLM agents as research assistants

    Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using LLM agents as research assistants. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.fi...

  64. [64]

    Cycleresearcher: Improving automated research via automated review

    Yixuan Weng, Minjun Zhu, Guangsheng Bao, Hongbo Zhang, Jindong Wang, Yue Zhang, and Linyi Yang. Cycleresearcher: Improving automated research via automated review, 2024. URL https://arxiv.org/abs/2411.00816. First submitted in 2024; later revised in 2025

  65. [65]

    AI-researcher: Autonomous scientific innovation

    Jiabin Tang, Lianghao Xia, Zhonghang Li, and Chao Huang. AI-researcher: Autonomous scientific innovation, 2025. URL https://arxiv.org/abs/2505.18705

  66. [66]

    Agentrxiv: Towards collaborative autonomous research

    Samuel Schmidgall and Michael Moor. Agentrxiv: Towards collaborative autonomous research, 2025. URL https://arxiv.org/abs/2503.18102

  67. [67]

    Build your personalized research group: A multiagent framework for continual and interactive science automation

    Ed Li, Junyu Ren, Xintian Pan, Cat Yan, Chuanhao Li, Dirk Bergemann, and Zhuoran Yang. Build your personalized research group: A multiagent framework for continual and interactive science automation,

  68. [68]

    URL https://arxiv.org/abs/2510.15624

  69. [69]

    Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

  70. [70]

    Internagent: When agent becomes the scientist–building closed-loop system from hypothesis to verification

    InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, and Lei Bai. Internagent: When a...

  71. [71]

    Internagent-1.5: A unified agentic framework for long-horizon autonomous scientific discovery

    Shiyang Feng, Runmin Ma, Xiangchao Yan, Yue Fan, Yusong Hu, Songtao Huang, Shuaiyu Zhang, Zongsheng Cao, Tianshuo Peng, Jiakang Yuan, Zijie Guo, Zhijie Zhong, Shangheng Du, Weida Wang, Jinxin Shi, Yuhao Zhou, Xiaohan He, Zhiyin Yu, Fangchen Yu, Qihao Zheng, Jiamin Wu, Mianxin Liu, Chi Zhang, Shaowei Hou, Shuya Li, Yankai Jiang, Wenjie Lou, Lilong Wang, Zi...

  72. [72]

    Alexander Novikov, Ngân Vũ, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Kumar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. Alphaevolve: A coding agent for scientific and al...

  73. [73]

    Mathematical exploration and discovery at scale

    Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. Mathematical exploration and discovery at scale, 2025. URL https://arxiv.org/abs/2511.02864

  74. [74]

    Deepinnovator: Triggering the innovative capabilities of LLMs

    Tianyu Fan, Fengji Zhang, Yuxiang Zheng, Bei Chen, Xinyao Niu, Chengen Huang, Junyang Lin, and Chao Huang. Deepinnovator: Triggering the innovative capabilities of LLMs, 2026. URL https://arxiv.org/abs/2602.18920

  75. [75]

    From agent-only social networks to autonomous scientific research: Lessons from OpenClaw and Moltbook, and the architecture of ClawdLab and Beach

    Lukas Weidener, Marko Brkić, Phillip Lee, Martin Karlsson, Kevin Noessler, and Paul Kohlhaas. From agent-only social networks to autonomous scientific research: Lessons from OpenClaw and Moltbook, and the architecture of ClawdLab and Beach. Science, 2026. URL https://arxiv.org/abs/2602.19810

  76. [76]

    S2orc: The Semantic Scholar open research corpus

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2orc: The Semantic Scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.acl-main.447. URL https://aclanthology.org/2020.acl-main.447/

  77. [77]

    Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts

    Jason Priem, Heather Piwowar, and Richard Orr. Openalex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, 2022. URL https://arxiv.org/abs/2205.01833

  78. [78]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.398. URL https://aclanthology.org/2023.emnlp-main.398/

  79. [79]

    Language agents achieve superhuman synthesis of scientific knowledge

    Michael D. Skarlinski, Sam Cox, Jon M. Laurent, James D. Braza, Michaela Hinks, Michael J. Hammerling, Manvitha Ponnapati, Samuel G. Rodriques, and Andrew D. White. Language agents achieve superhuman synthesis of scientific knowledge, 2024. URL https://arxiv.org/abs/2409.13740

  80. [80]

    Elicit: AI for scientific research, n.d

    Elicit. Elicit: AI for scientific research, n.d. URL https://orion.elicit.com/. Undated product site; accessed 2026-03-21

Showing first 80 references.