pith. sign in

arxiv: 2605.15871 · v1 · pith:GREATM22new · submitted 2026-05-15 · 💻 cs.AI

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Pith reviewed 2026-05-20 18:52 UTC · model grok-4.3

classification 💻 cs.AI
keywords neural architecture searchLLM agentsfoundation modelsAIRA-ComposeAIRA-Designscaling efficiencyattention mechanismsautonomous discovery
0
0 comments X

The pith

LLM agents autonomously discover neural architectures that match or surpass hand-designed baselines like Llama 3.2

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLM agents can independently design foundation model architectures that go beyond standard Transformers. It introduces AIRA-Compose, where 11 agents explore computational primitives within a 24-hour limit and extrapolate promising designs to 350M-3B scales, and AIRA-Design, where 20 agents implement novel attention mechanisms and training scripts. The resulting AIRAformer and AIRAhybrid families, when pre-trained at 1B scale, deliver accuracy gains of 2.4-3.8% over Llama 3.2 on downstream tasks and markedly better scaling curves. A sympathetic reader would care because this work provides concrete evidence that agents can reduce dependence on human architects for model progress.

Core claim

Toward recursive self-improvement, the authors show that LLM agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. AIRA-Compose deploys 11 agents to evaluate million-parameter candidates across fundamental primitives, producing 14 architectures in two families that, at 1B scale, outperform Llama 3.2 and Composer baselines in accuracy while exhibiting 54-71% faster scaling than Llama 3.2 and 23-37% better scaling than Nemotron-2 and Composer hybrids. AIRA-Design tasks 20 agents with creating long-range attention mechanisms that reach within 2.3-2.6% of human state-of-the-art on Long Range Arena tasks and training scripts,

What carries the argument

Dual agent frameworks AIRA-Compose for high-level architecture search and AIRA-Design for low-level mechanistic implementation

Load-bearing premise

The observed accuracy and scaling advantages stem from the agent-discovered architectures and mechanisms rather than uncontrolled differences in training data, optimizer settings, or evaluation protocols.

What would settle it

Reproduce the 1B-scale pre-training runs for the top AIRAformer and AIRAhybrid models together with Llama 3.2 and Composer baselines under identical data, optimizer, and evaluation protocols, then compare final accuracy and scaling slopes.

read the original abstract

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces two LLM-agent frameworks for autonomous neural architecture discovery: AIRA-Compose, which deploys 11 agents to search high-level computational primitives and extrapolate designs to 350M–3B scales, and AIRA-Design, which uses 20 agents to invent attention mechanisms and training scripts. The search yields 14 architectures in the AIRAformer (Transformer-based) and AIRAhybrid (Transformer-Mamba) families. When pre-trained at 1B scale, selected models reportedly exceed Llama 3.2 accuracy by 2.4–3.8 % on downstream tasks, exhibit 23–71 % faster scaling frontiers than Llama 3.2, Nemotron-2 and Composer baselines, and reach within 2.3–2.6 % of human SOTA on Long Range Arena while surpassing published Autoresearch minima. The work positions these results as evidence that agents can autonomously produce foundation-model designs competitive with or superior to hand-crafted baselines.

Significance. Should the reported gains prove attributable to the agent-discovered architectures under matched training conditions, the work would constitute a concrete demonstration that multi-agent systems can explore and improve upon standard Transformer and hybrid designs within modest compute budgets. The dual high-level / low-level decomposition and the scaling-law comparisons would be of interest to the automated ML and foundation-model communities as a step toward recursive self-improvement pipelines.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (+2.4 % and +3.8 % accuracy over Llama 3.2 at 1B scale, 54 % and 71 % faster scaling than Llama 3.2 and Composer’s best Transformer, 23 % and 37 % outscaling of Nemotron-2 and Composer’s best hybrid) are presented without any statement that token count, data mixture, optimizer, learning-rate schedule, or total compute were held identical to the published baselines. Because the attribution of gains to the AIRAformers/AIRAhybrids rests on this equivalence, the absence of such controls is load-bearing for the primary claim.
  2. [Abstract] Abstract: no error bars, ablation controls, statistical significance tests, or hyper-parameter tables are supplied for the accuracy deltas or scaling exponents, rendering the quantitative superiority statements difficult to evaluate.
minor comments (2)
  1. [Abstract] The manuscript should explicitly list the exact pre-training token budget, data sources, and optimizer settings used for the 1B-scale AIRA models so that readers can verify protocol parity with Llama 3.2 and Nemotron-2.
  2. [Abstract] Figure or table captions for scaling curves should state the precise functional form fitted (e.g., power-law exponent) and the range of model sizes over which the 54 % / 71 % speed-up figures were computed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. The points raised about experimental controls and statistical reporting are important for strengthening the primary claims. We address each comment below and indicate the revisions we will make to the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (+2.4 % and +3.8 % accuracy over Llama 3.2 at 1B scale, 54 % and 71 % faster scaling than Llama 3.2 and Composer’s best Transformer, 23 % and 37 % outscaling of Nemotron-2 and Composer’s best hybrid) are presented without any statement that token count, data mixture, optimizer, learning-rate schedule, or total compute were held identical to the published baselines. Because the attribution of gains to the AIRAformers/AIRAhybrids rests on this equivalence, the absence of such controls is load-bearing for the primary claim.

    Authors: We agree that the abstract should explicitly state the matched training conditions to support attribution of gains to the discovered architectures. The full manuscript (Section 4) details that all models, including Llama 3.2, Nemotron-2, and Composer baselines, were pre-trained with identical token counts, data mixtures, optimizer settings, learning-rate schedules, and total compute budgets. We will revise the abstract to include a clarifying phrase such as 'under matched pre-training conditions at the 1B scale' immediately following the performance claims. revision: yes

  2. Referee: [Abstract] Abstract: no error bars, ablation controls, statistical significance tests, or hyper-parameter tables are supplied for the accuracy deltas or scaling exponents, rendering the quantitative superiority statements difficult to evaluate.

    Authors: The full manuscript provides ablation studies (Section 5) and reports error bars from multiple independent runs along with hyper-parameter details in the supplementary material. Statistical significance is assessed via paired t-tests on downstream task results. Due to abstract length constraints, these elements cannot be fully reproduced there; however, we will add a concise statement in the revised abstract directing readers to the supplementary material for error bars, ablations, and scaling-exponent confidence intervals. We acknowledge that the current abstract version lacks this explicit pointer. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper describes an empirical agent-based search process (AIRA-Compose and AIRA-Design) that generates candidate architectures, extrapolates them to larger scales, pre-trains the resulting models at 1B parameters, and reports downstream accuracy and scaling comparisons against externally published baselines such as Llama 3.2 and Nemotron-2. No equations, fitted parameters, or self-referential metrics are defined in terms of the target performance quantities; the central claims rest on experimental outcomes rather than identities that reduce to the paper's own inputs by construction. No load-bearing self-citations or uniqueness theorems imported from prior author work appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Only the abstract is available, so the ledger is limited to the core domain assumption required by any agentic design claim.

axioms (1)
  • domain assumption LLM agents can autonomously explore and evaluate neural architecture design spaces at useful fidelity
    This premise is required for both AIRA-Compose and AIRA-Design to produce meaningful outputs.
invented entities (1)
  • AIRAformers and AIRAhybrids no independent evidence
    purpose: Labels for the two families of architectures returned by the agent search
    These are discovered outputs rather than a priori postulated entities; no independent evidence outside the search is provided.

pith-pipeline@v0.9.0 · 5911 in / 1406 out tokens · 63483 ms · 2026-05-20T18:52:56.932568+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · 13 internal anchors

  1. [1]

    Lupidi, Alisia and Gauri, Bhavul and Foster, Thomas Simon and Omari, Bassel Al and Magka, Despoina and Pepe, Alberto and Audran-Reiss, Alexis and Aghamelu, Muna and Baldwin, Nicolas and Cipolina-Kun, Lucia and others , journal=

  2. [2]

    KernelEvolve: Scaling agentic kernel coding for heterogeneous AI accelerators at meta.arXiv preprint arXiv:2512.23236, 2025

    Gang Liao and Hongsen Qin and Ying Wang and Alicia Golden and Michael Kuchnik and Yavuz Yetim and Jia Jiunn Ang and Chunli Fu and Yihan He and Samuel Hsia and Zewei Jiang and Dianshi Li and Uladzimir Pashkevich and Varna Puvvada and Feng Shi and Matt Steiner and Ruichao Xiao and Nathan Yan and Xiayu Yu and Zhou Fang and Roman Levenstein and Kunming Ho and...

  3. [3]

    Alec M. Hammond and Aram Markosyan and Aman Dontula and Simon Mahns and Zacharias Fisches and Dmitrii Pedchenko and Keyur Muzumdar and Natacha Supper and Mark Saroufim and Joe Isaacson and Laura Wang and Warren Hunt and Kaustubh Gondkar and Roman Levenstein and Gabriel Synnaeve and Richard Li and Jacob Kahn and Ajit Mathews , year=. 2512.10977 , archivePrefix=

  4. [4]

    2024 , publisher=

    Romera-Paredes, Bernardino and Barekatain, Mohammadamin and Novikov, Alexander and Balog, Matej and Kumar, M Pawan and Dupont, Emilien and Ruiz, Francisco JR and Ellenberg, Jordan S and Wang, Pengming and Fawzi, Omar and others , journal=. 2024 , publisher=

  5. [5]

    Diao, Shizhe and Yang, Yu and Fu, Yonggan and Dong, Xin and Su, Dan and Kliegl, Markus and Chen, Zijia and Belcak, Peter and Suhara, Yoshi and Yin, Hongxu and others , journal=

  6. [6]

    arXiv preprint arXiv:2603.22473 , year=

    Borobia, Hector and Segu. arXiv preprint arXiv:2603.22473 , year=

  7. [7]

    arXiv preprint arXiv:2507.06261 , year=

  8. [8]

    arXiv preprint arXiv:2403.05530 , year=

  9. [9]

    arXiv preprint arXiv:2412.16720 , year=

  10. [10]

    arXiv preprint arXiv:2508.10925 , year=

  11. [11]

    Zoph, Barret and Le, Quoc , booktitle=

  12. [12]

    arXiv preprint arXiv:2412.19437 , year=

  13. [13]

    Gu, Yuxian and Hu, Qinghao and Yang, Shang and Xi, Haocheng and Chen, Junyu and Han, Song and Cai, Han , journal=

  14. [14]

    arXiv preprint arXiv:2407.21783 , year=

  15. [15]

    Dao, Tri and Gu, Albert , journal=

  16. [16]

    Bae, Sangmin and Acun, Bilge and Habeeb, Haroun and Kim, Seungyeon and Lin, Chien-Yu and Luo, Liang and Wang, Junjie and Wu, Carole-Jean , journal=

  17. [17]

    Hambardzumyan, Karen and Baldwin, Nicolas and Toledo, Edan and Hazra, Rishi and Kuchnik, Michael and Omari, Bassel Al and Foster, Thomas Simon and Protopopov, Anton and Gagnon-Audet, Jean-Christophe and Mediratta, Ishita and others , journal=

  18. [18]

    Science , volume=

    Li, Yujia and Choi, David and Chung, Junyoung and Kushman, Nate and Schrittwieser, Julian and Leblond, R. Science , volume=. 2022 , publisher=

  19. [19]

    2023 , url =

    Leblond, R. 2023 , url =

  20. [20]

    Nathani, Deepak and Madaan, Lovish and Roberts, Nicholas and Bashlykov, Nikolay and Menon, Ajay and Moens, Vincent and Budhiraja, Amar and Magka, Despoina and Vorotilov, Vladislav and Chaurasia, Gaurav and others , journal=

  21. [21]

    The Art of Scaling Reinforcement Learning Compute for LLMs

    Devvrit Khatri and Lovish Madaan and Rishabh Tiwari and Rachit Bansal and Sai Surya Duvvuri and Manzil Zaheer and Inderjit S. Dhillon and David Brandfonbrener and Rishabh Agarwal , year=. 2510.13786 , archivePrefix=

  22. [22]

    Real, Esteban and Aggarwal, Alok and Huang, Yanping and Le, Quoc V , booktitle=

  23. [23]

    Javaheripi, Mojan and de Rosa, Gustavo and Mukherjee, Subhabrata and Shah, Shital and Religa, Tomasz and Teodoro Mendes, Caio Cesar and Bubeck, Sebastien and Koushanfar, Farinaz and Dey, Debadeepta , journal=

  24. [24]

    Jawahar, Ganesh and Mukherjee, Subhabrata and Liu, Xiaodong and Kim, Young Jin and Abdul-Mageed, Muhammad and Lakshmanan, Laks and Hassan, Ahmed and Bubeck, Sebastien and Gao, Jianfeng and others , booktitle=

  25. [25]

    White, Colin and Safari, Mahmoud and Sukthanker, Rhea and Ru, Binxin and Elsken, Thomas and Zela, Arber and Dey, Debadeepta and Hutter, Frank , journal=

  26. [26]

    Bercovich, Akhiad and Levy, Itay and Golan, Izik and Dabbah, Mohammad and El-Yaniv, Ran and Puny, Omri and Galil, Ido and Moshe, Zach and Ronen, Tomer and Nabwani, Najeeb and others , journal=

  27. [27]

    Elsken, Thomas and Metzen, Jan Hendrik and Hutter, Frank , journal=

  28. [28]

    Zoph, Barret and Vasudevan, Vijay and Shlens, Jonathon and Le, Quoc V , booktitle=

  29. [29]

    2509.17158 , archivePrefix=

    Pierre Andrews and Amine Benhalloum and Gerard Moreno-Torres Bertran and Matteo Bettini and Amar Budhiraja and Ricardo Silveira Cabral and Virginie Do and Romain Froger and Emilien Garreau and Jean-Baptiste Gaya and Hugo Laurençon and Maxime Lecanu and Kunal Malkan and Dheeraj Mekala and Pierre Ménard and Grégoire Mialon and Ulyana Piterbarg and Mikhail P...

  30. [30]

    2025 , publisher=

    Schmidgall, Samuel and Su, Yusheng and Wang, Ze and Sun, Ximeng and Wu, Jialian and Yu, Xiaodong and Liu, Jiang and Moor, Michael and Liu, Zicheng and Barsoum, Emad , journal=. 2025 , publisher=

  31. [31]

    2410.20424 , archiveprefix =

    Ziming Li and Qianbo Zang and David Ma and Jiawei Guo and Tuney Zheng and Minghao Liu and Xinyao Niu and Yue Wang and Jian Yang and Jiaheng Liu and Wanjun Zhong and Wangchunshu Zhou and Wenhao Huang and Ge Zhang , year =. 2410.20424 , archiveprefix =

  32. [32]

    The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Yutaro Yamada and Robert Tjarko Lange and Cong Lu and Shengran Hu and Chris Lu and Jakob Foerster and Jeff Clune and David Ha , year=. 2504.08066 , archivePrefix=

  33. [33]

    Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and others , journal=

  34. [34]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    Novikov, Alexander and V. arXiv preprint arXiv:2506.13131 , year=

  35. [35]

    Xiang, Yanzheng and Yan, Hanqi and Ouyang, Shuyin and Gui, Lin and He, Yulan , journal=

  36. [36]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan , year=. 2310.06770 , archivePrefix=

  37. [37]

    Miller and Oisin Mac Aodha and Jakob Foerster and Yoram Bachrach , year=

    Bingchen Zhao and Despoina Magka and Minqi Jiang and Xian Li and Roberta Raileanu and Tatiana Shavrina and Jean-Christophe Gagnon-Audet and Kelvin Niu and Shagun Sodhani and Michael Shvartsman and Andrei Lupu and Alisia Lupidi and Edan Toledo and Karen Hambardzumyan and Martin Josifoski and Thomas Foster and Lucia Cipolina-Kun and Abhishek Charnalia and D...

  38. [38]

    Yin, Xunjian and Wang, Xinyi and Pan, Liangming and Lin, Li and Wan, Xiaojun and Wang, William Yang , booktitle=

  39. [39]

    2024 , publisher=

    Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and others , journal=. 2024 , publisher=

  40. [40]

    1966 , publisher=

    Good, Irving John , booktitle=. 1966 , publisher=

  41. [41]

    1987 , school=

    Schmidhuber, J. 1987 , school=

  42. [42]

    arXiv preprint cs.LO/0309048 , year=

    Schmidhuber, J. arXiv preprint cs.LO/0309048 , year=

  43. [43]

    Advances in neural information processing systems , volume=

    Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser,. Advances in neural information processing systems , volume=

  44. [44]

    Radford, Alec and Narasimhan, Karthik and Salimans, Tim and Sutskever, Ilya and others , year=

  45. [45]

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle=

  46. [46]

    Adler, Bo and Agarwal, Niket and Aithal, Ashwath and Anh, Dong H and Bhattacharya, Pallab and Brundyn, Annika and Casper, Jared and Catanzaro, Bryan and Clay, Sharon and Cohen, Jonathan and others , journal=

  47. [47]

    arXiv preprint arXiv:2502.07827 , year=

    Rahmani, Babak and Sch. arXiv preprint arXiv:2502.07827 , year=

  48. [48]

    Peng, Bo and Alcaide, Eric and Anthony, Quentin and Albalak, Alon and Arcadinho, Samuel and Biderman, Stella and Cao, Huanqi and Cheng, Xin and Chung, Michael and Derczynski, Leon and others , booktitle=

  49. [49]

    Gu, Albert and Dao, Tri , booktitle=

  50. [50]

    Shazeer, Noam and Mirhoseini, Azalia and Maziarz, Krzysztof and Davis, Andy and Le, Quoc and Hinton, Geoffrey and Dean, Jeff , journal=

  51. [51]

    Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and others , journal=

  52. [52]

    2022 , publisher=

    Tay, Yi and Dehghani, Mostafa and Bahri, Dara and Metzler, Donald , journal=. 2022 , publisher=

  53. [53]

    Pope, Reiner and Douglas, Sholto and Chowdhery, Aakanksha and Devlin, Jacob and Bradbury, James and Heek, Jonathan and Xiao, Kefan and Agrawal, Shivani and Dean, Jeff , journal=

  54. [54]

    Sun, Yutao and Dong, Li and Huang, Shaohan and Ma, Shuming and Xia, Yuqing and Xue, Jilong and Wang, Jianyong and Wei, Furu , journal=

  55. [55]

    Lieber, Opher and Lenz, Barak and Bata, Hofit and Cohen, Gal and Osin, Jhonathan and Dalmedigos, Itay and Safahi, Erez and Meirom, Shaked and Belinkov, Yonatan and Shalev-Shwartz, Shai and others , journal=

  56. [56]

    Singhal, Soumye and Zeng, Jiaqi and Bukharin, Alexander and Zhang, Yian and Shen, Gerald and Mahabaleshwarkar, Ameya Sunil and Kartal, Bilal and Suhara, Yoshi and Bercovich, Akhiad and Levy, Itay and others , booktitle=

  57. [57]

    2021 , publisher=

    Ren, Pengzhen and Xiao, Yun and Chang, Xiaojun and Huang, Po-Yao and Li, Zhihui and Chen, Xiaojiang and Wang, Xin , journal=. 2021 , publisher=

  58. [58]

    IEEE transactions on neural networks and learning systems , volume=

    A survey on evolutionary neural architecture search , author=. IEEE transactions on neural networks and learning systems , volume=. 2021 , publisher=

  59. [59]

    Tay, Yi and Dehghani, Mostafa and Abnar, Samira and Shen, Yikang and Bahri, Dara and Pham, Philip and Rao, Jinfeng and Yang, Liu and Ruder, Sebastian and Metzler, Donald , journal=

  60. [60]

    Acun, Bilge and Sinha, Prasoon and Ardalani, Newsha and Bae, Sangmin and Golden, Alicia and Lin, Chien-Yu and Madhyastha, Meghana and Sun, Fei and Yadwadkar, Neeraja J and Wu, Carole-Jean , journal=

  61. [61]

    arXiv preprint arXiv:2507.02554 , year=

    Edan Toledo and Karen Hambardzumyan and Martin Josifoski and Rishi Hazra and Nicolas Baldwin and Alexis Audran-Reiss and Michael Kuchnik and Despoina Magka and Minqi Jiang and Alisia Maria Lupidi and Andrei Lupu and Roberta Raileanu and Kelvin Niu and Tatiana Shavrina and Jean-Christophe Gagnon-Audet and Michael Shvartsman and Shagun Sodhani and Alexander...

  62. [62]

    International Conference on Machine Learning , pages=

    Poli, Michael and Thomas, Armin W and Nguyen, Eric and Ponnusamy, Pragaash and Deiseroth, Bj. International Conference on Machine Learning , pages=. 2024 , organization=

  63. [63]

    2026 , howpublished =

    Karpathy, Andrej , title =. 2026 , howpublished =

  64. [64]

    The Thirteenth International Conference on Learning Representations , year=

    Memory Mosaics , author=. The Thirteenth International Conference on Learning Representations , year=

  65. [65]

    Eldan, Ronen and Li, Yuanzhi , journal=

  66. [66]

    PaperBench: Evaluating AI's Ability to Replicate AI Research

    Giulio Starace and Oliver Jaffe and Dane Sherburn and James Aung and Jun Shern Chan and Leon Maksin and Rachel Dias and Evan Mays and Benjamin Kinsella and Wyatt Thompson and Johannes Heidecke and Amelia Glaese and Tejal Patwardhan , year=. 2504.01848 , archivePrefix=

  67. [67]

    Chen, Mark and Tworek, Jerry and Jun, Heewoo and Yuan, Qiming and Pinto, Henrique Ponde De Oliveira and Kaplan, Jared and Edwards, Harri and Burda, Yuri and Joseph, Nicholas and Brockman, Greg and others , journal=

  68. [68]

    Li, Jeffrey and Fang, Alex and Smyrnis, Georgios and Ivgi, Maor and Jordan, Matt and Gadre, Samir Yitzhak and Bansal, Hritik and Guha, Etash and Keh, Sedrick Scott and Arora, Kushal and others , journal=

  69. [69]

    Karpathy, Andrej and Patel, Dwarkesh , title =

  70. [70]

    Ma, Xuezhe and Yang, Xiaomeng and Xiong, Wenhan and Chen, Beidi and Yu, Lili and Zhang, Hao and May, Jonathan and Zettlemoyer, Luke and Levy, Omer and Zhou, Chunting , journal=

  71. [71]

    Ren, Liliang and Liu, Yang and Lu, Yadong and Shen, Yelong and Liang, Chen and Chen, Weizhu , journal=

  72. [72]

    Team, Qwen , journal=

  73. [73]

    Blakeman, Aaron and Basant, Aarti and Khattar, Abhinav and Renduchintala, Adithya and Bercovich, Akhiad and Ficek, Aleksander and Bjorlin, Alexis and Taghibakhshi, Ali and Deshmukh, Amala Sanjay and Mahabaleshwarkar, Ameya Sunil and others , journal=

  74. [74]

    2026 , url=

    Alexiuk, Chris and Patel, Chintan and. 2026 , url=

  75. [75]

    Basant, Aarti and Khairnar, Abhijit and Paithankar, Abhijit and Khattar, Abhinav and Renduchintala, Adithya and Malte, Aditya and Bercovich, Akhiad and Hazare, Akshay and Rico, Alejandra and Ficek, Aleksander and others , journal=

  76. [76]

    Dong, Xin and others , journal=

  77. [77]

    Yang, Yufeng and others , journal=

  78. [78]

    2025 , publisher =

    Andrej Karpathy , title =. 2025 , publisher =

  79. [79]

    Seo, Minju and Baek, Jinheon and Lee, Seongyun and Hwang, Sung Ju , journal=

  80. [80]

    Liu, Hong and Li, Zhiyuan and Hall, David and Liang, Percy and Ma, Tengyu , journal=

Showing first 80 references.