
arXiv: 2605.17046 · v2 · pith:XI4ECRM4new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

Pith reviewed 2026-05-20 15:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords AI coding agents · machine learning benchmark · model training · single GPU · performance evaluation · autonomous research agents · time-constrained tasks

The pith

A new benchmark with seven ML tasks shows current AI coding agents differ substantially in their ability to design and train models from scratch on a single GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 1GC-7RC, a benchmark with seven tasks covering language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task supplies locked data-preparation and evaluation scripts so that agents may edit only the training code, with no internet access, no pretrained weights except one controlled case, and a fixed wall-clock budget on one GPU. Tests of seven agents across five runs each produce large performance gaps that track differences in implicit machine-learning knowledge, planning, and time management. The authors release the benchmark, harness, and results publicly to enable direct comparisons as agents evolve. The modular setup also supports adding new tasks or studying multi-agent work.
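The fixed wall-clock budget described above could be enforced roughly as follows. This is a minimal sketch and not the released harness: the function name `run_with_budget` and the command passed to it are assumptions made for illustration, and only the standard-library `subprocess.run` timeout is used.

```python
import subprocess
import sys
import time


def run_with_budget(cmd, budget_seconds):
    """Run cmd and stop it if it exceeds the wall-clock budget.

    Returns (finished, elapsed_seconds, return_code). return_code is None
    when the process was killed for running over budget.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, timeout=budget_seconds)
        return True, time.monotonic() - start, proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return False, time.monotonic() - start, None


# Hypothetical usage: a 40-minute budget for a script named train.py.
# finished, elapsed, code = run_with_budget([sys.executable, "train.py"], 40 * 60)
```

A monotonic clock is used so that system clock adjustments during a long run do not distort the measured elapsed time.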

Core claim

We introduce 1GC-7RC, a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management.

What carries the argument

The 1GC-7RC benchmark itself: it supplies locked data-preparation and evaluation scripts for seven ML tasks and restricts agents to changing only the training script, under single-GPU and time-budget rules.

If this is right

  • Agents differ in the depth of implicit machine-learning knowledge they apply to new tasks.
  • Planning ability and time-budget management vary markedly across the tested agents.
  • The modular benchmark design permits straightforward addition of new tasks or domains.
  • Public release of the harness supports reproducible head-to-head comparison of future agents.
  • The same setup can be used to study multi-agent collaboration on research problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Performance profiles on the benchmark could guide which agent to deploy for particular modeling projects.
  • Repeated evaluations over time could quantify progress in agents' capacity for independent model development.
  • Gaps in time management suggest that workflow-level planning is a distinct skill worth separate improvement in agent training.

Load-bearing premise

Locking data-preparation and evaluation scripts while allowing only training-code changes, with no-internet and no-pretrained-weights rules, gives a fair test of an agent's ability to design, implement, and train models from scratch.
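One way to check the "only the training code may change" rule is to hash the locked files before the agent starts and compare afterwards. The sketch below assumes this approach; the paper does not say how its harness enforces the lock, and the file names in the usage comment are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(workspace, locked_names):
    """Record a digest for every locked file before the agent starts."""
    return {name: sha256_of(Path(workspace) / name) for name in locked_names}


def changed_locked_files(workspace, before):
    """Return the sorted names of locked files that changed or disappeared."""
    changed = []
    for name, digest in before.items():
        path = Path(workspace) / name
        if not path.is_file() or sha256_of(path) != digest:
            changed.append(name)
    return sorted(changed)


# Hypothetical usage: prepare.py and evaluate.py are locked, train.py is not.
# before = snapshot("workspace", ["prepare.py", "evaluate.py"])
# ... agent edits files ...
# assert changed_locked_files("workspace", before) == []
```

A run whose check returns a non-empty list could be discarded, since its score would no longer be comparable with the others.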

What would settle it

Re-running the seven agents and finding that all reach similar high performance on every task, or that gaps disappear when pretrained weights or internet access are allowed, would indicate the benchmark does not measure meaningful differences in agent capability.

Figures

Figures reproduced from arXiv: 2605.17046 by Anna Bößendörfer, Fabian Deuser, Konrad Habel, Norbert Oswald, Robin-Nico Kampa.

Figure 1: Six ML domains covered by the seven 1GC-7RC tasks. NLP and computer vision (CV) are each …
Figure 2: Harness architecture. Each task runs in an isolated workspace with copied …
Figure 3: Mean training scripts per task (7 tasks × 5 runs) vs. 1GC-7RC Agg score (Eq. 1). Higher iteration counts do not predict higher scores; the dashed line indicates the anti-correlation.
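The Figure 3 caption reports an anti-correlation between iteration counts and scores. The page does not say which statistic the authors use, so the sketch below assumes a plain Pearson coefficient. The numbers are placeholders chosen for illustration, not results from the paper, and Eq. 1 (the Agg score) is not reproduced here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values, NOT results from the paper: one entry per agent.
mean_scripts_per_task = [2.0, 4.0, 6.0, 8.0]
agg_scores = [0.9, 0.7, 0.6, 0.2]
r = pearson_r(mean_scripts_per_task, agg_scores)
print(f"r = {r:.3f}")  # a negative r means more scripts go with lower scores
```

With only seven agents, a coefficient like this would be a rough descriptive summary rather than strong evidence of a trend.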
Original abstract

Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the 1GC-7RC benchmark consisting of seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task supplies locked data-preparation and evaluation scripts together with a baseline training script that agents may edit; agents operate under no-internet, no-pretrained-weights (with one controlled exception), and task-specific wall-clock limits on a single GPU. Seven coding agents (five proprietary, two open-source) are evaluated over five runs per agent-task pair, with the authors reporting substantial performance differences that they attribute to varying implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and artifacts are released publicly on GitHub with a modular design intended to support extensions.

Significance. If the reported performance differences are reproducible and the benchmark isolates the claimed capabilities, the work would supply a timely, reproducible platform for comparing autonomous coding agents on realistic single-GPU ML workloads. The public release of code and artifacts, together with the modular task structure, directly supports community reuse and extension to new domains or multi-agent settings.

major comments (2)
  1. [Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.
  2. [Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.
minor comments (2)
  1. The title could be revised to emphasize that the benchmark evaluates code adaptation under constraints rather than open-ended research problem solving.
  2. A summary table listing the seven tasks, their time budgets, and key dataset characteristics would improve readability and allow readers to gauge relative difficulty at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the specific revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.

    Authors: We agree that the benchmark supplies a baseline training script which agents may modify rather than requiring fully de-novo implementation of training loops. This design decision was made deliberately to ensure tasks remain solvable within the strict single-GPU wall-clock limits while still testing realistic ML engineering skills. Effective modification nevertheless requires diagnosing shortcomings in the baseline, planning which changes to prioritize, and managing the remaining time budget. We will revise the abstract and introduction to state more precisely that agents adapt and improve the supplied baselines, and we will note that the observed differences reflect capabilities in diagnosis, planning, and optimization under constraints. We do not claim purely from-scratch construction.
    Revision: yes

  2. Referee: [Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.

    Authors: We thank the referee for this observation. While the manuscript contains the underlying run-level data, we acknowledge that a consolidated table with per-agent, per-task means and standard deviations is not yet presented in a form that directly supports the abstract claim. We will add such a table (and associated confidence intervals where appropriate) to the results section and will reference the quantitative findings explicitly in the abstract and discussion.
    Revision: yes
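The consolidated table discussed in this exchange could be built from run-level scores along the following lines. This is a sketch under assumptions: the `(agent, task, score)` tuple layout and the function names are hypothetical, and the sample standard deviation is one reasonable choice of spread for five runs.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(runs):
    """Group run-level scores by (agent, task).

    `runs` is an iterable of (agent, task, score) tuples. Returns a dict
    mapping (agent, task) to (mean, sample_std, number_of_runs).
    """
    grouped = defaultdict(list)
    for agent, task, score in runs:
        grouped[(agent, task)].append(score)
    summary = {}
    for key, scores in grouped.items():
        # stdev needs at least two values; use NaN for a single run.
        spread = stdev(scores) if len(scores) > 1 else float("nan")
        summary[key] = (mean(scores), spread, len(scores))
    return summary


def format_table(summary):
    """Render the summary as plain-text rows sorted by agent, then task."""
    lines = [f"{'agent':<12}{'task':<12}{'mean':>8}{'std':>8}{'n':>4}"]
    for (agent, task), (m, s, n) in sorted(summary.items()):
        lines.append(f"{agent:<12}{task:<12}{m:>8.3f}{s:>8.3f}{n:>4}")
    return "\n".join(lines)
```

Reporting the run count next to each mean makes it visible when a pair has fewer than five completed runs, for example after a run that exceeded its time budget.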

Circularity Check

0 steps flagged

No circularity: benchmark and empirical results are self-contained

Full rationale

The paper introduces a new benchmark (1GC-7RC) with locked scripts and reports direct empirical performance differences across agents on seven tasks. No equations, fitted parameters, predictions, or derivations appear in the presented material. The central claims rest on experimental runs rather than any reduction to prior self-citations or self-definitions. The work is therefore self-contained with no load-bearing steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides insufficient detail to enumerate specific free parameters or axioms; the main design choices (task selection and time budgets) are presented as part of the contribution rather than fitted quantities.




Acknowledgments

The authors gratefully acknowledge the Institute for Distributed Intelligent Systems (ETTI 2) and the Institute for Autonomous Systems Technology (LRT 8.1) at the University of the Bundeswehr Munich for granting access t...