1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
Pith reviewed 2026-05-20 15:40 UTC · model grok-4.3
The pith
A new benchmark with seven ML tasks shows current AI coding agents differ substantially in their ability to design and train models from scratch on a single GPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce 1GC-7RC, a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management.
What carries the argument
1GC-7RC benchmark, which supplies locked data-preparation and evaluation scripts for seven ML tasks and restricts agents to changes only in the training script under single-GPU and time-budget rules.
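The two mechanical constraints described here (locked scripts and a wall-clock budget) can be illustrated with a short sketch. This is not the released harness: the function names, the file name `evaluate.py`, and the use of SHA-256 digests to detect edits are all assumptions made for illustration.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(locked_files: list[Path]) -> dict[Path, str]:
    """Record digests of the locked scripts before the agent runs."""
    return {p: sha256_of(p) for p in locked_files}


def tampered(before: dict[Path, str]) -> list[Path]:
    """List locked files that changed or vanished since the snapshot."""
    return [p for p, digest in before.items()
            if not p.exists() or sha256_of(p) != digest]


def run_with_budget(cmd: list[str], budget_seconds: float) -> str:
    """Run a command and report 'ok', 'failed', or 'timeout'."""
    try:
        done = subprocess.run(cmd, timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return "timeout"
    return "ok" if done.returncode == 0 else "failed"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        evaluate = Path(tmp) / "evaluate.py"  # hypothetical locked script
        evaluate.write_text("print('score: 0.0')\n")
        before = snapshot([evaluate])
        # Stand-in for the agent's training run: a trivial Python command.
        status = run_with_budget([sys.executable, "-c", "pass"], budget_seconds=60)
        print(status, "tampered:", tampered(before))
```

A run would count as valid only if the status is `ok` and the tampered list is empty; how the actual benchmark enforces these rules is not stated in the text above.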
If this is right
- Agents differ in the depth of implicit machine-learning knowledge they apply to new tasks.
- Planning ability and time-budget management vary markedly across the tested agents.
- The modular benchmark design permits straightforward addition of new tasks or domains.
- Public release of the harness supports reproducible head-to-head comparison of future agents.
- The same setup can be used to study multi-agent collaboration on research problems.
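To make the time-budget point concrete, here is a minimal sketch of a training loop that stops while a reserve is still left for checkpointing and evaluation. The function name, the `reserve_seconds` parameter, and the "slowest step so far" heuristic are assumptions for illustration, not something the paper prescribes.

```python
import time
from typing import Callable


def train_within_budget(
    step: Callable[[], float],
    budget_seconds: float,
    reserve_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[int, float]:
    """Run training steps until only the reserve remains.

    Returns (steps_completed, last_loss). Before each step the loop checks
    whether one more step, assumed to take as long as the slowest step seen
    so far, would cut into the reserve kept for evaluation.
    """
    start = clock()
    deadline = start + budget_seconds - reserve_seconds
    steps, last_loss, slowest = 0, float("nan"), 0.0
    while True:
        now = clock()
        if now + slowest > deadline:
            break
        last_loss = step()
        slowest = max(slowest, clock() - now)
        steps += 1
    return steps, last_loss
```

The `clock` argument is injectable so the stopping rule can be checked with a fake clock instead of waiting for real time to pass. An agent that skips this kind of check risks hitting the wall-clock limit with no evaluated model.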
Where Pith is reading between the lines
- Performance profiles on the benchmark could guide which agent to deploy for particular modeling projects.
- Repeated evaluations over time could quantify progress in agents' capacity for independent model development.
- Gaps in time management suggest that workflow-level planning is a distinct skill worth separate improvement in agent training.
Load-bearing premise
Locking data-preparation and evaluation scripts while allowing only training-code changes, with no-internet and no-pretrained-weights rules, gives a fair test of an agent's ability to design, implement, and train models from scratch.
What would settle it
Re-running the seven agents and finding that all reach similar high performance on every task, or that gaps disappear when pretrained weights or internet access are allowed, would indicate the benchmark does not measure meaningful differences in agent capability.
Original abstract
Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the 1GC-7RC benchmark consisting of seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task supplies locked data-preparation and evaluation scripts together with a baseline training script that agents may edit; agents operate under no-internet, no-pretrained-weights (with one controlled exception), and task-specific wall-clock limits on a single GPU. Seven coding agents (five proprietary, two open-source) are evaluated over five runs per agent-task pair, with the authors reporting substantial performance differences that they attribute to varying implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and artifacts are released publicly on GitHub with a modular design intended to support extensions.
Significance. If the reported performance differences are reproducible and the benchmark isolates the claimed capabilities, the work would supply a timely, reproducible platform for comparing autonomous coding agents on realistic single-GPU ML workloads. The public release of code and artifacts, together with the modular task structure, directly supports community reuse and extension to new domains or multi-agent settings.
major comments (2)
- [Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.
- [Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.
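As an illustration of the statistics requested in the second major comment, the sketch below computes the mean, sample standard deviation, and a two-sided 95% Student-t confidence interval for one agent-task cell. The function name is hypothetical, the interval assumes roughly normal run-to-run noise, and the scores in the usage line are made-up placeholders, not results from the paper.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def summarize(scores: list[float]) -> tuple[float, float, tuple[float, float]]:
    """Return (mean, sample std, 95% CI) for the runs of one agent-task pair."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    half_width = T_CRIT_95[n - 1] * std / math.sqrt(n)
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder scores for five runs; NOT taken from the paper.
mean, std, (low, high) = summarize([0.70, 0.72, 0.68, 0.71, 0.69])
print(f"{mean:.3f} ± {std:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

With only five runs the critical value is 2.776 rather than the familiar 1.96, so the intervals are wide. This is why the reported gaps need variance measures before they can be read as capability differences.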
minor comments (2)
- The title could be revised to emphasize that the benchmark evaluates code adaptation under constraints rather than open-ended research problem solving.
- A summary table listing the seven tasks, their time budgets, and key dataset characteristics would improve readability and allow readers to gauge relative difficulty at a glance.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the specific revisions we plan to incorporate.
- Referee: [Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.
  Authors: We agree that the benchmark supplies a baseline training script which agents may modify rather than requiring fully de-novo implementation of training loops. This design decision was made deliberately to ensure tasks remain solvable within the strict single-GPU wall-clock limits while still testing realistic ML engineering skills. Effective modification nevertheless requires diagnosing shortcomings in the baseline, planning which changes to prioritize, and managing the remaining time budget. We will revise the abstract and introduction to state more precisely that agents adapt and improve the supplied baselines, and we will note that the observed differences reflect capabilities in diagnosis, planning, and optimization under constraints. We do not claim purely from-scratch construction.
  Revision: yes
- Referee: [Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.
  Authors: We thank the referee for this observation. While the manuscript contains the underlying run-level data, we acknowledge that a consolidated table with per-agent, per-task means and standard deviations is not yet presented in a form that directly supports the abstract claim. We will add such a table (and associated confidence intervals where appropriate) to the results section and will reference the quantitative findings explicitly in the abstract and discussion.
  Revision: yes
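The consolidated table promised in this response could be built from run-level records along the following lines. This is a sketch only: the `(agent, task, score)` record layout, the function names, and the tab-separated rendering are assumptions, and the agent and task labels in any usage would be placeholders rather than the paper's data.

```python
import statistics
from collections import defaultdict
from typing import Iterable


def consolidate(
    runs: Iterable[tuple[str, str, float]],
) -> dict[tuple[str, str], tuple[float, float, int]]:
    """Group (agent, task, score) records into {(agent, task): (mean, std, n)}."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for agent, task, score in runs:
        grouped[(agent, task)].append(score)
    table = {}
    for key, scores in grouped.items():
        # Sample std is undefined for a single run, so report NaN in that case.
        std = statistics.stdev(scores) if len(scores) > 1 else float("nan")
        table[key] = (statistics.fmean(scores), std, len(scores))
    return table


def render(table: dict[tuple[str, str], tuple[float, float, int]]) -> str:
    """Render one row per agent and one column per task as tab-separated text."""
    agents = sorted({agent for agent, _ in table})
    tasks = sorted({task for _, task in table})
    lines = ["agent\t" + "\t".join(tasks)]
    for agent in agents:
        cells = []
        for task in tasks:
            if (agent, task) in table:
                mean, std, n = table[(agent, task)]
                cells.append(f"{mean:.3f} ± {std:.3f} (n={n})")
            else:
                cells.append("n/a")
        lines.append(agent + "\t" + "\t".join(cells))
    return "\n".join(lines)
```

Reporting `n` in every cell makes failed or missing runs visible, which matters when some agents exceed the time budget and produce fewer than five valid scores.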
Circularity Check
No circularity: benchmark and empirical results are self-contained
The paper introduces a new benchmark (1GC-7RC) with locked scripts and reports direct empirical performance differences across agents on seven tasks. No equations, fitted parameters, predictions, or derivations appear in the presented material. The central claims rest on experimental runs rather than any reduction to prior self-citations or self-definitions. The work is therefore self-contained with no load-bearing steps matching the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Acknowledgments
The authors gratefully acknowledge the Institute for Distributed Intelligent Systems (ETTI 2) and the Institute for Autonomous Systems Technology (LRT 8.1) at the University of the Bundeswehr Munich for granting access t...