1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

Anna Bößendörfer; Fabian Deuser; Konrad Habel; Norbert Oswald; Robin-Nico Kampa

arXiv: 2605.17046 · v2 · pith:XI4ECRM4new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-20 15:40 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords AI coding agents, machine learning benchmark, model training, single GPU, performance evaluation, autonomous research agents, time-constrained tasks

The pith

A new benchmark with seven ML tasks shows current AI coding agents differ substantially in their ability to design and train models from scratch on a single GPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 1GC-7RC, a benchmark with seven tasks covering language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task supplies locked data-preparation and evaluation scripts so that agents may edit only the training code, with no internet access, no pretrained weights except one controlled case, and a fixed wall-clock budget on one GPU. Tests of seven agents across five runs each produce large performance gaps that track differences in implicit machine-learning knowledge, planning, and time management. The authors release the benchmark, harness, and results publicly to enable direct comparisons as agents evolve. The modular setup also supports adding new tasks or studying multi-agent work.

Core claim

We introduce 1GC-7RC, a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management.

What carries the argument

1GC-7RC benchmark, which supplies locked data-preparation and evaluation scripts for seven ML tasks and restricts agents to changes only in the training script under single-GPU and time-budget rules.
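The locking rule described above could be checked mechanically. The following is a minimal sketch, not the released harness: it assumes the locked scripts can be identified by file name inside a task workspace, and it compares SHA-256 digests taken before and after the agent runs. All function names here are illustrative assumptions.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(workspace: Path, locked: list[str]) -> dict[str, str]:
    """Record digests of the locked files before the agent starts."""
    return {name: file_digest(workspace / name) for name in locked}


def changed_locked_files(workspace: Path, before: dict[str, str]) -> list[str]:
    """List locked files that were modified or deleted during the run."""
    changed = []
    for name, digest in before.items():
        path = workspace / name
        # A missing file counts as a violation, just like changed content.
        if not path.exists() or file_digest(path) != digest:
            changed.append(name)
    return sorted(changed)
```

A caller would take `snapshot(workspace, ["prepare.py", "evaluate.py"])` before handing the workspace to the agent and reject the run if `changed_locked_files` returns a non-empty list afterwards. The file names are hypothetical placeholders. Content digests are used instead of modification times because an agent could rewrite a file with identical content, which should not count as a violation.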

If this is right

Agents differ in the depth of implicit machine-learning knowledge they apply to new tasks.
Planning ability and time-budget management vary markedly across the tested agents.
The modular benchmark design permits straightforward addition of new tasks or domains.
Public release of the harness supports reproducible head-to-head comparison of future agents.
The same setup can be used to study multi-agent collaboration on research problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Performance profiles on the benchmark could guide which agent to deploy for particular modeling projects.
Repeated evaluations over time could quantify progress in agents' capacity for independent model development.
Gaps in time management suggest that workflow-level planning is a distinct skill worth separate improvement in agent training.
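To make the time-management point concrete, here is a minimal sketch of a training loop that respects a wall-clock budget. It is an illustration under stated assumptions, not something taken from the paper: `step` stands for one training iteration, `reserve_s` is a hypothetical safety margin kept back for checkpointing and evaluation, and the stopping rule (stop when the slowest step seen so far might no longer fit) is one simple choice among many.

```python
import time
from typing import Callable


def train_within_budget(
    step: Callable[[], object],
    budget_s: float,
    reserve_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> tuple[int, float]:
    """Run training steps until another step might cut into the reserve.

    Returns (steps_completed, elapsed_seconds).
    """
    start = clock()
    slowest = 0.0  # longest single step observed so far, in seconds
    steps = 0
    while True:
        elapsed = clock() - start
        # Stop if one more step at the worst observed cost would exceed
        # the budget once the reserve is set aside.
        if elapsed + slowest + reserve_s > budget_s:
            return steps, elapsed
        t0 = clock()
        step()
        slowest = max(slowest, clock() - t0)
        steps += 1
```

For a 40-minute task, a call might look like `train_within_budget(step, budget_s=40 * 60, reserve_s=5 * 60)`, where the five-minute reserve is an arbitrary example value. The `clock` parameter exists only so the loop can be exercised with a fake clock in tests.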

Load-bearing premise

Locking data-preparation and evaluation scripts while allowing only training-code changes, with no-internet and no-pretrained-weights rules, gives a fair test of an agent's ability to design, implement, and train models from scratch.

What would settle it

Re-running the seven agents and finding that all reach similar high performance on every task, or that gaps disappear when pretrained weights or internet access are allowed, would indicate the benchmark does not measure meaningful differences in agent capability.

Figures

Figures reproduced from arXiv: 2605.17046 by Anna Bößendörfer, Fabian Deuser, Konrad Habel, Norbert Oswald, Robin-Nico Kampa.

**Figure 2.** Harness architecture. Each task runs in an isolated workspace with copied …

**Figure 3.** Mean training scripts per task (7 tasks × 5 runs) vs. 1GC-7RC Agg score (Eq. 1). Higher iteration counts do not predict higher scores; the dashed line indicates the anti-correlation.
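The relationship described in the Figure 3 caption is a correlation between two per-agent quantities: mean number of training scripts and aggregate score. The caption does not say which statistic the authors use, so the sketch below assumes the Pearson coefficient purely for illustration. The function name is hypothetical and no values from the paper are reproduced.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)
```

With one point per agent, `xs` would hold the mean script counts and `ys` the aggregate scores. A negative result would correspond to the anti-correlation the caption mentions, although with only seven agents any such coefficient carries wide uncertainty.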

Original abstract

Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The benchmark gives a concrete way to compare agents under single-GPU time limits, but the supplied baselines mean the gaps likely reflect editing skill more than building models from scratch.

The main thing to know is that 1GC-7RC sets up seven ML tasks with locked data and evaluation scripts plus a baseline training script that agents are allowed only to edit, all under single-GPU wall-clock limits and no-internet rules. This produces a usable comparison across agents, yet the results probably measure how well agents patch and adapt given code rather than how well they design training procedures from empty files.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the 1GC-7RC benchmark consisting of seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task supplies locked data-preparation and evaluation scripts together with a baseline training script that agents may edit; agents operate under no-internet, no-pretrained-weights (with one controlled exception), and task-specific wall-clock limits on a single GPU. Seven coding agents (five proprietary, two open-source) are evaluated over five runs per agent-task pair, with the authors reporting substantial performance differences that they attribute to varying implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and artifacts are released publicly on GitHub with a modular design intended to support extensions.

Significance. If the reported performance differences are reproducible and the benchmark isolates the claimed capabilities, the work would supply a timely, reproducible platform for comparing autonomous coding agents on realistic single-GPU ML workloads. The public release of code and artifacts, together with the modular task structure, directly supports community reuse and extension to new domains or multi-agent settings.

major comments (2)

[Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.
[Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.
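The table requested in the second major comment could be produced from run-level records along the following lines. This is a sketch under assumptions: it supposes each run is available as an `(agent, task, score)` tuple, which may not match the format of the released artifacts, and it reports the sample standard deviation (n - 1 denominator) as one reasonable variance measure for five runs.

```python
from collections import defaultdict
from statistics import mean, stdev
from typing import Iterable


def summarize(
    runs: Iterable[tuple[str, str, float]],
) -> dict[tuple[str, str], tuple[int, float, float]]:
    """Group run scores by (agent, task) and return (n, mean, sample std)."""
    grouped: dict[tuple[str, str], list[float]] = defaultdict(list)
    for agent, task, score in runs:
        grouped[(agent, task)].append(score)
    return {
        # stdev needs at least two values; report 0.0 for a single run.
        key: (len(vals), mean(vals), stdev(vals) if len(vals) > 1 else 0.0)
        for key, vals in grouped.items()
    }
```

Calling `summarize` on all 7 × 7 × 5 run records would yield one row per agent-task pair. Because the tasks use different metrics, the scores would have to be compared within a task or normalized first; that step is left out here.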

minor comments (2)

The title could be revised to emphasize that the benchmark evaluates code adaptation under constraints rather than open-ended research problem solving.
A summary table listing the seven tasks, their time budgets, and key dataset characteristics would improve readability and allow readers to gauge relative difficulty at a glance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and indicate the specific revisions we plan to incorporate.

Referee: [Abstract] Abstract: The central interpretation that performance differences reveal 'implicit ML knowledge, planning ability, and time-budget management' rests on the premise that agents must design and implement training procedures largely on their own. The described protocol instead supplies an explicit baseline training script that agents are permitted only to modify. This setup allows success via diagnosis and patching of the supplied baseline (which already encodes a model, optimizer, and loop) rather than de-novo construction; without evidence that the baselines are deliberately minimal or broken, the observed gaps are at least as plausibly explained by debugging proficiency as by the deeper capabilities claimed.

Authors: We agree that the benchmark supplies a baseline training script which agents may modify rather than requiring fully de-novo implementation of training loops. This design decision was made deliberately to ensure tasks remain solvable within the strict single-GPU wall-clock limits while still testing realistic ML engineering skills. Effective modification nevertheless requires diagnosing shortcomings in the baseline, planning which changes to prioritize, and managing the remaining time budget. We will revise the abstract and introduction to state more precisely that agents adapt and improve the supplied baselines, and we will note that the observed differences reflect capabilities in diagnosis, planning, and optimization under constraints. We do not claim purely from-scratch construction.
Revision: yes
Referee: [Results] Evaluation protocol: The abstract asserts 'substantial performance differences' across five runs per agent-task pair yet supplies no numerical metrics, tables, or variance measures in the provided text. To support the claim that these differences reliably indicate varying agent capabilities, the results section must report per-task, per-agent scores together with standard deviations or confidence intervals so that the magnitude and consistency of the gaps can be assessed.

Authors: We thank the referee for this observation. While the manuscript contains the underlying run-level data, we acknowledge that a consolidated table with per-agent, per-task means and standard deviations is not yet presented in a form that directly supports the abstract claim. We will add such a table (and associated confidence intervals where appropriate) to the results section and will reference the quantitative findings explicitly in the abstract and discussion.
Revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark and empirical results are self-contained

The paper introduces a new benchmark (1GC-7RC) with locked scripts and reports direct empirical performance differences across agents on seven tasks. No equations, fitted parameters, predictions, or derivations appear in the presented material. The central claims rest on experimental runs rather than any reduction to prior self-citations or self-definitions. The work is therefore self-contained with no load-bearing steps matching the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides insufficient detail to enumerate specific free parameters or axioms; the main design choices (task selection and time budgets) are presented as part of the contribution rather than fitted quantities.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 12 internal anchors

[1]

Densenet models for tiny imagenet classification.arXiv preprint arXiv:1904.10429, 2019

Zoheb Abai and Nishad Rajmalwar. Densenet models for tiny imagenet classification.arXiv preprint arXiv:1904.10429, 2019

work page arXiv 1904
[2]

OpenCode: The open source coding agent, 2025

Anomaly Innovations. OpenCode: The open source coding agent, 2025. URLhttps://github. com/anomalyco/opencode. Accessed: 2026-04-28

work page 2025
[3]

Claude code overview, 2025

Anthropic. Claude code overview, 2025. URLhttps://code.claude.com/docs/en/overview. Accessed: 2026-04-28

work page 2025
[4]

Model system cards, 2026

Anthropic. Model system cards, 2026. URLhttps://www.anthropic.com/system-cards. Ac- cessed: 2026-04-28

work page 2026
[5]

Tabnet: Attentive interpretable tabular learning

Sercan Ö Arik and Tomas Pfister. Tabnet: Attentive interpretable tabular learning. InProceedings of the AAAI conference on artificial intelligence, volume 35, pages 6679–6687, 2021

work page 2021
[6]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[7]

Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables.Computers and electronics in agriculture, 24(3):131–151, 1999

work page 1999
[8]

Multi-rf fusion with multi-gnn blending for molecular property prediction.arXiv preprint arXiv:2603.20724, 2026

Zacharie Bugaud. Multi-rf fusion with multi-gnn blending for molecular property prediction.arXiv preprint arXiv:2603.20724, 2026

work page arXiv 2026
[9]

Mle-bench: Evaluating machine learning agents on machine learning engineering

Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, et al. Mle-bench: Evaluating machine learning agents on machine learning engineering. InInternational Conference on Learning Representations, volume 2025, pages 50466–50494, 2025. 9

work page 2025
[10]

LangChain, 2022

Harrison Chase. LangChain, 2022. URLhttps://github.com/langchain-ai/langchain. Re- leased: 2022-10-17; Accessed: 2026-04-28

work page 2022
[11]

Mlr-bench: Evaluating ai agents on open-ended machine learning research.Advances in Neural Information Processing Systems, 38, 2026

Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, and Bryan Hooi. Mlr-bench: Evaluating ai agents on open-ended machine learning research.Advances in Neural Information Processing Systems, 38, 2026

work page 2026
[12]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry

Neil Chowdhury, James Aung, Chan Jun Shern, Oliver Jaffe, Dane Sherburn, Giulio Starace, Evan Mays, Rachel Dias, Marwan Aljubeh, Mia Glaese, Carlos E. Jimenez, John Yang, Leyton Ho, Tejal Patwardhan, Kevin Liu, and Aleksander Madry. Introducing SWE-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/

work page 2024
[14]

AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexan- der Smola. Autogluon-tabular: Robust and accurate automl for structured data.arXiv preprint arXiv:2003.06505, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2003
[15]

The pascal visual object classes (voc) challenge.International journal of computer vision, 88(2): 303–338, 2010

Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge.International journal of computer vision, 88(2): 303–338, 2010

work page 2010
[16]

Efficient and robust automated machine learning.Advances in neural information processing systems, 28, 2015

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. Efficient and robust automated machine learning.Advances in neural information processing systems, 28, 2015

work page 2015
[17]

Deep residual learning for image recog- nition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recog- nition. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016
[18]

Noah Hollmann, Samuel Müller, and Frank Hutter. Large language models for automated data science: Introducing caafe for context-aware automated feature engineering.Advances in Neural Information Processing Systems, 36:44753–44775, 2023

work page 2023
[19]

Automated design of agentic systems

Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems. InInternational Conference on Learning Representations, volume 2025, pages 21344–21377, 2025

work page 2025
[20]

Open graph benchmark: Datasets for machine learning on graphs.Advances in neural information processing systems, 33:22118–22133, 2020

Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs.Advances in neural information processing systems, 33:22118–22133, 2020

work page 2020
[21]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. InInternational Conference on Machine Learning, pages 20271–20309. PMLR, 2024

work page 2024
[22]

Aide: Ai-driven exploration in the space of code, 2025

Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, and Yuxiang Wu. Aide: Ai-driven exploration in the space of code, 2025

work page 2025
[23]

Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? InInternational Conference on Learning Representations, volume 2024, pages 54107–54157, 2024

work page 2024
[24]

char-rnn: Multi-layer recurrent neural networks for character-level language models.https://github.com/karpathy/char-rnn, 2015

Andrej Karpathy. char-rnn: Multi-layer recurrent neural networks for character-level language models.https://github.com/karpathy/char-rnn, 2015. Accessed: 2026-04-28

work page 2015
[25]

nanoGPT: The simplest, fastest repository for training/fine-tuning medium-sized GPTs.https://github.com/karpathy/nanoGPT, 2022

Andrej Karpathy. nanoGPT: The simplest, fastest repository for training/fine-tuning medium-sized GPTs.https://github.com/karpathy/nanoGPT, 2022. Accessed: 2026-04-28

work page 2022
[26]

autoresearch: Ai agents running research on single-gpu nanochat training auto- matically, 2026

Andrej Karpathy. autoresearch: Ai agents running research on single-gpu nanochat training auto- matically, 2026. URLhttps://github.com/karpathy/autoresearch. Accessed: 2026-04-28

work page 2026
[27]

Tiny imagenet visual recognition challenge.CS 231N, 7(7):3, 2015

Yann Le, Xuan Yang, et al. Tiny imagenet visual recognition challenge.CS 231N, 7(7):3, 2015. 10

work page 2015
[28]

DARTS: Differentiable Architecture Search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. Darts: Differentiable architecture search.arXiv preprint arXiv:1806.09055, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Agentbench: Evaluating llms as agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kai- wen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. InInternational Conference on Learning Representations, volume 2024, pages 52989–53046, 2024

work page 2024
[30]

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, and Siheng Chen. Ml-agent: Reinforcing llm agents for autonomous machine learning engineering.arXiv preprint arXiv:2505.23723, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist: Towards fully automated open-ended scientific discovery.arXiv preprint arXiv:2408.06292, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Kimi K2.6 tech blog: Advancing open-source coding, April 2026

Moonshot AI. Kimi K2.6 tech blog: Advancing open-source coding, April 2026. URLhttps: //www.kimi.com/blog/kimi-k2-6. Accessed: 2026-04-28

work page 2026
[33]

A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers.arXiv preprint arXiv:2211.14730, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[34]

Introducing codex.https://openai.com/index/introducing-codex/, 2025

OpenAI. Introducing codex.https://openai.com/index/introducing-codex/, 2025. Accessed: 2026-05-02

work page 2025
[35]

Introducing GPT-5.5.https://openai.com/index/introducing-gpt-5-5/, April 2026

OpenAI. Introducing GPT-5.5.https://openai.com/index/introducing-gpt-5-5/, April 2026. Accessed: 2026-05-02

work page 2026
[36]

Kimi K2.5 on OpenRouter, 2026

OpenRouter. Kimi K2.5 on OpenRouter, 2026. URLhttps://openrouter.ai/moonshotai/ kimi-k2.5. Accessed: 2026-04-28

work page 2026
[37]

Kimi K2.6 on OpenRouter, 2026

OpenRouter. Kimi K2.6 on OpenRouter, 2026. URLhttps://openrouter.ai/moonshotai/ kimi-k2.6. Accessed: 2026-04-28

work page 2026
[38]

Qwen3.6 Plus on OpenRouter, 2026

OpenRouter. Qwen3.6 Plus on OpenRouter, 2026. URLhttps://openrouter.ai/qwen/qwen3. 6-plus. Accessed: 2026-04-28

work page 2026
[39]

Qwen3.6-Plus: Towards real world agents, April 2026

QwenTeam. Qwen3.6-Plus: Towards real world agents, April 2026. URLhttps://qwen.ai/blog? id=qwen3.6. Accessed: 2026-04-28

work page 2026
[40]

A self-improving coding agent.arXiv preprint arXiv:2504.15228,

Maxime Robeyns, Martin Szummer, and Laurence Aitchison. A self-improving coding agent.arXiv preprint arXiv:2504.15228, 2025

work page arXiv 2025
[41]

DINOv3

Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

Flaml: A fast and lightweight automl library.Proceedings of machine learning and systems, 3:434–447, 2021

Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu. Flaml: A fast and lightweight automl library.Proceedings of machine learning and systems, 3:434–447, 2021

work page 2021
[44]

Trustworthy bench- marks (cont.).https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/, 2026

Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, and Dawn Song. Trustworthy bench- marks (cont.).https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/, 2026. Ac- cessed: 2026-05-02

work page 2026
[45]

Openhands: Anopenplatformforaisoftwaredevelopers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, YueqiSong, BowenLi, JaskiratSingh, etal. Openhands: Anopenplatformforaisoftwaredevelopers as generalist agents. InInternational Conference on Learning Representations, volume 2025, pages 65882–65919, 2025

work page 2025
[46]

Re-bench: Evaluating frontier ai r&d ca- pabilities of language model agents against human experts

Hjalmar Wijk, Tao Roa Lin, Joel Becker, Sami Jawhar, Neev Parikh, Thomas Broadley, Lawrence Chan, Michael Chen, Joshua M Clymer, Jai Dhyani, et al. Re-bench: Evaluating frontier ai r&d ca- pabilities of language model agents against human experts. InInternational Conference on Machine Learning, pages 66772–66832. PMLR, 2025. 11

work page 2025
[47]

Autogen: Enablingnext-genllmapplicationsviamulti-agent conversations

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, ShaokunZhang, JialeLiu, etal. Autogen: Enablingnext-genllmapplicationsviamulti-agent conversations. InFirst conference on language modeling, 2024

work page 2024
[48]

Introducing devin, the first ai software engineer.Cognition Labs Blog, 2024

Scott Wu. Introducing devin, the first ai software engineer.Cognition Labs Blog, 2024

work page 2024
[49]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

work page 2024
[50]

How Powerful are Graph Neural Networks?

Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks?arXiv preprint arXiv:1810.00826, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[51]

The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, and David Ha. The ai scientist-v2: Workshop-level automated scientific discovery via agentic tree search.arXiv preprint arXiv:2504.08066, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[52]

Are transformers effective for time series forecasting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

Ailing Zeng, Muxi Chen, Lei Zhang, and Qiang Xu. Are transformers effective for time series forecasting? InProceedings of the AAAI conference on artificial intelligence, volume 37, pages 11121–11128, 2023

work page 2023
[53]

Mlcopilot: Unleashingthepower of large language models in solving machine learning tasks

LeiZhang, YugeZhang, KanRen, DongshengLi, andYuqingYang. Mlcopilot: Unleashingthepower of large language models in solving machine learning tasks. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2931–2959, 2024

work page 2024
[54]

Character-level convolutional networks for text clas- sification.Advances in neural information processing systems, 28, 2015

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text clas- sification.Advances in neural information processing systems, 28, 2015

work page 2015
[55]

Informer: Beyond efficient transformer for long sequence time-series forecasting

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. InProceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106–11115, 2021

[56] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations, volume 2024, pages 15585–15606, 2024.

[57] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

Acknowledgments

The authors gratefully acknowledge the Institute for Distributed Intelligent Systems (ETTI 2) and the Institute for Autonomous Systems Technology (LRT 8.1) at the University of the Bundeswehr Munich for granting access t...

