
arxiv: 2410.07095 · v6 · pith:QF4DHV3Vnew · submitted 2024-10-09 · 💻 cs.CL

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Pith reviewed 2026-05-23 19:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords machine learning agents · benchmark · Kaggle competitions · AI engineering · language model scaffolding · model evaluation

The pith

AI agents using o1-preview with AIDE reach Kaggle bronze medal level in 16.9 percent of ML engineering competitions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MLE-bench from 75 Kaggle competitions to measure how well AI agents perform full machine learning engineering work including model training, data preparation, and experiments. Human performance is anchored to public Kaggle leaderboards so agent results can be compared directly to medal thresholds. The strongest result comes from pairing OpenAI's o1-preview model with AIDE scaffolding, which hits at least bronze in 16.9 percent of the tasks. The authors also examine how extra compute and pre-training data affect outcomes. The benchmark code is released publicly so others can run and extend the tests.

Core claim

MLE-bench shows that the best current agent setup, o1-preview with AIDE scaffolding, reaches at least bronze-medal level in 16.9 percent of the benchmark's Kaggle competitions, while the other model-scaffold combinations tested score lower against the same human baselines.

What carries the argument

MLE-bench, a set of 75 curated Kaggle competitions that test agents on end-to-end ML engineering tasks scored against public leaderboards.
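
As a rough illustration of what "scored against public leaderboards" can mean in practice, the Python sketch below places an agent's score on a leaderboard and checks it against a bronze cutoff. The function names, the `bronze_rank` parameter, and the toy numbers are assumptions made for illustration; they are not taken from the MLE-bench code, and the actual medal cutoffs depend on Kaggle's rules for each competition.

```python
from typing import Sequence


def leaderboard_rank(agent_score: float, leaderboard: Sequence[float],
                     higher_is_better: bool = True) -> int:
    """Return the 1-based rank the agent's score would hold.

    Ties are resolved in the agent's favour: only strictly better
    human scores push the agent down the ranking.
    """
    if higher_is_better:
        strictly_better = sum(1 for s in leaderboard if s > agent_score)
    else:
        strictly_better = sum(1 for s in leaderboard if s < agent_score)
    return strictly_better + 1


def earns_bronze(agent_score: float, leaderboard: Sequence[float],
                 bronze_rank: int, higher_is_better: bool = True) -> bool:
    """True if the agent's rank is at or above the (assumed) bronze cutoff rank."""
    return leaderboard_rank(agent_score, leaderboard, higher_is_better) <= bronze_rank


def medal_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of competitions in which the agent reached the threshold."""
    if not outcomes:
        raise ValueError("need at least one competition outcome")
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Toy leaderboard (accuracy-style metric) and a hypothetical cutoff rank.
    humans = [0.91, 0.88, 0.84, 0.80, 0.75]
    print(earns_bronze(0.85, humans, bronze_rank=3))
    print(medal_rate([True, False, False, False]))
```

The `higher_is_better` flag matters because some competition metrics are losses rather than accuracies, so the direction of "better" has to be stated per competition.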

If this is right

  • Agents that clear the bronze threshold on these tasks can be expected to complete some practical ML pipelines without human intervention.
  • Differences in performance across model-scaffold pairs give a direct signal for which combinations are worth scaling further.
  • The public release of the benchmark allows systematic study of how added compute or reduced contamination changes agent success rates.
  • Future agent designs can be compared on the same fixed set of competitions rather than ad-hoc toy problems.
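
One common way to quantify how success rates change when an agent is given more attempts is the pass@k estimator from the code-generation literature. The sketch below is a generic Python version of that estimator. Treating "at least one medal in k attempts" as the success event, and the function name `pass_at_k`, are assumptions made here for illustration; this page does not state the exact computation the paper uses.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success among k attempts drawn from n runs).

    n: total attempts actually run on one competition
    c: how many of those attempts succeeded (e.g. earned a medal)
    k: attempt budget being evaluated, with 1 <= k <= n
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    # One minus the probability that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Toy numbers: 8 runs on one competition, 2 of them successful.
    for k in (1, 2, 4, 8):
        print(k, round(pass_at_k(n=8, c=2, k=k), 3))
```

Averaging this quantity over competitions gives a curve that can only stay flat or rise with k, which is the shape one would expect from any attempts-versus-success plot.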

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agents continue to improve on this benchmark, more of the day-to-day work of training and tuning models could shift from human engineers to automated systems.
  • Extending the benchmark to competitions posted after the training cutoff of the tested models would isolate the effect of data contamination.
  • Success on Kaggle-style tasks may indicate readiness for other structured engineering domains that share the same workflow of data handling, model iteration, and evaluation.
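
The contamination check suggested above could be sketched as a simple split of results by competition date. The `CompetitionResult` record, its field names, and the toy data are hypothetical and chosen only for illustration; a gap between the two rates would be suggestive rather than conclusive, since newer competitions may also differ in difficulty.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class CompetitionResult:
    name: str        # hypothetical competition identifier
    launched: date   # date the competition was posted
    medal: bool      # whether the agent reached at least bronze


def _rate(flags: List[bool]) -> Optional[float]:
    """Fraction of True values, or None when the group is empty."""
    return sum(flags) / len(flags) if flags else None


def medal_rate_by_cutoff(results: Sequence[CompetitionResult],
                         cutoff: date) -> Tuple[Optional[float], Optional[float]]:
    """Return (rate on or before cutoff, rate after cutoff)."""
    before = [r.medal for r in results if r.launched <= cutoff]
    after = [r.medal for r in results if r.launched > cutoff]
    return _rate(before), _rate(after)


if __name__ == "__main__":
    toy = [
        CompetitionResult("comp-a", date(2021, 3, 1), True),
        CompetitionResult("comp-b", date(2022, 6, 1), False),
        CompetitionResult("comp-c", date(2024, 1, 15), False),
        CompetitionResult("comp-d", date(2024, 5, 1), False),
    ]
    # The cutoff date here is an arbitrary placeholder, not a real model cutoff.
    print(medal_rate_by_cutoff(toy, cutoff=date(2023, 10, 1)))
```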

Load-bearing premise

The 75 selected Kaggle competitions capture the skills and challenges that define real-world machine learning engineering.

What would settle it

Re-running the same agent setups on a fresh collection of Kaggle competitions that were never used in the original curation would show whether the 16.9 percent bronze rate holds outside the benchmark set.

Figures

Figures reproduced from arXiv: 2410.07095 by Aleksander Mądry, Dane Sherburn, Evan Mays, Giulio Starace, James Aung, Jun Shern Chan, Kevin Liu, Leon Maksin, Lilian Weng, Neil Chowdhury, Oliver Jaffe, Tejal Patwardhan.

Figure 1: MLE-bench is an offline Kaggle competition environment for AI agents. Each competi…
Figure 2: Excerpts of real trajectories from 3 different agent frameworks attempting competitions
Figure 3: The percentage of medals achieved increases with the number of attempts allowed. GPT…
Figure 5: We observe no positive relationship between GPT-4o’s familiarity with the competition and its performance (score normalized between the sample submission score and the gold medal score for that competition). … time limit gives agents more time to iterate on their solutions, and permits more time for model training. We run an experiment providing GPT-4o (AIDE) with a longer time limit of 100 hours per compe…
Figure 6: MLE-bench contains a total of 75 competitions spanning 15 diverse problem categories.
Figure 7: For every medal-winning submission of gpt-4o AIDE and o1-preview AIDE, we take the…
Figure 8: The prompt with the overall instructions that we initiate all scaffolds with.
Figure 9: The percentage of attempts where models achieved any medal on each competition, plot…
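
The normalization mentioned in the Figure 5 caption (a score placed between the sample submission score and the gold medal score) can be read as a linear rescaling. The Python sketch below shows one plausible reading under that assumption; the function name and the exact formula are not confirmed by this page.

```python
def normalized_score(raw: float, sample_submission_score: float,
                     gold_score: float) -> float:
    """Linearly rescale a raw score so that the sample submission maps
    to 0.0 and the gold medal threshold maps to 1.0.

    Because the span keeps its sign, the same formula also handles
    metrics where lower is better (gold_score < sample_submission_score).
    """
    span = gold_score - sample_submission_score
    if span == 0:
        raise ValueError("gold and sample submission scores must differ")
    return (raw - sample_submission_score) / span


if __name__ == "__main__":
    # Accuracy-style metric: halfway between baseline 0.5 and gold 1.0.
    print(normalized_score(0.75, sample_submission_score=0.5, gold_score=1.0))
    # Loss-style metric: halfway between baseline 0.5 and gold 0.0.
    print(normalized_score(0.25, sample_submission_score=0.5, gold_score=0.0))
```

Values below 0 would mean the agent did worse than the sample submission, and values above 1 would mean it beat the gold threshold, so the result is not clipped here.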

Original abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MLE-bench, a benchmark of 75 curated Kaggle competitions designed to evaluate AI agents on machine learning engineering tasks including model training, dataset preparation, and experimentation. Human baselines are established from public Kaggle leaderboards. Evaluations of frontier models using open-source scaffolds show that o1-preview with AIDE scaffolding reaches at least bronze-medal performance in 16.9% of the competitions. The work additionally examines resource scaling and pre-training contamination effects and releases the benchmark code.

Significance. If the 75 competitions constitute a representative sample, the benchmark supplies an externally validated measure of agent performance against real human competitors on Kaggle, avoiding circularity in scoring. The open-sourcing of the code and the use of public leaderboards are concrete strengths that enable reproducibility and future extensions.

major comments (2)
  1. [Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.
  2. [Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.
minor comments (2)
  1. Figure captions and legends would benefit from explicit mapping of each bar or line to the corresponding model-plus-scaffold combination.
  2. A short table summarizing the distribution of competition types and medal thresholds across the 75 tasks would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details on curation and evaluation protocols.

  1. Referee: [Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.

    Authors: We agree that explicit criteria and breakdowns are needed to support the representativeness claim. In the revision we will add a dedicated subsection with: (1) explicit inclusion criteria (ML-focused competitions with public leaderboards and adequate participation) and exclusion criteria (non-ML tasks, deprecated or low-activity competitions); (2) a quantitative table breaking down the 75 tasks by type (tabular/image/NLP), dataset size bins, and competition age; and (3) a short comparison of the selected set against the broader Kaggle corpus in terms of popularity and difficulty distribution. These additions will clarify how the 16.9% figure should be interpreted. revision: yes

  2. Referee: [Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.

    Authors: We agree that more granular protocol details are required for full assessment. Although the manuscript references open-source scaffolds and Kaggle leaderboards, the revision will expand the evaluation section to specify: agent interaction parameters (turn limits, tool constraints, termination rules); the precise mapping from agent submissions to bronze thresholds using the public leaderboards; and quantitative contamination analysis (methods and results of pre-training overlap checks). These changes will strengthen reproducibility and support for the reported performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central metric anchored to external Kaggle leaderboards

The paper's headline result (16.9% bronze-medal rate for o1-preview + AIDE) is obtained by direct comparison of agent submissions against publicly available Kaggle leaderboards for the 75 curated competitions. This external reference prevents any reduction of the reported percentage to an internally fitted parameter, self-defined threshold, or self-citation chain. The curation step itself is an input choice rather than a derived claim, and no equations or uniqueness theorems are invoked that collapse back onto the paper's own definitions. Minor self-citations (e.g., to prior OpenAI agent work) appear but are not load-bearing for the performance numbers. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that Kaggle competitions constitute a valid proxy for ML engineering capability and that bronze medal placement is a meaningful success threshold; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Kaggle competitions are representative of real-world ML engineering tasks
    The paper states it curates competitions to test 'real-world ML engineering skills' and uses Kaggle leaderboards as human baselines.

pith-pipeline@v0.9.0 · 5735 in / 1238 out tokens · 34009 ms · 2026-05-23T19:08:42.716037+00:00 · methodology

Forward citations

Cited by 56 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

    cs.AI 2026-05 conditional novelty 7.0

    IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

  2. Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.

  3. What Do Evolutionary Coding Agents Evolve?

    cs.NE 2026-05 unverdicted novelty 7.0

    Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.

  4. WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents

    cs.CV 2026-05 accept novelty 7.0

    WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.

  5. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...

  6. WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

    cs.AI 2026-05 unverdicted novelty 7.0

    WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...

  7. WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

    cs.AI 2026-05 unverdicted novelty 7.0

    WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...

  8. DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

    cs.SE 2026-05 conditional novelty 7.0

    DiagEval is a new diagnostic protocol that conditions on failed trajectories to attribute GUI-agent evaluation failures, recovering 45-62% of misattributed cases and lifting accuracy 8-16 points on two benchmarks.

  9. BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

    cs.CE 2026-05 unverdicted novelty 7.0

    BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

  10. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.

  11. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  12. Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

    cs.LG 2026-05 unverdicted novelty 7.0

    Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

  13. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  14. TeamBench: Evaluating Agent Coordination under Enforced Role Separation

    cs.AI 2026-05 unverdicted novelty 7.0

    Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.

  15. AcademiClaw: When Students Set Challenges for AI Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

  16. SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

    cs.AI 2026-04 unverdicted novelty 7.0

    LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.

  17. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  18. KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems

    cs.AI 2025-08 unverdicted novelty 7.0

    KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.

  19. Frontier Models are Capable of In-context Scheming

    cs.AI 2024-12 conditional novelty 7.0

    Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.

  20. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

    cs.AI 2026-05 unverdicted novelty 6.0

    AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7...

  21. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.

  22. How Far Are We From True Auto-Research?

    cs.AI 2026-05 unverdicted novelty 6.0

    ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

  23. DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

    cs.SE 2026-05 unverdicted novelty 6.0

    DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% o...

  24. FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

    cs.LG 2026-05 accept novelty 6.0

    FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

  25. MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

    cs.LG 2026-05 conditional novelty 6.0

    MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...

  26. SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

    cs.CR 2026-05 unverdicted novelty 6.0

    SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.

  27. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...

  28. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.

  29. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  30. Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 6.0

    Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.

  31. On Benchmark Hacking in ML Contests: Modeling, Insights and Design

    econ.GN 2026-04 unverdicted novelty 6.0

    In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.

  32. Evaluation-driven Scaling for Scientific Discovery

    cs.LG 2026-04 unverdicted novelty 6.0

    SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...

  33. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

  34. Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization

    cs.AI 2026-04 unverdicted novelty 6.0

    Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...

  35. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  36. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  37. Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation

    cs.CL 2026-04 unverdicted novelty 6.0

    Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.

  38. Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search

    cs.LG 2026-03 unverdicted novelty 6.0

    Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.

  39. What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations

    cs.CL 2025-10 unverdicted novelty 6.0

    xKG is a paper-centric knowledge base that extracts code and insights to improve LLM agent performance on AI research replication by 10.9% on PaperBench.

  40. MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining

    cs.CL 2025-09 unverdicted novelty 6.0

    MachineLearningLM uses continued pretraining on SCM-synthesized ML tasks with random-forest distillation to give LLMs robust many-shot in-context learning on tabular classification, reaching random-forest accuracy lev...

  41. RExBench: Can coding agents autonomously implement AI research extensions?

    cs.CL 2025-06 unverdicted novelty 6.0

    RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

  42. DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

    cs.CL 2025-06 conditional novelty 6.0

    DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

  43. VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

    cs.AI 2025-06 unverdicted novelty 6.0

    VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasonin...

  44. PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

    cs.LG 2026-05 unverdicted novelty 5.0

    PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...

  45. EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

    cs.AI 2026-04 unverdicted novelty 5.0

    EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.

  46. Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

    cs.AI 2026-04 unverdicted novelty 5.0

    Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.

  47. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  48. AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering

    cs.LG 2026-02 unverdicted novelty 5.0

    AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.

  49. End-to-end PDDL Planning with Hardcoded and Dynamic Agents

    cs.AI 2025-12 unverdicted novelty 5.0

    An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.

  50. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  51. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  52. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

  53. From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

    cs.AI 2025-04 accept novelty 4.0

    A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.

  54. Europe and the Geopolitics of AGI: The Need for a Preparedness Plan

    cs.CY 2026-05 unverdicted novelty 3.0

    AGI may arrive by 2030-2040 and reshape global power balances, requiring Europe to close gaps in compute, talent retention, industrial adoption, and unified policy responses through a coordinated preparedness agenda.

  55. A Survey of Reinforcement Learning for Large Reasoning Models

    cs.CL 2025-09 accept novelty 3.0

    A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.

  56. Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    cs.CL 2025-03 accept novelty 3.0

    A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 53 Pith papers · 10 internal anchors

  1. [1]

    Anthropic's Responsible Scaling Policy, Version 1.0, September 2023

    Anthropic. Anthropic's Responsible Scaling Policy, Version 1.0, September 2023

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models, August 2021. URL http://arxiv.org/abs/2108.07732. arXiv:2108.07732 [cs]

  3. [3]

    Quantifying Memorization Across Neural Language Models

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying Memorization Across Neural Language Models, March 2023. URL http://arxiv.org/abs/2202.07646. arXiv:2202.07646 [cs]

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  5. [5]

    Cognition Introducing Devin, the first AI software engineer, March 2024

    cognition.ai. Cognition Introducing Devin, the first AI software engineer, March 2024. URL https://cognition.ai/

  6. [6]

    Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020

    Rhiju Das, H Wayment-Steele, Do Soon Kim, Christian Choe, Bojan Tunguz, Walter Reade, and Maggie Demkin. Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020. URL https://kaggle.com/competitions/stanford-covid-vaccine

  7. [7]

    ConStat: Performance-Based Contamination Detection in Large Language Models, May 2024

    Jasper Dekoninck, Mark Niklas Müller, and Martin Vechev. ConStat: Performance-Based Contamination Detection in Large Language Models, May 2024. URL http://arxiv.org/abs/2405.16281. arXiv:2405.16281 [cs]

  8. [8]

    GitHub Copilot Workspace: Welcome to the Copilot-native developer environment, April 2024

    Thomas Dohmke. GitHub Copilot Workspace: Welcome to the Copilot-native developer environment, April 2024. URL https://github.blog/news-insights/product-news/github-copilot-workspace/

  9. [9]

    Code Droid Technical Report, June 2024

    factory.ai. Code Droid Technical Report, June 2024. URL https://www.factory.ai/news/code-droid-technical-report

  10. [10]

    AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents, April 2024

    Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents, April 2024. URL http://arxiv.org/abs/2404.06411. arXiv:2404.06411 [cs]

  11. [11]

    Frontier Safety Framework, May 2024

    Google DeepMind. Frontier Safety Framework, May 2024

  12. [12]

    Measuring Coding Challenge Competence With APPS

    Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring Coding Challenge Competence With APPS, November 2021. URL http://arxiv.org/abs/2105.09938. arXiv:2105.09938 [cs]

  13. [13]

    AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

    Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation, May 2024a. URL http://arxiv.org/abs/2312.13010. arXiv:2312.13010 [cs]

  14. [14]

    MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. In Forty-first International Conference on Machine Learning, June 2024b. URL https://openreview.net/forum?id=1Fs1LvjYQW

  15. [15]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024. URL http://arxiv.org/abs/2403.07974. arXiv:2403.07974 [cs]

  16. [16]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, April 2024. URL http://arxiv.org/abs/2310.06770. arXiv:2310.06770 [cs]

  17. [17]

    DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?, September 2024

    Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?, September 2024. URL http://arxiv.org/abs/2409.07703. arXiv:2409.07703 [cs]

  18. [18]

    Kaggle Progression System Kaggle, 2024

    Kaggle. Kaggle Progression System Kaggle, 2024. URL https://www.kaggle.com/progression

  19. [19]

    Research: quantifying GitHub Copilot’s impact on developer productivity and happiness, September 2022

    Eirini Kalliamvakou. Research: quantifying GitHub Copilot’s impact on developer productivity and happiness, September 2022. URL https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

  20. [20]

    AI Agents That Matter

    Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI Agents That Matter, July 2024. URL http://arxiv.org/abs/2407.01502. arXiv:2407.01502 [cs]

  21. [21]

    Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, and Oriol Vinyals

    Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien De Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

  22. [22]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2023. URL http://arxiv.org/abs/2308.03688....

  23. [23]

    Vesuvius challenge - ink detection, 2023

    Alex Lourenco, Brent Seales, Christy Chapman, Daniel Havir, Ian Janicki, JP Posma, Nat Friedman, Ryan Holbrook, Seth P., Stephen Parsons, and Will Cukierski. Vesuvius challenge - ink detection, 2023. URL https://kaggle.com/competitions/vesuvius-challenge-ink-detection

  24. [24]

    Discovering and exploring cases of educational source code plagiarism with Dolos, 2024

    Rien Maertens, Maarten Van Neyghem, Maxiem Geldhof, Charlotte Van Petegem, Niko Strijbol, Peter Dawyndt, and Bart Mesuere. Discovering and exploring cases of educational source code plagiarism with Dolos, 2024. URL https://github.com/dodona-edu/dolos. Publication title: SoftwareX

  25. [25]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, November 2023. URL http://arxiv.org/abs/2311.12983. arXiv:2311.12983 [cs]

  26. [26]

    Preparedness Framework, December 2023

    OpenAI. Preparedness Framework, December 2023

  27. [27]

    Introducing Weco AIDE, April 2024

    Dominik Schmidt, Zhengyao Jiang, and Yuxiang Wu. Introducing Weco AIDE, April 2024. URL https://www.weco.ai/blog/technical-report

  28. [28]

    ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code, August 2024

    Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating Large Language Models and A...

  29. [29]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenDevin: An Open Platform for AI Soft...

  30. [30]

    The shift from models to compound ai systems, 2024

    Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems, 2024. URL http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

  31. [31]

    AutoCodeRover: Autonomous Program Improvement, July 2024

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous Program Improvement, July 2024. URL http://arxiv.org/abs/2404.05427. arXiv:2404.05427 [cs]

  32. [32]

    Can GPT-4 Perform Neural Architecture Search?, August 2023

    Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, and Samuel Albanie. Can GPT-4 Perform Neural Architecture Search?, August 2023. URL http://arxiv.org/abs/2304.10970. arXiv:2304.10970 [cs]
