MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

Aleksander Mądry; Dane Sherburn; Evan Mays; Giulio Starace; James Aung; Jun Shern Chan; Kevin Liu; Leon Maksin; Lilian Weng; Neil Chowdhury


Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.


T0 review · grok-4.3

AI agents using o1-preview with AIDE reach Kaggle bronze medal level in 16.9 percent of ML engineering competitions.

2026-05-23 19:08 UTC pith:QF4DHV3V

Load-bearing objection: MLE-bench assembles 75 Kaggle competitions into an agent benchmark and reports a 16.9% bronze rate for o1-preview plus AIDE, but the result hinges on how the competitions were chosen.

arxiv 2410.07095 v6 pith:QF4DHV3V submitted 2024-10-09 cs.CL


Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry


classification cs.CL

keywords machine learning agents, benchmark, Kaggle competitions, AI engineering, language model scaffolding, model evaluation

verification ladder: T0 review, T1 audit, T2 compute, T3 formal, T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MLE-bench from 75 Kaggle competitions to measure how well AI agents perform full machine learning engineering work including model training, data preparation, and experiments. Human performance is anchored to public Kaggle leaderboards so agent results can be compared directly to medal thresholds. The strongest result comes from pairing OpenAI's o1-preview model with AIDE scaffolding, which hits at least bronze in 16.9 percent of the tasks. The authors also examine how extra compute and pre-training data affect outcomes. The benchmark code is released publicly so others can run and extend the tests.

Core claim

MLE-bench shows that current frontier agents complete real Kaggle competitions at bronze-medal level in 16.9 percent of cases when using o1-preview plus AIDE scaffolding, while lower-performing model-scaffold combinations achieve lower success rates against the same human baselines.

What carries the argument

MLE-bench, a set of 75 curated Kaggle competitions that test agents on end-to-end ML engineering tasks scored against public leaderboards.

Load-bearing premise

The 75 selected Kaggle competitions capture the skills and challenges that define real-world machine learning engineering.

What would settle it

Re-running the same agent setups on a fresh collection of Kaggle competitions that were never used in the original curation would show whether the 16.9 percent bronze rate holds outside the benchmark set.
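
To judge whether a rate measured on a fresh set of competitions really differs from the reported one, it helps to see how wide the sampling uncertainty is on only 75 tasks. The sketch below computes a Wilson score interval, treating each competition as an independent pass/fail trial, which is a simplification. The count of 13 successes is a hypothetical rounding of 16.9% of 75 (about 12.7), not a figure from the paper, and the function name is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical count: 13 of 75 competitions at bronze or better (assumed rounding).
low, high = wilson_interval(13, 75)
print(f"bronze rate interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 10% to 27%, so a fresh-set result would need to fall well outside that band before it counted as evidence that the original figure does not generalize.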


If this is right

Agents that clear the bronze threshold on these tasks can be expected to complete some practical ML pipelines without human intervention.
Differences in performance across model-scaffold pairs give a direct signal for which combinations are worth scaling further.
The public release of the benchmark allows systematic study of how added compute or reduced contamination changes agent success rates.
Future agent designs can be compared on the same fixed set of competitions rather than ad-hoc toy problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If agents continue to improve on this benchmark, more of the day-to-day work of training and tuning models could shift from human engineers to automated systems.
Extending the benchmark to competitions posted after the training cutoff of the tested models would isolate the effect of data contamination.
Success on Kaggle-style tasks may indicate readiness for other structured engineering domains that share the same workflow of data handling, model iteration, and evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note



The paper's main contribution is a new benchmark built from 75 Kaggle competitions, with human baselines taken directly from public leaderboards and an open-sourced evaluation harness. The headline result is that the strongest agent setup reaches at least bronze in 16.9% of the tasks. The authors also run scaling experiments and check for pre-training contamination, which adds useful data points beyond the raw success rate. That combination of external grounding and released code is what moves the field forward for people who want to measure agent progress on applied ML work.

The curation step is the soft spot. The abstract describes the set as diverse but gives no inclusion criteria, no breakdown by task type or data size, and no comparison to the broader Kaggle distribution. If the selected competitions skew toward smaller datasets or metrics that current agents handle quickly, the 16.9% figure will not generalize even within Kaggle. That concern, raised in the stress-test note, applies directly to the abstract and looks load-bearing for any claim about real-world ML engineering.

The measurement itself avoids circularity because scores come from external leaderboards rather than internal fits. The paper engages straightforwardly with the evaluation problem rather than overclaiming. Researchers working on agent scaffolds or on automating ML pipelines would find the numbers and the released code worth looking at. It is worth sending to peer review so the curation details and any additional validation can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper introduces MLE-bench, a benchmark of 75 curated Kaggle competitions designed to evaluate AI agents on machine learning engineering tasks including model training, dataset preparation, and experimentation. Human baselines are established from public Kaggle leaderboards. Evaluations of frontier models using open-source scaffolds show that o1-preview with AIDE scaffolding reaches at least bronze-medal performance in 16.9% of the competitions. The work additionally examines resource scaling and pre-training contamination effects and releases the benchmark code.

Significance. If the 75 competitions constitute a representative sample, the benchmark supplies an externally validated measure of agent performance against real human competitors on Kaggle, avoiding circularity in scoring. The open-sourcing of the code and the use of public leaderboards are concrete strengths that enable reproducibility and future extensions.

major comments (2)

[Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.
[Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.
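
One way the mapping from an agent's score to a medal threshold could work, given only a public leaderboard, is sketched below. This is an illustration of the kind of procedure the second comment asks the authors to document, not the paper's method: the `top_fraction` parameter and both function names are hypothetical, and Kaggle's actual medal cutoffs depend on the number of participating teams.

```python
import math


def threshold_score(leaderboard, top_fraction, higher_is_better=True):
    """Score of the last team that still falls inside the top `top_fraction`."""
    if not leaderboard:
        raise ValueError("leaderboard must not be empty")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    # Best score first, whichever direction the metric runs.
    ranked = sorted(leaderboard, reverse=higher_is_better)
    cutoff_rank = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[cutoff_rank - 1]


def meets_threshold(agent_score, leaderboard, top_fraction, higher_is_better=True):
    """True if the agent's score would rank at or above the cutoff position."""
    cutoff = threshold_score(leaderboard, top_fraction, higher_is_better)
    if higher_is_better:
        return agent_score >= cutoff
    return agent_score <= cutoff


# Hypothetical leaderboard of ten teams; 0.5 is an assumed cutoff fraction.
board = [0.91, 0.88, 0.85, 0.80, 0.72, 0.70, 0.65, 0.60, 0.55, 0.40]
print(meets_threshold(0.75, board, top_fraction=0.5))
```

Making the metric direction an explicit argument matters because Kaggle competitions mix higher-is-better metrics (accuracy, AUC) with lower-is-better ones (RMSE, log loss), and a silent mismatch would invert the medal decision.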

minor comments (2)

Figure captions and legends would benefit from explicit mapping of each bar or line to the corresponding model-plus-scaffold combination.
A short table summarizing the distribution of competition types and medal thresholds across the 75 tasks would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details on curation and evaluation protocols.


Referee: [Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.

Authors: We agree that explicit criteria and breakdowns are needed to support the representativeness claim. In the revision we will add a dedicated subsection with: (1) explicit inclusion criteria (ML-focused competitions with public leaderboards and adequate participation) and exclusion criteria (non-ML tasks, deprecated or low-activity competitions); (2) a quantitative table breaking down the 75 tasks by type (tabular/image/NLP), dataset size bins, and competition age; and (3) a short comparison of the selected set against the broader Kaggle corpus in terms of popularity and difficulty distribution. These additions will clarify how the 16.9% figure should be interpreted. revision: yes
Referee: [Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.

Authors: We agree that more granular protocol details are required for full assessment. Although the manuscript references open-source scaffolds and Kaggle leaderboards, the revision will expand the evaluation section to specify: agent interaction parameters (turn limits, tool constraints, termination rules); the precise mapping from agent submissions to bronze thresholds using the public leaderboards; and quantitative contamination analysis (methods and results of pre-training overlap checks). These changes will strengthen reproducibility and support for the reported performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity; central metric anchored to external Kaggle leaderboards


The paper's headline result (16.9% bronze-medal rate for o1-preview + AIDE) is obtained by direct comparison of agent submissions against publicly available Kaggle leaderboards for the 75 curated competitions. This external reference prevents any reduction of the reported percentage to an internally fitted parameter, self-defined threshold, or self-citation chain. The curation step itself is an input choice rather than a derived claim, and no equations or uniqueness theorems are invoked that collapse back onto the paper's own definitions. Minor self-citations (e.g., to prior OpenAI agent work) appear but are not load-bearing for the performance numbers. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the premise that Kaggle competitions constitute a valid proxy for ML engineering capability and that bronze medal placement is a meaningful success threshold; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Kaggle competitions are representative of real-world ML engineering tasks
The paper states it curates competitions to test 'real-world ML engineering skills' and uses Kaggle leaderboards as human baselines.

pith-pipeline@v0.9.0 · 5735 in / 1238 out tokens · 34009 ms · 2026-05-23T19:08:42.716037+00:00 · methodology


Original abstract

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup--OpenAI's o1-preview with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.

Figures

Figures reproduced from arXiv: 2410.07095 by Aleksander Mądry, Dane Sherburn, Evan Mays, Giulio Starace, James Aung, Jun Shern Chan, Kevin Liu, Leon Maksin, Lilian Weng, Neil Chowdhury, Oliver Jaffe, Tejal Patwardhan.

**Figure 2.** Excerpts of real trajectories from 3 different agent frameworks attempting competitions

**Figure 3.** The percentage of medals achieved increases with the number of attempts allowed. GPT…

**Figure 5.** We observe no positive relationship between GPT-4o’s familiarity with the competition and its performance (score normalized between the sample submission score and the gold medal score for that competition).

**Figure 6.** MLE-bench contains a total of 75 competitions spanning 15 diverse problem categories.

**Figure 7.** For every medal-winning submission of gpt-4o AIDE and o1-preview AIDE, we take the…

**Figure 8.** The prompt with the overall instructions that we initiate all scaffolds with.

**Figure 9.** The percentage of attempts where models achieved any medal on each competition, plot…
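
The normalization mentioned in the Figure 5 caption can be read as a linear rescaling in which the sample submission maps to 0 and the gold medal score maps to 1. The sketch below is an interpretation of that caption only, not the paper's code; the function and argument names are hypothetical.

```python
def normalized_score(score: float, sample_score: float, gold_score: float) -> float:
    """Rescale `score` so the sample submission is 0.0 and the gold threshold is 1.0."""
    span = gold_score - sample_score
    if span == 0:
        raise ValueError("gold_score and sample_score must differ")
    # A negative span handles lower-is-better metrics without a separate branch.
    return (score - sample_score) / span


# Hypothetical values: an accuracy-style metric, then an error-style metric.
print(normalized_score(0.80, sample_score=0.50, gold_score=0.90))
print(normalized_score(0.60, sample_score=1.00, gold_score=0.20))
```

Values above 1 mean the submission beats the gold threshold, and values below 0 mean it does worse than the sample submission, which puts competitions with different metrics on one comparable axis.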


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
cs.AI 2026-06 unverdicted novelty 8.0

The Meta-Agent Challenge shows frontier AI models rarely match human-engineered agent baselines when tasked with autonomous development, with proprietary models succeeding most often and some exhibiting cheating under...
RuBench: A Repository-Level Agentic Coding Benchmark with Natively Authored Russian Task Specifications
cs.SE 2026-07 conditional novelty 7.0

RuBench introduces 25 repository-level coding tasks with natively authored Russian specifications graded by private maintainer oracles, evaluates four deployed agent products, and documents a product-level safeguard t...
Glite ARF: Verifier-Driven Research with Parallel LLM Coding Agents
cs.MA 2026-06 accept novelty 7.0

Glite ARF introduces a verifier-driven three-role framework for parallel LLM coding agents, demonstrated by first- and second-place finishes in the BEA 2026 vocabulary-difficulty shared task across three languages wit...
Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering
cs.AI 2026-06 unverdicted novelty 7.0

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable AI agent evaluation using deterministic constraint-checking evaluators.
Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering
cs.AI 2026-06 unverdicted novelty 7.0

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.
Agentic AutoResearch for Space Autonomy: An Auditable, LLM-Driven Research Agent for Aerospace Control Problems
cs.RO 2026-06 unverdicted novelty 7.0

An LLM-driven agent with built-in seed-noise audits develops control policies for two aerospace problems that outperform undirected search and pass verification checks.
AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility
cs.AI 2026-06 unverdicted novelty 7.0

AgentBeats implements agentified evaluation of diverse AI agents through standardized interfaces, validated at scale in a five-month competition with 298 judges and 467 subjects plus a coding case study.
Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories
cs.CV 2026-06 unverdicted novelty 7.0

Data2Story is a multi-agent framework that generates evidence-grounded multimodal articles from data, evaluated on 18 articles against human pieces for verifiability, angle coverage, and quality across human, rubric, ...
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
cs.AI 2026-06 conditional novelty 7.0

AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
cs.AI 2026-05 conditional novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
cs.AI 2026-05 unverdicted novelty 7.0

DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
What Do Evolutionary Coding Agents Evolve?
cs.NE 2026-05 unverdicted novelty 7.0

Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
cs.CV 2026-05 accept novelty 7.0

WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
cs.AI 2026-05 unverdicted novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
cs.AI 2026-05 unverdicted novelty 7.0

WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
cs.SE 2026-05 conditional novelty 7.0

DiagEval is a new diagnostic protocol that conditions on failed trajectories to attribute GUI-agent evaluation failures, recovering 45-62% of misattributed cases and lifting accuracy 8-16 points on two benchmarks.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
cs.LG 2026-05 unverdicted novelty 7.0

FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
cs.CE 2026-05 unverdicted novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
cs.AI 2026-05 unverdicted novelty 7.0

SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation
cs.CL 2026-05 unverdicted novelty 7.0

GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
cs.LG 2026-05 conditional novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
cs.LG 2026-05 unverdicted novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
cs.AI 2026-05 conditional novelty 7.0

BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 7.0

SkillSafetyBench is a benchmark of 155 cases across 47 tasks and 6 risk domains showing that non-user attacks via skills, artifacts, or environments can consistently induce unsafe agent behavior.
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
cs.AI 2026-05 unverdicted novelty 7.0

Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
AcademiClaw: When Students Set Challenges for AI Agents
cs.AI 2026-05 unverdicted novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
cs.AI 2026-04 unverdicted novelty 7.0

LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
cs.CL 2026-04 unverdicted novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems
cs.AI 2025-08 unverdicted novelty 7.0

KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.
Frontier Models are Capable of In-context Scheming
cs.AI 2024-12 conditional novelty 7.0

Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
cs.AI 2026-06 unverdicted novelty 6.0

OSWorld 2.0 is a benchmark of 108 realistic long-horizon computer-use tasks where current agents achieve only 20.6% binary completion, struggling with state inference and constraint tracking.
Learning the ARTS of Search for Automated Discovery
cs.AI 2026-06 unverdicted novelty 6.0

ARTS improves automated scientific discovery by using reasoning LMs with test-time training to separate hypothesis merit from execution quality in tree search, achieving 15.3% relative gains on 22 MLGym and MLEBench tasks.
Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization
cs.MA 2026-06 unverdicted novelty 6.0

An LLM-orchestrated multi-agent framework for end-to-end BDaaS automation with drift awareness is proposed and evaluated on tabular benchmarks for improved lifecycle reliability over baselines.
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
cs.CL 2026-06 unverdicted novelty 6.0

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
Can Generalist Agents Automate Data Curation?
cs.AI 2026-06 unverdicted novelty 6.0

Generalist agents reach published data-selection baselines but require scaffolds forcing method adaptation to autonomously compose a policy that outperforms baselines at one-tenth the data budget.
VESTA: Visual Exploration with Statistical Tool Agents
cs.AI 2026-05 unverdicted novelty 6.0

VESTA introduces dynamic tool creation for VLMs that outperforms static-tool and no-tool baselines on distribution fitting, time series, and astronomy tasks in the new DAWN benchmark.
SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?
cs.LG 2026-05 conditional novelty 6.0

SoundnessBench shows frontier LLMs exhibit pervasive optimism bias when rating the soundness of ML research proposals, frequently calling low-soundness ideas sound under standard prompts.
ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
cs.AI 2026-05 unverdicted novelty 6.0

ScientistOne introduces Chain-of-Evidence and an audit system that achieves zero hallucinated references, perfect score verification, and top method-code alignment while matching or beating human experts on five front...
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
cs.AI 2026-05 unverdicted novelty 6.0

AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7...
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
cs.CV 2026-05 unverdicted novelty 6.0

WildRoadBench is a new dual-track benchmark on professionally annotated wild UAV road-damage images showing closed-source VLMs lead but leave over half the AP_50 metric on the table while agents lag and open-source mo...
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
cs.AI 2026-05 unverdicted novelty 6.0

SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
How Far Are We From True Auto-Research?
cs.AI 2026-05 unverdicted novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
cs.SE 2026-05 unverdicted novelty 6.0

DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% o...
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
cs.LG 2026-05 accept novelty 6.0

FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
cs.LG 2026-05 conditional novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
cs.CR 2026-05 unverdicted novelty 6.0

SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
DataMaster: Data-Centric Autonomous AI Research
cs.LG 2026-05 unverdicted novelty 6.0

DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...
DataMaster: Data-Centric Autonomous AI Research
cs.LG 2026-05 unverdicted novelty 6.0

DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
On Benchmark Hacking in ML Contests: Modeling, Insights and Design
econ.GN 2026-04 unverdicted novelty 6.0

In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.
Evaluation-driven Scaling for Scientific Discovery
cs.LG 2026-04 unverdicted novelty 6.0

SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
cs.AI 2026-04 unverdicted novelty 6.0

TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
cs.AI 2026-04 unverdicted novelty 6.0

Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
In-Place Test-Time Training
cs.LG 2026-04 conditional novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
cs.CL 2026-04 unverdicted novelty 6.0

Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
cs.LG 2026-03 unverdicted novelty 6.0

Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations
cs.CL 2025-10 unverdicted novelty 6.0

xKG is a paper-centric knowledge base that extracts code and insights to improve LLM agent performance on AI research replication by 10.9% on PaperBench.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 76 Pith papers · 10 internal anchors

[1]

Anthropic's Responsible Scaling Policy, Version 1.0, September 2023

Anthropic. Anthropic's Responsible Scaling Policy, Version 1.0, September 2023

work page 2023
[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models, August 2021. URL http://arxiv.org/abs/2108.07732. arXiv:2108.07732 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[3]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying Memorization Across Neural Language Models, March 2023. URL http://arxiv.org/abs/2202.07646. arXiv:2202.07646 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Cognition Introducing Devin, the first AI software engineer, March 2024

cognition.ai. Cognition Introducing Devin, the first AI software engineer, March 2024. URL https://cognition.ai/

work page 2024
[6]

Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020

Rhiju Das, H Wayment-Steele, Do Soon Kim, Christian Choe, Bojan Tunguz, Walter Reade, and Maggie Demkin. Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020. URL https://kaggle.com/competitions/stanford-covid-vaccine

work page 2020
[7]

ConStat: Performance-Based Contamination Detection in Large Language Models

Jasper Dekoninck, Mark Niklas Müller, and Martin Vechev. ConStat: Performance-Based Contamination Detection in Large Language Models, May 2024. URL http://arxiv.org/abs/2405.16281. arXiv:2405.16281 [cs]

work page arXiv 2024
[8]

GitHub Copilot Workspace: Welcome to the Copilot-native developer environment, April 2024

Thomas Dohmke. GitHub Copilot Workspace: Welcome to the Copilot-native developer environment, April 2024. URL https://github.blog/news-insights/product-news/github-copilot-workspace/

work page 2024
[9]

Code Droid Technical Report, June 2024

factory.ai. Code Droid Technical Report, June 2024. URL https://www.factory.ai/news/code-droid-technical-report

work page 2024
[10]

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents, April 2024. URL http://arxiv.org/abs/2404.06411. arXiv:2404.06411 [cs]

work page arXiv 2024
[11]

Frontier Safety Framework, May 2024

Google DeepMind. Frontier Safety Framework, May 2024

work page 2024
[12]

Measuring Coding Challenge Competence With APPS

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring Coding Challenge Competence With APPS, November 2021. URL http://arxiv.org/abs/2105.09938. arXiv:2105.09938 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation, May 2024a. URL http://arxiv.org/abs/2312.13010. arXiv:2312.13010 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation. In Forty-first International Conference on Machine Learning, June 2024b. URL https://openreview.net/forum?id=1Fs1LvjYQW

work page 2024
[15]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code, June 2024. URL http://arxiv.org/abs/2403.07974. arXiv:2403.07974 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?, April 2024. URL http://arxiv.org/abs/2310.06770. arXiv:2310.06770 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

DSBench: How far are data science agents from becoming data science experts? arXiv preprint arXiv:2409.07703, 2024

Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?, September 2024. URL http://arxiv.org/abs/2409.07703. arXiv:2409.07703 [cs]

work page arXiv 2024
[18]

Kaggle Progression System, 2024

Kaggle. Kaggle Progression System, 2024. URL https://www.kaggle.com/progression

work page 2024
[19]

Research: quantifying GitHub Copilot’s impact on developer productivity and happiness, September 2022

Eirini Kalliamvakou. Research: quantifying GitHub Copilot’s impact on developer productivity and happiness, September 2022. URL https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

work page 2022
[20]

AI Agents That Matter

Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI Agents That Matter, July 2024. URL http://arxiv.org/abs/2407.01502. arXiv:2407.01502 [cs]

work page Pith review arXiv 2024
[21]

Competition-Level Code Generation with AlphaCode

Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien De Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...

work page doi:10.1126/science.abq1158 2022
[22]

AgentBench: Evaluating LLMs as Agents

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, October 2023. URL http://arxiv.org/abs/2308.03688....

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

Vesuvius challenge - ink detection, 2023

Alex Lourenco, Brent Seales, Christy Chapman, Daniel Havir, Ian Janicki, JP Posma, Nat Friedman, Ryan Holbrook, Seth P., Stephen Parsons, and Will Cukierski. Vesuvius challenge - ink detection, 2023. URL https://kaggle.com/competitions/vesuvius-challenge-ink-detection

work page 2023
[24]

Discovering and exploring cases of educational source code plagiarism with Dolos, 2024

Rien Maertens, Maarten Van Neyghem, Maxiem Geldhof, Charlotte Van Petegem, Niko Strijbol, Peter Dawyndt, and Bart Mesuere. Discovering and exploring cases of educational source code plagiarism with Dolos, 2024. URL https://github.com/dodona-edu/dolos. Publication Title: SoftwareX original-date: 2019-06-23T15:12:32Z

work page 2024
[25]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, November 2023. URL http://arxiv.org/abs/2311.12983. arXiv:2311.12983 [cs]

work page internal anchor Pith review Pith/arXiv arXiv 2023
[26]

Preparedness Framework, December 2023

OpenAI. Preparedness Framework, December 2023

work page 2023
[27]

Introducing Weco AIDE, April 2024

Dominik Schmidt, Zhengyao Jiang, and Yuxiang Wu. Introducing Weco AIDE, April 2024. URL https://www.weco.ai/blog/technical-report

work page 2024
[28]

ML-Bench: Evaluating large language models and agents for machine learning tasks on repository-level code. arXiv preprint arXiv:2311.09835, 2023

Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating Large Language Models and A...

work page arXiv 2024
[29]

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenDevin: An Open Platform for AI Soft...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

The shift from models to compound ai systems, 2024

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems, 2024. URL http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/

work page 2024
[31]

AutoCodeRover: Autonomous program improvement

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous Program Improvement, July 2024. URL http://arxiv.org/abs/2404.05427. arXiv:2404.05427 [cs]

work page arXiv 2024
[32]

Can GPT-4 perform neural architecture search? arXiv preprint arXiv:2304.10970, 2023

Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, and Samuel Albanie. Can GPT-4 Perform Neural Architecture Search?, August 2023. URL http://arxiv.org/abs/2304.10970. arXiv:2304.10970 [cs]

work page arXiv 2023
