MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Pith reviewed 2026-05-23 19:08 UTC · model grok-4.3
The pith
AI agents built on o1-preview with AIDE scaffolding reach at least Kaggle bronze-medal level in 16.9 percent of the benchmark's ML engineering competitions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MLE-bench shows that the best-performing frontier setup, o1-preview with AIDE scaffolding, reaches at least bronze-medal level in 16.9 percent of real Kaggle competitions, while the other model-scaffold combinations tested score lower against the same human baselines.
What carries the argument
MLE-bench, a set of 75 curated Kaggle competitions that test agents on end-to-end ML engineering tasks scored against public leaderboards.
If this is right
- Agents that clear the bronze threshold on these tasks can be expected to complete some practical ML pipelines without human intervention.
- Differences in performance across model-scaffold pairs give a direct signal for which combinations are worth scaling further.
- The public release of the benchmark allows systematic study of how added compute or reduced contamination changes agent success rates.
- Future agent designs can be compared on the same fixed set of competitions rather than ad-hoc toy problems.
Where Pith is reading between the lines
- If agents continue to improve on this benchmark, more of the day-to-day work of training and tuning models could shift from human engineers to automated systems.
- Extending the benchmark to competitions posted after the training cutoff of the tested models would isolate the effect of data contamination.
- Success on Kaggle-style tasks may indicate readiness for other structured engineering domains that share the same workflow of data handling, model iteration, and evaluation.
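One way to picture the contamination check suggested in the list above is to split results by whether each competition closed before or after a model's training cutoff, then compare bronze rates on the two sides. The following is a minimal Python sketch under stated assumptions: the `Result` record, its field names, the cutoff date, and all data are hypothetical illustrations, not taken from the paper or the MLE-bench code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    """One agent run on one competition (hypothetical record layout)."""
    competition: str
    closed_on: date   # date the competition ended (assumed to be known)
    medal: bool       # True if the agent reached at least bronze


def bronze_rate(results):
    """Fraction of results with at least a bronze medal; None if empty."""
    if not results:
        return None
    return sum(r.medal for r in results) / len(results)


def split_by_cutoff(results, cutoff):
    """Partition results into (closed on or before cutoff, closed after)."""
    seen = [r for r in results if r.closed_on <= cutoff]
    unseen = [r for r in results if r.closed_on > cutoff]
    return seen, unseen


# Made-up data purely for illustration.
results = [
    Result("comp-a", date(2019, 5, 1), True),
    Result("comp-b", date(2021, 3, 15), False),
    Result("comp-c", date(2024, 2, 1), False),
    Result("comp-d", date(2024, 6, 30), True),
]
cutoff = date(2023, 10, 1)  # hypothetical training cutoff

seen, unseen = split_by_cutoff(results, cutoff)
rate_seen = bronze_rate(seen)
rate_unseen = bronze_rate(unseen)
print("possibly seen in training:", rate_seen)
print("posted after cutoff:", rate_unseen)
```

A large gap between the two rates would be consistent with contamination, although differences in task difficulty between older and newer competitions could produce the same pattern.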
Load-bearing premise
The 75 selected Kaggle competitions capture the skills and challenges that define real-world machine learning engineering.
What would settle it
Re-running the same agent setups on a fresh collection of Kaggle competitions that were never used in the original curation would show whether the 16.9 percent bronze rate holds outside the benchmark set.
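Whether a rate "holds" on a fresh set depends on how much sampling noise a 75-competition benchmark carries. As a rough illustration, the sketch below computes a Wilson score interval for a bronze rate. It rests on assumptions that are not in the source: each competition is treated as an independent trial with a common success probability, and the count of 13 successes out of 75 is a hypothetical round figure (16.9 percent of 75 is 12.675, which is not a whole number of competitions).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 13 bronze-or-better results out of 75 competitions.
low, high = wilson_interval(13, 75)
print(f"bronze rate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 10 to 27 percent, so a fresh set of similar size would need a fairly large shift in the observed rate before it counted as evidence against the original figure.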
Original abstract
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering. To this end, we curate 75 ML engineering-related competitions from Kaggle, creating a diverse set of challenging tasks that test real-world ML engineering skills such as training models, preparing datasets, and running experiments. We establish human baselines for each competition using Kaggle's publicly available leaderboards. We use open-source agent scaffolds to evaluate several frontier language models on our benchmark, finding that the best-performing setup, OpenAI's o1-preview with AIDE scaffolding, achieves at least the level of a Kaggle bronze medal in 16.9% of competitions. In addition to our main results, we investigate various forms of resource scaling for AI agents and the impact of contamination from pre-training. We open-source our benchmark code (github.com/openai/mle-bench/) to facilitate future research in understanding the ML engineering capabilities of AI agents.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MLE-bench, a benchmark of 75 curated Kaggle competitions designed to evaluate AI agents on machine learning engineering tasks including model training, dataset preparation, and experimentation. Human baselines are established from public Kaggle leaderboards. Evaluations of frontier models using open-source scaffolds show that o1-preview with AIDE scaffolding reaches at least bronze-medal performance in 16.9% of the competitions. The work additionally examines resource scaling and pre-training contamination effects and releases the benchmark code.
Significance. If the 75 competitions constitute a representative sample, the benchmark supplies an externally validated measure of agent performance against real human competitors on Kaggle, avoiding circularity in scoring. The open-sourcing of the code and the use of public leaderboards are concrete strengths that enable reproducibility and future extensions.
major comments (2)
- [Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.
- [Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.
minor comments (2)
- Figure captions and legends would benefit from explicit mapping of each bar or line to the corresponding model-plus-scaffold combination.
- A short table summarizing the distribution of competition types and medal thresholds across the 75 tasks would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details on curation and evaluation protocols.
- Referee: [Benchmark construction / curation section] The curation description states that the authors selected a 'diverse set' of 75 competitions but provides no explicit inclusion/exclusion criteria, no quantitative breakdown of task types (tabular vs. image vs. NLP), dataset sizes, or competition age, and no comparison against the full Kaggle corpus. This selection process directly determines the denominator of every reported success rate and is therefore load-bearing for the claim that the 16.9% bronze figure reflects general ML-engineering capability.
  Authors: We agree that explicit criteria and breakdowns are needed to support the representativeness claim. In the revision we will add a dedicated subsection with: (1) explicit inclusion criteria (ML-focused competitions with public leaderboards and adequate participation) and exclusion criteria (non-ML tasks, deprecated or low-activity competitions); (2) a quantitative table breaking down the 75 tasks by type (tabular/image/NLP), dataset size bins, and competition age; and (3) a short comparison of the selected set against the broader Kaggle corpus in terms of popularity and difficulty distribution. These additions will clarify how the 16.9% figure should be interpreted.
  Revision: yes
- Referee: [Evaluation protocol and results sections] The abstract and evaluation sections supply no details on the precise agent interaction protocols (e.g., number of turns, tool-use constraints, or termination conditions), the exact procedure for mapping agent submissions to bronze thresholds, or the quantitative checks performed for contamination. Without these, the support for the headline 16.9% result cannot be fully assessed.
  Authors: We agree that more granular protocol details are required for full assessment. Although the manuscript references open-source scaffolds and Kaggle leaderboards, the revision will expand the evaluation section to specify: agent interaction parameters (turn limits, tool constraints, termination rules); the precise mapping from agent submissions to bronze thresholds using the public leaderboards; and quantitative contamination analysis (methods and results of pre-training overlap checks). These changes will strengthen reproducibility and support for the reported performance.
  Revision: yes
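As an illustration of what a mapping from a submission score to a bronze threshold could look like, the Python sketch below ranks a score against a leaderboard and compares that rank with a bronze cutoff. The thresholds encoded here are an assumption based on Kaggle's published progression rules (bronze as top 40% for fewer than 250 teams, top 100 for 250 to 999 teams, top 10% for 1000 or more) and should be checked against the current Kaggle page. The function names, the rounding-down choice, and the tie handling are hypothetical and are not the paper's procedure.

```python
def bronze_cutoff_rank(num_teams):
    """Worst rank that still earns bronze, under the assumed thresholds."""
    if num_teams < 1:
        raise ValueError("num_teams must be at least 1")
    if num_teams < 250:
        # Top 40%, rounded down (rounding direction is an assumption).
        return max(1, num_teams * 40 // 100)
    if num_teams < 1000:
        return 100                      # fixed top-100 cutoff
    return num_teams // 10              # top 10%


def rank_of(score, leaderboard, higher_is_better=True):
    """Rank the score would take on the leaderboard (1 = best).

    Ties share the better rank: only strictly better scores count.
    """
    if higher_is_better:
        better = sum(1 for s in leaderboard if s > score)
    else:
        better = sum(1 for s in leaderboard if s < score)
    return better + 1


def earns_bronze(score, leaderboard, higher_is_better=True):
    """True if the score would place at or above the bronze cutoff."""
    cutoff = bronze_cutoff_rank(len(leaderboard))
    return rank_of(score, leaderboard, higher_is_better) <= cutoff


# Made-up leaderboard of ten teams (accuracy-style metric).
leaderboard = [0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50, 0.45]
print(earns_bronze(0.78, leaderboard))  # would rank 4th of 10
print(earns_bronze(0.72, leaderboard))  # would rank 5th of 10
```

The `higher_is_better` flag matters because Kaggle metrics differ in direction (accuracy versus log loss, for example), so any real mapping has to record the direction per competition.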
Circularity Check
No significant circularity; central metric anchored to external Kaggle leaderboards
The paper's headline result (16.9% bronze-medal rate for o1-preview + AIDE) is obtained by direct comparison of agent submissions against publicly available Kaggle leaderboards for the 75 curated competitions. This external reference prevents any reduction of the reported percentage to an internally fitted parameter, self-defined threshold, or self-citation chain. The curation step itself is an input choice rather than a derived claim, and no equations or uniqueness theorems are invoked that collapse back onto the paper's own definitions. Minor self-citations (e.g., to prior OpenAI agent work) appear but are not load-bearing for the performance numbers. The headline metric is therefore anchored to an external reference rather than to the paper's own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Kaggle competitions are representative of real-world ML engineering tasks
Forward citations
Cited by 56 Pith papers
-
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
-
Declarative Data Services: Structured Agentic Discovery for Composing Data Systems
DDS decomposes agentic data-system composition into bounded sub-searches via intent, operator DAG, per-system skills, and runtime attribution contracts, turning runtime failures into cited skill patches.
-
What Do Evolutionary Coding Agents Evolve?
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
-
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
WildRoadBench provides a professionally annotated UAV corpus and dual-track protocol showing frontier VLMs and LLM agents achieve limited performance on wild aerial road-damage grounding under unified metrics.
-
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under p...
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a benchmark that evaluates coding agents by having them generate browser-native games from specifications, then running those games in a real browser to assign EXCELLENT, USABLE, or UNUSABLE labels, wi...
-
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
WebGameBench is a new benchmark that evaluates coding agents on building browser-native games from frozen specifications, with runtime browser evaluation showing best agents reach 76.9% usable rate but only 20.2% exce...
-
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
DiagEval is a new diagnostic protocol that conditions on failed trajectories to attribute GUI-agent evaluation failures, recovering 45-62% of misattributed cases and lifting accuracy 8-16 points on two benchmarks.
-
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
-
SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
SMCEvolve applies Sequential Monte Carlo sampling to LLM program search with adaptive resampling, mutation mixtures, and convergence control, delivering finite-sample complexity bounds and benchmark gains over prior systems.
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
-
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
-
SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
LLMs predict outcomes of real scientific experiments at 14-26% accuracy, comparable to human experts, but lack calibration on prediction reliability while humans demonstrate strong calibration.
-
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
-
KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems
KompeteAI accelerates AutoML pipeline evaluation 6.9 times and beats prior systems by 3% on MLE-Bench through candidate merging, external RAG, and predictive early scoring.
-
Frontier Models are Capable of In-context Scheming
Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.
-
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw presents a multi-agent autonomous research pipeline with debate, self-healing execution, verifiable reporting, human-in-the-loop modes, and cross-run evolution that outperforms AI Scientist v2 by 54.7...
-
What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents
SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.
-
How Far Are We From True Auto-Research?
ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
-
DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents
DiagEval applies trajectory-conditioned diagnostic probes to recover 45.6-62.1% of misattributed failures in GUI-agent software evaluation, raising accuracy from 69.9% to 78.3% on WebDevJudge-Unit and 65.0% to 81.6% o...
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
-
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...
-
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench shows that localized non-user attacks via skills and artifacts can consistently induce unsafe agent behavior across domains and model backends, independent of user intent.
-
DataMaster: Data-Centric Autonomous AI Research
DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...
-
DataMaster: Data-Centric Autonomous AI Research
DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
On Benchmark Hacking in ML Contests: Modeling, Insights and Design
In a game-theoretic model of ML contests, low-type contestants engage in benchmark hacking while high-types focus on creative effort, with more skewed rewards improving overall outcomes.
-
Evaluation-driven Scaling for Scientific Discovery
SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster ...
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
In-Place Test-Time Training
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
-
Multilingual Prompt Localization for Agent-as-a-Judge: Language and Backbone Sensitivity in Requirement-Level Evaluation
Localizing judge prompts to five languages shows that LLM backbones interact with language in agent-as-a-judge evaluations, inverting rankings and revealing no universal best model with low inter-judge agreement.
-
Reasoning as Gradient: Scaling MLE Agents Beyond Tree Search
Gome reaches 35.1% any-medal rate on MLE-Bench by mapping reasoning to gradient-based updates, outperforming tree search once models are sufficiently capable.
-
What Makes AI Research Replicable? Executable Knowledge Graphs as Scientific Knowledge Representations
xKG is a paper-centric knowledge base that extracts code and insights to improve LLM agent performance on AI research replication by 10.9% on PaperBench.
-
MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
MachineLearningLM uses continued pretraining on SCM-synthesized ML tasks with random-forest distillation to give LLMs robust many-shot in-context learning on tabular classification, reaching random-forest accuracy lev...
-
RExBench: Can coding agents autonomously implement AI research extensions?
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasonin...
-
PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents
PACEvolve++ uses a phase-adaptive reinforcement learning advisor to decouple hypothesis selection from execution in LLM-driven evolutionary search, delivering faster convergence than prior frameworks on load balancing...
-
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
-
Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks
Spatial Atlas implements compute-grounded reasoning via a structured scene graph engine and deterministic computations to deliver competitive accuracy on spatial QA and Kaggle ML benchmarks while preserving interpretability.
-
From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments
An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.
-
AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering
AceGRPO trains 30B-parameter LLM agents to achieve 100% valid submissions and competitive performance on MLE-Bench-Lite through evolving data buffers and adaptive task sampling.
-
End-to-end PDDL Planning with Hardcoded and Dynamic Agents
An end-to-end LLM framework refines natural language into valid PDDL domains and problems via hardcoded and dynamic agents, generates plans with standard engines, and returns readable output.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
-
Europe and the Geopolitics of AGI: The Need for a Preparedness Plan
AGI may arrive by 2030-2040 and reshape global power balances, requiring Europe to close gaps in compute, talent retention, industrial adoption, and unified policy responses through a coordinated preparedness agenda.
-
A Survey of Reinforcement Learning for Large Reasoning Models
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
Reference graph
Works this paper leans on
-
[1]
Anthropic's Responsible Scaling Policy , Version 1.0, September 2023
Anthropic . Anthropic's Responsible Scaling Policy , Version 1.0, September 2023
work page 2023
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program Synthesis with Large Language Models , August 2021. URL http://arxiv.org/abs/2108.07732. arXiv:2108.07732 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying Memorization Across Neural Language Models , March 2023. URL http://arxiv.org/abs/2202.07646. arXiv:2202.07646 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[5]
Cognition Introducing Devin , the first AI software engineer, March 2024
cognition.ai . Cognition Introducing Devin , the first AI software engineer, March 2024. URL https://cognition.ai/
work page 2024
-
[6]
Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020
Rhiju Das, H Wayment-Steele, Do Soon Kim, Christian Choe, Bojan Tunguz, Walter Reade, and Maggie Demkin. Openvaccine: Covid-19 mrna vaccine degradation prediction, 2020. URL https://kaggle.com/competitions/stanford-covid-vaccine
work page 2020
-
[7]
ConStat : Performance - Based Contamination Detection in Large Language Models , May 2024
Jasper Dekoninck, Mark Niklas Müller, and Martin Vechev. ConStat : Performance - Based Contamination Detection in Large Language Models , May 2024. URL http://arxiv.org/abs/2405.16281. arXiv:2405.16281 [cs]
-
[8]
GitHub Copilot Workspace : Welcome to the Copilot -native developer environment, April 2024
Thomas Dohmke. GitHub Copilot Workspace : Welcome to the Copilot -native developer environment, April 2024. URL https://github.blog/news-insights/product-news/github-copilot-workspace/
work page 2024
-
[9]
Code Droid Technical Report , June 2024
factory.ai . Code Droid Technical Report , June 2024. URL https://www.factory.ai/news/code-droid-technical-report
work page 2024
-
[10]
AgentQuest : A Modular Benchmark Framework to Measure Progress and Improve LLM Agents , April 2024
Luca Gioacchini, Giuseppe Siracusano, Davide Sanvito, Kiril Gashteovski, David Friede, Roberto Bifulco, and Carolin Lawrence. AgentQuest : A Modular Benchmark Framework to Measure Progress and Improve LLM Agents , April 2024. URL http://arxiv.org/abs/2404.06411. arXiv:2404.06411 [cs]
-
[11]
Frontier Safety Framework , May 2024
Google DeepMind . Frontier Safety Framework , May 2024
work page 2024
-
[12]
Measuring Coding Challenge Competence With APPS
Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. Measuring Coding Challenge Competence With APPS , November 2021. URL http://arxiv.org/abs/2105.09938. arXiv:2105.09938 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, and Heming Cui. AgentCoder : Multi - Agent -based Code Generation with Iterative Testing and Optimisation , May 2024 a . URL http://arxiv.org/abs/2312.13010. arXiv:2312.13010 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
MLAgentBench : Evaluating Language Agents on Machine Learning Experimentation
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench : Evaluating Language Agents on Machine Learning Experimentation . In Forty-first International Conference on Machine Learning, June 2024 b . URL https://openreview.net/forum?id=1Fs1LvjYQW
work page 2024
-
[15]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench : Holistic and Contamination Free Evaluation of Large Language Models for Code , June 2024. URL http://arxiv.org/abs/2403.07974. arXiv:2403.07974 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE -bench: Can Language Models Resolve Real - World GitHub Issues ?, April 2024. URL http://arxiv.org/abs/2310.06770. arXiv:2310.06770 [cs]
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
DSBench : How Far Are Data Science Agents to Becoming Data Science Experts ?, September 2024
Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, and Dong Yu. DSBench : How Far Are Data Science Agents to Becoming Data Science Experts ?, September 2024. URL http://arxiv.org/abs/2409.07703. arXiv:2409.07703 [cs]
-
[18]
Kaggle Progression System Kaggle , 2024
Kaggle . Kaggle Progression System Kaggle , 2024. URL https://www.kaggle.com/progression
work page 2024
-
[19]
Eirini Kalliamvakou. Research: quantifying GitHub Copilot ’s impact on developer productivity and happiness, September 2022. URL https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
work page 2022
-
[20]
Siegel, Nitya Nadgir, and Arvind Narayanan
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI Agents That Matter , July 2024. URL http://arxiv.org/abs/2407.01502. arXiv:2407.01502 [cs]
-
[21]
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien De Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushm...
-
[22]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench : Evaluating LLMs as Agents , October 2023. URL http://arxiv.org/abs/2308.03688....
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Vesuvius challenge - ink detection, 2023
Alex Lourenco, Brent Seales, Christy Chapman, Daniel Havir, Ian Janicki, JP Posma, Nat Friedman, Ryan Holbrook, Seth P., Stephen Parsons, and Will Cukierski. Vesuvius challenge - ink detection, 2023. URL https://kaggle.com/competitions/vesuvius-challenge-ink-detection
work page 2023
-
[24]
Rien Maertens, Maarten Van Neyghem, Maxiem Geldhof, Charlotte Van Petegem, Niko Strijbol, Peter Dawyndt, and Bart Mesuere. Discovering and exploring cases of educational source code plagiarism with Dolos, 2024. URL https://github.com/dodona-edu/dolos. Publication Title: SoftwareX original-date: 2019-06-23T15:12:32Z
-
[25]
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for General AI Assistants, November 2023. URL http://arxiv.org/abs/2311.12983. arXiv:2311.12983 [cs]
-
[26]
OpenAI. Preparedness Framework, December 2023
-
[27]
Dominik Schmidt, Zhengyao Jiang, and Yuxiang Wu. Introducing Weco AIDE, April 2024. URL https://www.weco.ai/blog/technical-report
-
[28]
Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating Large Language Models and A...
-
[29]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. OpenDevin: An Open Platform for AI Soft...
-
[30]
Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. The shift from models to compound ai systems, 2024. URL http://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
-
[31]
Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. AutoCodeRover: Autonomous Program Improvement, July 2024. URL http://arxiv.org/abs/2404.05427. arXiv:2404.05427 [cs]
-
[32]
Mingkai Zheng, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu, and Samuel Albanie. Can GPT-4 Perform Neural Architecture Search?, August 2023. URL http://arxiv.org/abs/2304.10970. arXiv:2304.10970 [cs]