MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
Avo: Agentic variation operators for autonomous evolutionary search, 2026
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
background 4polarities
background 4representative citing papers
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.
CUDABEAVER benchmark and pass@k(M,C,A) metric show LLM CUDA debugging success drops by up to 40 percentage points under strict performance requirements.
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
citing papers explorer
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
-
CUDAHercules: Benchmarking Hardware-Aware Expert-level CUDA Optimization for LLMs
CUDAHercules benchmark demonstrates that leading LLMs generate functional CUDA code but fail to recover expert-level optimization strategies needed for peak performance on Ampere, Hopper, and Blackwell GPUs.
-
GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization
LLMs can forecast GPU kernel performance accurately enough to serve as selective surrogates, allowing kernel searches to consider more candidates and recover faster kernels under fixed GPU evaluation budgets.
-
CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging
CUDABEAVER benchmark and pass@k(M,C,A) metric show LLM CUDA debugging success drops by up to 40 percentage points under strict performance requirements.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.