super hub Baseline reference

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Alex Gu, Fanjia Yan, King Han, Naman Jain, Tianjun Zhang, Wen-Ding Li · 2024 · cs.SE · arXiv 2403.07974

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

255 Pith papers citing it

Baseline 55% of classified citations

open full Pith review browse 255 citing papers more from Alex Gu arXiv PDF

abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 25 background 16 baseline 1 contradiction 1 method 1

citation-polarity summary

use dataset 23 background 18 baseline 1 contest 1 unclear 1

claims ledger

abstract Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchma

authors

Alex Gu Fanjia Yan King Han Naman Jain Tianjun Zhang Wen-Ding Li

co-cited works

representative citing papers

Unsteady Metrics and Benchmarking Cultures of AI Model Builders

cs.AI · 2026-05-13 · accept · novelty 8.0

AI model builders mostly highlight unique benchmarks that act as flexible narrative tools for market positioning rather than standardized scientific measurements.

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

cs.CL · 2026-05-13 · unverdicted · novelty 8.0

FlowCompile performs compile-time design space exploration on structured LLM workflows to produce reusable high-quality configuration sets that outperform routing baselines with up to 6.4x speedup.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

LiveBench: A Challenging, Contamination-Limited LLM Benchmark

cs.CL · 2024-06-27 · unverdicted · novelty 8.0

LiveBench is a contamination-limited LLM benchmark with auto-scored challenging tasks from recent sources across math, coding, reasoning and more, where top models score below 70%.

TestEvo-Bench: An Executable and Live Benchmark for Test and Code Co-Evolution

cs.SE · 2026-07-02 · unverdicted · novelty 7.0

TestEvo-Bench supplies 746 test-generation and 509 test-update tasks from 152 Java repositories, each tied to actual commits and packaged for execution-based scoring, with current agents reaching 77.5% and 74.6% success respectively.

DecompRL: Solving Harder Problems by Learning Modular Code Generation

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

DecompRL is an RL method that learns modular code decomposition for LLMs, enabling exponential candidate generation via recombination to solve harder coding problems with lower GPU cost.

AxDafny: Agentic Verified Code Generation in Dafny

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.

AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.

RigorBench: Benchmarking Engineering Process Discipline in Autonomous AI Coding Agents

cs.SE · 2026-06-21 · unverdicted · novelty 7.0

RigorBench evaluates AI coding agents on process discipline via five pillars and reports 41% higher process scores and 17% better outcome correctness with structured approaches on 30 tasks.

Power Systems Agent Benchmark: Executable Evaluation of AI Agents in Electric Power Engineering

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Introduces the Power Systems Agent Benchmark with 41 task families across eight power engineering areas for executable evaluation of AI agents using deterministic feasibility checks.

BIM-Edit: Benchmarking Large Language Models for IFC-Based Building Information Modeling

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

BIM-Edit benchmark finds best LLM scores only 49.5% average across geometric, semantic, and topological metrics on 324 IFC editing tasks, with no model fully solving more than 3.4%.

Flaws in the LLM Automation Narrative

stat.OT · 2026-06-09 · unverdicted · novelty 7.0

A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.

The Hidden Bias of Process Reward Models:PRISM for Rewarding the Right Reasoning

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

PRISM is a contrastive, policy-aware training framework for process reward models that reduces false positives by 22% on PRMBench and boosts downstream accuracy up to 33% in Best-of-N selection by learning reliable relative comparisons instead of pointwise labels.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

SkelDPO: A Skeleton-Guided Direct Preference Optimization Framework for Efficient Code Generation

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

SkelDPO improves code generation efficiency by 2-7% over prior DPO methods via joint preference losses on full code and efficiency-critical skeletons.

Reinforcement Learning from Rich Feedback with Distributional DAgger

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

DistIL applies distributional DAgger with forward cross-entropy to achieve monotonic policy improvement and better Pass@N from rich feedback in RL for reasoning tasks.

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

ResMerge: Residual-based Spectral Merging of Large Language Models

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

ResMerge improves merging of RL expert LLMs via a stable residual consensus backbone plus gated head correction, outperforming task-vector and spectral baselines in capability preservation.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

cs.SE · 2026-05-25 · unverdicted · novelty 7.0

RepoMirage uses semantics-preserving perturbations on SWE-Bench to show code agents lack repository context reasoning, with performance falling sharply on extended structure tasks, and introduces RepoAnchor as a structure-first fix.

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

citing papers explorer

Showing 1 of 1 citing paper after filters.

A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 192 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer