pith. sign in

super hub Baseline reference

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Baseline reference. 55% of citing Pith papers use this work as a benchmark or comparison.

259 Pith papers citing it
Baseline 55% of classified citations
abstract

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchmark also focuses on a broader range of code related capabilities, such as self-repair, code execution, and test output prediction, beyond just code generation. Currently, LiveCodeBench hosts four hundred high-quality coding problems that were published between May 2023 and May 2024. We have evaluated 18 base LLMs and 34 instruction-tuned LLMs on LiveCodeBench. We present empirical findings on contamination, holistic performance comparisons, potential overfitting in existing benchmarks as well as individual model comparisons. We will release all prompts and model completions for further community analysis, along with a general toolkit for adding new scenarios and model

hub tools

citation-role summary

dataset 25 background 16 baseline 1 contradiction 1 method 1

citation-polarity summary

claims ledger

  • abstract Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation benchmarks (e.g., HumanEval, MBPP) are no longer sufficient for assessing their capabilities. In this work, we propose LiveCodeBench, a comprehensive and contamination-free evaluation of LLMs for code, which continuously collects new problems over time from contests across three competition platforms, namely LeetCode, AtCoder, and CodeForces. Notably, our benchma

authors

co-cited works

clear filters

representative citing papers

AxDafny: Agentic Verified Code Generation in Dafny

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

AxDafny achieves 92.7% verification success on DafnyBench (6.5 points above prior proof-hint baselines) via verifier-guided repair and introduces the LCB-Pro-Dafny benchmark of 250 problems.

AlgoBench: Benchmarking Algorithmic Adaptation in Code Generation

cs.SE · 2026-06-30 · unverdicted · novelty 7.0

AlgoBench creates traceable variants of competitive programming problems via constraint shifts that invalidate original algorithms, paired with complexity metrics that reveal LLMs often produce functionally correct but asymptotically unsuitable solutions.

Flaws in the LLM Automation Narrative

stat.OT · 2026-06-09 · unverdicted · novelty 7.0

A new code-writing data analysis benchmark shows human experts outperforming a frontier LLM on average with lower performance variance.

On the Geometry of On-Policy Distillation

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

OPD updates occupy a relaxed off-principal regime and rapidly lock into a low-dimensional subspace that is functionally sufficient for its performance, distinct from SFT and RLVR trajectories.

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Think Anywhere in Code Generation cs.SE · 2026-03-31 · unverdicted · none · ref 11 · internal anchor

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  • You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 9 · internal anchor

    DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

  • Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation cs.SE · 2026-04-20 · unverdicted · none · ref 13 · internal anchor

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  • Generating Verifiable Chain of Thoughts from Exection-Traces cs.SE · 2025-11-28 · unverdicted · none · ref 11 · internal anchor

    A pipeline produces 54,000 execution-trace-verified bi-directional Chain-of-Thought rationales for code, and fine-tuning on them yields gains up to 26.6 points on LiveCodeBench-Exec and similar benchmarks.