arXiv preprint arXiv:2401.15963

DebugBench: Evaluating Debugging Capability of Large Language Models · 2024 · arXiv 2401.15963

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

cs.SE · 2026-04-19 · unverdicted · novelty 7.0 · 2 refs

The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

citing papers explorer

Showing 2 of 2 citing papers.

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating? cs.SE · 2026-04-19 · unverdicted · none · ref 4 · 2 links
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 233
