The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.
arXiv preprint arXiv:2401.15963
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.SE (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
citing papers explorer
- Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code