The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.
Title resolution pending
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.SE 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
The Precise Debugging Benchmark reveals that frontier LLMs achieve over 76% unit-test pass rates but below 45% edit precision when debugging, often regenerating rather than making minimal fixes.