Recognition: unknown
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Pith reviewed 2026-05-10 06:02 UTC · model grok-4.3
The pith
Frontier LLMs pass unit tests on debugging tasks at rates above 76 percent yet edit with precision below 45 percent, often regenerating entire solutions instead of making targeted fixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking achieve unit-test pass rates above 76 percent but exhibit precision below 45 percent, even when explicitly instructed to perform minimal debugging. The Precise Debugging Benchmark framework automatically converts coding datasets into debugging benchmarks by synthesizing verified atomic bugs and composing them into multi-bug programs, then evaluates outputs with edit-level precision and bug-level recall metrics.
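The synthesis procedure itself is only summarized here. A rough sketch of the verification step, assuming a hypothetical passes_tests harness and LLM-proposed mutation functions (neither is the authors' implementation):

```python
from typing import Callable

def is_verified_atomic_bug(correct: str, mutated: str,
                           passes_tests: Callable[[str], bool]) -> bool:
    """Keep a candidate mutation only if the reference solution passes the
    suite, the mutation actually changes the code, and the change breaks
    at least one test (so the bug is observable)."""
    return (passes_tests(correct)
            and mutated != correct
            and not passes_tests(mutated))

def compose_multi_bug(correct: str,
                      mutations: list[Callable[[str], str]],
                      passes_tests: Callable[[str], bool]) -> str:
    """One plausible composition rule: apply atomic mutations in sequence,
    keeping each one only while the program still fails the tests."""
    program = correct
    for mutate in mutations:
        candidate = mutate(program)
        if candidate != program and not passes_tests(candidate):
            program = candidate
    return program
```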
What carries the argument
The Precise Debugging Benchmark (PDB) framework, which generates buggy programs from verified atomic bugs and evaluates models with edit-level precision (the fraction of the model's edits that are actually necessary) and bug-level recall (the fraction of injected bugs that are resolved).
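The paper's formal metric definitions are not reproduced on this page. A minimal line-level reading of the two metrics, built on Python's standard difflib (the set-of-edits formulation is an assumption, not the authors' exact definition):

```python
import difflib

def line_edits(before: str, after: str) -> set[tuple[str, str]]:
    """All line-level edit operations (op, line) that turn `before` into `after`."""
    edits = set()
    sm = difflib.SequenceMatcher(a=before.splitlines(), b=after.splitlines())
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        edits.update(("delete", line) for line in sm.a[i1:i2])
        edits.update(("insert", line) for line in sm.b[j1:j2])
    return edits

def edit_level_precision(buggy: str, model_output: str, minimal_fix: str) -> float:
    """Fraction of the model's edits that also occur in the minimal ground-truth fix."""
    model = line_edits(buggy, model_output)
    necessary = line_edits(buggy, minimal_fix)
    return len(model & necessary) / len(model) if model else 0.0

def bug_level_recall(bugs_resolved: int, bugs_injected: int) -> float:
    """Fraction of injected bugs that the model's output actually resolves."""
    return bugs_resolved / bugs_injected if bugs_injected else 0.0
```

On this reading, a full regeneration that happens to include the necessary lines still scores low precision, since its many extra edits dilute the intersection.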
If this is right
- Unit-test pass rates alone fail to distinguish precise debugging from full regeneration of a correct program (a toy illustration follows this list).
- Explicit instructions to minimize edits do not raise precision above 45 percent on either single-bug or multi-bug tasks.
- Iterative and agentic debugging workflows produce no meaningful improvement in either precision or recall.
- Post-training pipelines for coding models must be redesigned to favor minimal localized edits over complete rewrites.
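A toy illustration of the first point above: a one-character fix and a full rewrite both pass the lone test, so pass rate cannot separate them and only an edit-aware measure can. The add function and the similarity measure here are invented for illustration:

```python
import difflib

buggy       = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"
rewrite     = "def add(x, y):\n    result = x + y\n    return result\n"

# Both candidates pass the only unit test, so pass rate is identical.
for candidate in (minimal_fix, rewrite):
    namespace = {}
    exec(candidate, namespace)
    assert namespace["add"](2, 3) == 5

# An edit-aware measure separates them: fraction of the buggy source left intact.
for name, candidate in (("minimal fix", minimal_fix), ("rewrite", rewrite)):
    kept = difflib.SequenceMatcher(a=buggy, b=candidate).ratio()
    print(f"{name}: unchanged fraction {kept:.2f}")  # high for the fix, low for the rewrite
```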
Where Pith is reading between the lines
- Training objectives that optimize only for test outcomes may be systematically rewarding any passing code rather than rewarding small correct changes.
- Developers who rely on these models for bug repair will frequently receive large diffs containing many unnecessary modifications.
- Future model training could incorporate human minimal-edit traces or explicit edit-distance penalties to close the observed precision gap.
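The edit-distance-penalty idea in the last point can be made concrete with a reward-shaping sketch. The penalty weight lam and the character-level distance are illustrative choices, not anything from the paper:

```python
import difflib

def debug_reward(buggy: str, candidate: str, tests_pass: bool,
                 lam: float = 0.5) -> float:
    """Reward = test outcome minus an explicit edit-distance penalty.

    The penalty is the fraction of `buggy` that the candidate rewrites:
    0.0 for an untouched program, approaching 1.0 for a full regeneration.
    """
    rewritten = 1.0 - difflib.SequenceMatcher(a=buggy, b=candidate).ratio()
    return (1.0 if tests_pass else 0.0) - lam * rewritten
```

With lam near 1, a passing full rewrite earns almost nothing, so an objective of this shape stops rewarding regeneration.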
Load-bearing premise
The automatically synthesized verified atomic bugs and their compositions accurately represent real-world debugging scenarios, and the new edit-level precision and bug-level recall metrics validly measure precise debugging behavior.
What would settle it
A direct comparison of model behavior on the PDB benchmarks versus a large collection of human-written buggy programs that come with known minimal correct fixes, testing whether the low-precision pattern disappears on the human bugs.
Figures
Figure 11: Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don't affect correctness. Bug injection: gpt-5.1-codex; debug model: gemini-2.5-pro.
Figure 12: Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug. Task: draw a circle with a given radius on the polar chart 'ax' and set radial ticks. Bug injection: claude-sonnet-4.5; debug model: Kimi-K2-Thinking.
Figure 13: Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes. Task: organize files in a directory based on the first text that is not enclosed in square brackets. Bug injection: claude-sonnet-4.5; debug model: Qwen3-Coder-480B-A35B-Instruct-FP8.
Results table (fragment; precision / recall / unit-test pass rate, %):
Claude-Sonnet-4.5: 78.1±0.7 / 85.7±0.6 / 81.9±0.9
Gemini-2.5-Pro: 77.9±0.7 / 87.5±0.6 / 83.8±0.8
Qwen3-Coder-480B: 73.5±0.8 / 82.4±0.7 / 77.4±0.9
Original abstract
Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Precise Debugging Benchmark (PDB) framework that automatically converts coding datasets into debugging tasks by synthesizing verified atomic bugs and composing them into multi-bug programs. It defines edit-level precision and bug-level recall metrics to distinguish precise, minimal debugging from code regeneration. Two benchmarks (PDB-Single-Hard for single-line bugs and PDB-Multi for multi-line bugs) are released, and experiments show frontier models (e.g., GPT-5.1-Codex, DeepSeek-V3.2-Thinking) achieve >76% unit-test pass rates but <45% precision even when instructed to debug minimally; iterative and agentic strategies yield little improvement.
Significance. If the synthetic bugs and compositions are shown to be representative of real debugging edit distributions, the results would be significant for highlighting a systematic tendency in current coding LLMs to over-edit or regenerate rather than apply targeted fixes. The new precision-aware metrics and open benchmarks provide concrete tools for future evaluation and could inform post-training improvements in software engineering applications.
major comments (1)
- [PDB framework and benchmark generation] Benchmark construction (PDB synthesis process): The headline finding that high pass rates reflect regeneration rather than debugging is load-bearing on the assumption that the automatically synthesized atomic bugs and their compositions produce minimal-edit solutions whose statistical structure matches human debugging scenarios. No quantitative comparison (e.g., edit-distance histograms, bug-interaction graphs, or direct matching to a human-curated corpus) is reported, so the low precision (<45%) could be an artifact of localized, non-interacting synthetic bugs rather than a general model property.
minor comments (1)
- [Abstract] Abstract: The reported thresholds ('above 76%' pass rate, 'below 45%' precision) are given without the underlying sample sizes, exact model versions, or dataset sources, making it harder to assess robustness or reproduce the quantitative claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the PDB framework and benchmark generation below.
Point-by-point responses
- Referee: The headline finding that high pass rates reflect regeneration rather than debugging is load-bearing on the assumption that the automatically synthesized atomic bugs and their compositions produce minimal-edit solutions whose statistical structure matches human debugging scenarios. No quantitative comparison (e.g., edit-distance histograms, bug-interaction graphs, or direct matching to a human-curated corpus) is reported, so the low precision (<45%) could be an artifact of localized, non-interacting synthetic bugs rather than a general model property.
Authors: We acknowledge the referee's concern that the headline result relies on the synthetic bugs producing minimal-edit ground truths representative of real debugging. By construction, each atomic bug in PDB is the smallest change that causes test failure (verified by re-running tests after the edit), and multi-bug programs are formed by controlled composition of such atoms. This design provides explicit minimal-edit oracles for the precision metric. We did not report quantitative comparisons (edit-distance histograms or matches to human corpora) in the submitted manuscript. While we maintain that the observed low precision demonstrates a regeneration bias even when minimal fixes are feasible, we agree a direct distributional comparison would better address potential artifacts from synthetic localization. In the revision we will add a dedicated analysis subsection with edit-distance histograms drawn from PDB alongside those from publicly available human fix collections (e.g., Defects4J and ManyBugs) and will discuss observed differences in interaction complexity.
Revision: yes
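The distributional check promised here could look roughly like the following. A minimal sketch, assuming each corpus is loaded as (buggy, fixed) source pairs; the loader names in the comment are placeholders, and this is not the authors' planned analysis:

```python
import difflib
from collections import Counter

def fix_size(buggy: str, fixed: str) -> float:
    """Normalized fix size: 0.0 = identical programs, 1.0 = complete rewrite."""
    return 1.0 - difflib.SequenceMatcher(a=buggy, b=fixed).ratio()

def edit_distance_histogram(pairs: list[tuple[str, str]], bins: int = 10) -> Counter:
    """Bucket normalized fix sizes so two corpora can be compared side by side."""
    hist = Counter()
    for buggy, fixed in pairs:
        hist[min(int(fix_size(buggy, fixed) * bins), bins - 1)] += 1
    return hist

# e.g. compare synthetic PDB fixes against human fixes from Defects4J:
# edit_distance_histogram(pdb_pairs) vs. edit_distance_histogram(defects4j_pairs)
```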
Circularity Check
No circularity: new benchmark construction with external model evaluation
Full rationale
The paper's chain consists of (1) defining a synthesis procedure for atomic bugs and their compositions to create PDB-Single-Hard and PDB-Multi, (2) introducing edit-level precision and bug-level recall as new metrics, and (3) running empirical evaluations of external frontier models (GPT-5.1-Codex, DeepSeek-V3.2-Thinking, etc.) on unit-test pass rates versus the new metrics. None of these steps reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central observation (high pass-rate, low precision) is a direct empirical measurement on held-out models and does not presuppose its own conclusion. The mapping from synthetic bugs to real debugging is an external validity assumption, not a circular derivation internal to the paper's equations or citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: synthesized verified atomic bugs accurately simulate real debugging scenarios when composed into multi-bug programs.