Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Honghua Dong; Miaosen Chai; Robin Jia; Shangshang Wang; Song Bian; Wang Bill Zhu; Willie Neiswanger; Yejia Liu

arxiv: 2604.17338 · v4 · pith:P3SW5TQDnew · submitted 2026-04-19 · 💻 cs.SE · cs.CL

Wang Bill Zhu, Miaosen Chai, Shangshang Wang, Yejia Liu, Song Bian, Honghua Dong, Willie Neiswanger, Robin Jia

Pith reviewed 2026-05-20 23:57 UTC · model grok-4.3

keywords LLM code debugging · precise code editing · bug synthesis benchmark · edit precision metric · fault localization · regeneration vs editing · software engineering evaluation

The pith

Frontier LLMs achieve high test-pass rates on debugging but low edit precision because they regenerate rather than minimally fix code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Precise Debugging Benchmark (PDB) to test whether LLMs localize faults and apply only the necessary edits or simply overwrite buggy code with a fresh solution. It synthesizes atomic bugs, composes them into faulty programs with single-line and multi-line bugs, and then scores models on edit-level precision (what share of a model's edits were actually necessary) and bug-level recall (what share of the injected bugs it eliminates). Even when told to edit minimally, models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking reach unit-test pass rates above 76 percent yet stay below 45 percent precision. Iterative and agentic strategies do not substantially improve precision either. The result shows that high functional correctness can hide a lack of precise debugging behavior.

Core claim

Frontier models regenerate correct but over-edited solutions during debugging tasks; the PDB framework converts existing coding datasets into precision-aware benchmarks by injecting verified atomic bugs, and the resulting metrics reveal that unit-test success above 76 percent coincides with edit precision below 45 percent even under explicit minimal-edit instructions.

What carries the argument

PDB framework that synthesizes verified atomic bugs and composes them into multi-bug programs, paired with edit-level precision and bug-level recall metrics that count only the necessary changes.
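To make the two metrics concrete, here is a minimal sketch. It assumes that edits are represented as sets of changed line numbers and that a bug counts as resolved once all of its ground-truth lines have been edited. The representation and the names `edit_precision` and `bug_recall` are illustrative assumptions, not the paper's implementation, whose matching procedure may differ.

```python
def edit_precision(model_edits: set[int], necessary_edits: set[int]) -> float:
    """Fraction of the model's edited lines that were actually necessary."""
    if not model_edits:
        return 0.0
    return len(model_edits & necessary_edits) / len(model_edits)


def bug_recall(model_edits: set[int], bugs: list[set[int]]) -> float:
    """Fraction of injected bugs whose lines were all edited by the model.

    Simplifying assumption: touching every line of a bug resolves it.
    """
    if not bugs:
        return 0.0
    resolved = sum(1 for bug_lines in bugs if bug_lines <= model_edits)
    return resolved / len(bugs)


# Hypothetical 10-line program with two injected single-line bugs (lines 3 and 7).
bugs = [{3}, {7}]
necessary = set().union(*bugs)

targeted_fix = {3, 7}              # edits only the buggy lines
regeneration = set(range(1, 11))   # rewrites every line

print(edit_precision(targeted_fix, necessary), bug_recall(targeted_fix, bugs))
print(edit_precision(regeneration, necessary), bug_recall(regeneration, bugs))
```

Under these assumptions both strategies resolve every bug, but only the targeted fix keeps precision high; the full rewrite spends most of its edits on lines that were already correct.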

If this is right

Post-training pipelines for coding models must be redesigned to reward localized, minimal edits rather than full regeneration.
Iterative or agentic loops do not automatically produce more precise debugging behavior.
Benchmarks that only measure final test-pass rates will continue to overestimate debugging skill.
Single-line and multi-line bug variants expose different precision gaps that unit tests alone miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Training objectives that penalize unnecessary token changes could directly improve the observed precision gap.
The same synthesis method could be applied to other languages or domains to test whether the regeneration pattern is universal.
Real bug-fix datasets from open-source repositories could serve as an external check on whether synthetic atomic bugs capture typical fault distributions.

Load-bearing premise

The atomic bugs created by the synthesis process behave like the faults that appear in actual developer workflows.

What would settle it

A direct measurement on real-world bug reports showing that models achieve edit precision above 60 percent when explicitly prompted for minimal changes.

Figures

Figures reproduced from arXiv: 2604.17338 by Honghua Dong, Miaosen Chai, Robin Jia, Shangshang Wang, Song Bian, Wang Bill Zhu, Willie Neiswanger, Yejia Liu.

**Figure 1.** Real example from GPT-5.2 debugging a binary search program, where the model rewrites the entire solution. Green lines mark precise edits; gray lines highlight over-edits.

**Figure 2.** PDB pipeline. Generation: LLMs first synthesize and verify single-line bugs from existing coding datasets, which are then composed into multi-bug programs. Evaluation: Automated debugging systems are evaluated on these programs using unit-test accuracy, edit-level precision, and bug-level recall.

**Figure 3.** Data distribution of PDB-SINGLE-HARD.

**Figure 4.** Correlation between precision, recall, and unit-test score across bug counts. Results are shown on subsets […]

**Figure 5.** Both iterative and agentic setups on PDB-SINGLE-HARD improve unit-test pass rates and recall over single-shot debugging, indicating higher functional success. However, edit-level precision does not improve and sometimes degrades. Notably, even Claude-Code with access to unit-test and execution feedback exhibits only 50% precision.

**Figure 6.** Comparison of model performance under minimal-debug and freeform prompting on a subset of […]

**Figure 7.** Recall distribution over bug categories.

**Figure 8.** Model breakdown performance on PDB-SINGLE-HARD rewriting with the same generator, or a different generator.

**Figure 9.** Model averaged performance on PDB-SINGLE-HARD over distribution of buggy code length. All metrics show a similar performance drop.

**Figure 10.** Correlation between precision, recall, and unit-test score across bug counts. Results are shown on subsets […]

**Figure 11.** Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don't affect correctness.

**Figure 12.** Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug.

**Figure 13.** Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes.

**Figure 14.** Discovering bugs missed by ground-truth (1.9%): The model identifies and fixes bugs that were […]

**Figure 15.** Functionally correct but undetected (70% of recall […]

**Figure 16.** Multiple minimal fixes (20% of recall < 1 cases): A single bug can have multiple minimal correct fixes, and the model chose a different valid fix than the ground-truth.

**Figure 17.** Bug composition issue (10% of recall < 1 cases): Compounding bugs introduced during the bug-composition stage, where one injected bug changes program logic affecting other bugs.

**Figure 18.** Under-repair (31.4%): The model fixes some bugs without introducing unnecessary edits but fails to […]

**Figure 19.** Regressive repair (39.2%): The model fixes all original bugs (recall […]

**Figure 20.** Bug injection prompt for benchmark construction.

**Figure 21.** Minimal debugging prompt with problem description and buggy code.

**Figure 22.** Minimal debugging prompt with unit tests.

**Figure 23.** Minimal debugging prompt with execution feedback.

**Figure 24.** Minimal debugging prompt with unit tests and execution feedback.

**Figure 25.** Free-form debugging prompt without minimal edit constraint.

**Figure 26.** Free-form debugging prompt with unit tests.

**Figure 27.** Free-form debugging prompt with execution feedback.

**Figure 28.** Free-form debugging prompt with unit tests and execution feedback.

**Figure 29.** External API template for minimal debugging.

**Figure 30.** External API template for free-form debugging.

**Figure 31.** Solution rewriting prompt for benchmark construction.

Original abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows frontier coding models mostly regenerate solutions rather than making minimal targeted fixes, and it supplies a framework plus metrics to measure the difference.

The main takeaway is that LLMs often pass unit tests on debugging tasks by rewriting large chunks of code instead of applying precise edits. The work introduces the Precise Debugging Benchmark to turn ordinary coding datasets into tests that track this behavior more closely than standard correctness checks do. The authors synthesize verified atomic bugs, insert them into clean programs, and compose multi-bug versions for harder cases. Two new metrics follow: edit-level precision counts only the changes that match the inserted faults, while bug-level recall tracks how many bugs actually disappear. The authors release PDB-Single-Hard and PDB-Multi and run frontier models including GPT-5.1-Codex and DeepSeek-V3.2-Thinking, which clear 76 percent unit-test pass rates but stay below 45 percent precision even when told to keep edits small. Iterative and agentic strategies also fail to lift those numbers much.

This separation of passing tests from doing minimal work is a useful observation for anyone training or evaluating code models. The synthetic bug construction is described clearly and the metrics are defined without circularity or free parameters. The main limitation is that the atomic bugs are clean and localized by design. Real developer faults often involve semantic ambiguity, cross-file effects, or multiple valid repairs, so the precision scores could look worse than they would on natural data. The abstract gives no direct comparison to existing bug corpora, which leaves that assumption untested for now.

Researchers working on post-training for coding assistants or on evaluation methods that care about edit efficiency will get the most from this. The central claim is internally consistent and the experimental gap is large enough to matter, so the paper deserves referee time to check the benchmark details and the representativeness question.
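The composition step mentioned in the letter can be pictured with a small sketch. It assumes that each candidate bug is identified by the line it modifies and that selected bugs must sit a minimum number of lines apart so that they stay independent. The function name `sample_bug_lines`, the `stride` parameter, and the rejection-sampling loop are illustrative choices, not the authors' code.

```python
import random


def sample_bug_lines(candidate_lines: list[int], k: int, stride: int,
                     rng: random.Random, max_tries: int = 1000) -> list[int]:
    """Pick k candidate lines that are pairwise at least `stride` lines apart."""
    for _ in range(max_tries):
        picked = sorted(rng.sample(candidate_lines, k))
        # After sorting, checking neighbours is enough to cover every pair.
        if all(b - a >= stride for a, b in zip(picked, picked[1:])):
            return picked
    raise ValueError("no valid combination found; lower k or stride")


# Hypothetical 40-line program in which every line has a verified atomic bug.
rng = random.Random(0)
picked = sample_bug_lines(list(range(1, 41)), k=3, stride=5, rng=rng)
print(picked)
```

Rejection sampling keeps the sketch short; a real pipeline might prefer a constructive sampler when `k` is large relative to the program length.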

Referee Report

3 major / 1 minor

Summary. The paper introduces the Precise Debugging Benchmark (PDB) framework, which automatically converts coding datasets into debugging benchmarks by synthesizing verified atomic bugs and composing them into multi-bug programs. It defines edit-level precision (measuring necessary targeted edits) and bug-level recall metrics to distinguish precise debugging from over-editing or regeneration. Experiments on PDB-Single-Hard and PDB-Multi show frontier models (e.g., GPT-5.1-Codex, DeepSeek-V3.2-Thinking) achieving >76% unit-test pass rates but <45% precision, even under minimal-debugging instructions; iterative and agentic strategies yield no substantial gains.

Significance. If the synthetic bugs prove representative of real faults, the work identifies a key gap in LLM post-training for coding: models solve tests via regeneration rather than localized fixes. The release of two concrete benchmarks (PDB-Single-Hard, PDB-Multi) and the precision/recall metrics supplies a reproducible evaluation resource that the community can use to measure and improve targeted editing behavior.

major comments (3)

[Abstract] Abstract (PDB generation paragraph): the claim that synthesized atomic bugs and their compositions serve as faithful proxies for real developer debugging workflows is load-bearing for interpreting low edit-level precision as evidence of 'regeneration' rather than debugging, yet the manuscript supplies no comparison against established real-world bug corpora (e.g., Defects4J or QuixBugs).
[Abstract] Abstract (bug synthesis description): verification procedures for the atomic bugs (e.g., how each insertion is confirmed to be a genuine fault that requires a specific edit rather than an arbitrary change) are not described, which directly affects the soundness of the edit-level precision metric.
[Abstract] Abstract (experimental results): the reported precision gap (<45%) lacks any mention of statistical significance testing, confidence intervals, or variance across runs, making it difficult to assess whether the difference from unit-test pass rates is robust.

minor comments (1)

Clarify in the metric definitions whether edit-level precision counts only exact string matches to the inserted bug locations or allows semantically equivalent but differently located fixes.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below in a point-by-point fashion and indicate the changes planned for the revised manuscript.

Referee: [Abstract] Abstract (PDB generation paragraph): the claim that synthesized atomic bugs and their compositions serve as faithful proxies for real developer debugging workflows is load-bearing for interpreting low edit-level precision as evidence of 'regeneration' rather than debugging, yet the manuscript supplies no comparison against established real-world bug corpora (e.g., Defects4J or QuixBugs).

Authors: We agree that validating the representativeness of the synthetic bugs against real-world corpora strengthens the interpretation of the precision results. The synthesis approach was chosen to enable atomic, verifiable bugs that support the edit-level precision metric, which is difficult to obtain from existing corpora. In the revision we will add a dedicated limitations and validation subsection that compares bug characteristics (e.g., edit span, failure mode distribution) with Defects4J and QuixBugs using publicly available metadata, while noting that a full end-to-end evaluation on real bugs remains future work.
Revision: partial

Referee: [Abstract] Abstract (bug synthesis description): verification procedures for the atomic bugs (e.g., how each insertion is confirmed to be a genuine fault that requires a specific edit rather than an arbitrary change) are not described, which directly affects the soundness of the edit-level precision metric.

Authors: The verification procedure is described in Section 3.2 of the full manuscript: insert a candidate edit, then confirm that the original program passes the tests, that the modified program fails, and that the minimal edit restores passage. To address the abstract-level concern we will add a concise sentence in the abstract summarizing the verification step.
Revision: yes

Referee: [Abstract] Abstract (experimental results): the reported precision gap (<45%) lacks any mention of statistical significance testing, confidence intervals, or variance across runs, making it difficult to assess whether the difference from unit-test pass rates is robust.

Authors: We concur that statistical details improve interpretability. The revised manuscript will report 95% confidence intervals and standard deviations across five independent runs for both unit-test pass rate and edit precision in the main results tables and will briefly note this in the abstract.
Revision: yes
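As a rough illustration of the statistics promised in the last response, the sketch below computes a mean, a sample standard deviation, and a 95% confidence-interval half-width from five runs using Student's t distribution. The scores are made-up placeholders, not results from the paper, and the helper name `summarize` is hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (n = 5).
T_CRIT_DF4 = 2.776


def summarize(scores: list[float]) -> tuple[float, float, float]:
    """Return (mean, sample standard deviation, 95% CI half-width) for five runs."""
    if len(scores) != 5:
        raise ValueError("T_CRIT_DF4 is only valid for exactly five runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (n - 1 denominator)
    half_width = T_CRIT_DF4 * std / math.sqrt(len(scores))
    return mean, std, half_width


# Made-up precision scores (%) from five hypothetical runs, for illustration only.
runs = [44.1, 43.5, 45.0, 44.4, 43.0]
mean, std, half_width = summarize(runs)
print(f"precision: {mean:.1f} ± {half_width:.1f} (std {std:.2f})")
```

With only five runs the t critical value is noticeably larger than the normal-approximation value of 1.96, so the interval is wider than a z-based one would suggest.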

Circularity Check

0 steps flagged

No significant circularity in derivation or metrics

The paper constructs PDB by synthesizing verified atomic bugs into programs and directly defines edit-level precision and bug-level recall from counts of matching edits and resolved bugs. These are definitional choices for the benchmark rather than fitted parameters or equations that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the abstract or described framework. Results are empirical measurements of model behavior on the self-generated benchmark, making the central claims self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the reliability of automatically synthesized and verified atomic bugs as proxies for real debugging tasks; no free parameters or invented physical entities are introduced.

axioms (1)

Domain assumption: Synthesized atomic bugs can be verified as causing the observed failure and remain minimal.
The PDB generation process depends on this property to create controlled debugging instances.
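One way to read this assumption is as a three-part check, following the procedure outlined in the simulated rebuttal: the clean program passes its tests, the program with the injected edit fails, and undoing the edit restores passage. The sketch below is a toy version in which a "program" is a single Python function; the names `passes` and `is_verified_bug` and the example functions are hypothetical, not the benchmark's code.

```python
from typing import Callable

Program = Callable[[int], int]        # toy stand-in: a "program" is one function
TestSuite = list[tuple[int, int]]     # (input, expected output) pairs


def passes(program: Program, tests: TestSuite) -> bool:
    """Return True if the program gives the expected output on every test."""
    try:
        return all(program(x) == expected for x, expected in tests)
    except Exception:
        return False  # a crash counts as a failed run


def is_verified_bug(original: Program, buggy: Program, reverted: Program,
                    tests: TestSuite) -> bool:
    """Accept a candidate bug only if all three checks hold."""
    return (
        passes(original, tests)         # 1. the clean program passes
        and not passes(buggy, tests)    # 2. the injected edit breaks it
        and passes(reverted, tests)     # 3. undoing the edit restores passage
    )


def double(x: int) -> int:       # clean reference program
    return x * 2


def off_by_one(x: int) -> int:   # candidate bug that changes behavior
    return x * 2 + 1


def harmless(x: int) -> int:     # candidate edit that is semantically equivalent
    return x + x


tests = [(0, 0), (1, 2), (3, 6)]
print(is_verified_bug(double, off_by_one, double, tests))  # True
print(is_verified_bug(double, harmless, double, tests))    # False
```

The second call shows why the check matters for the precision metric: an edit that does not change behavior is rejected, so every accepted bug corresponds to an edit that a debugger genuinely has to make.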

invented entities (1)

edit-level precision metric (no independent evidence)
purpose: Quantifies the fraction of necessary edits performed by the model.
Newly defined evaluation measure introduced to distinguish precise debugging from regeneration.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall"

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · 2 internal anchors

[1]

Robert L Glass

A study on robustness and reliability of large language model code generation.arXiv preprint arXiv:2308.13888. Robert L Glass. 2002.Facts and fallacies of software engineering. Addison-Wesley Professional. Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, and 1 others. 2024. Codeedi- torbe...

work page arXiv 2002
[2]

Measuring Coding Challenge Competence With APPS

Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938. Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, and Libo Qin. 2025. Mldebugging: Towards benchmarking code debugging across multi-library scenarios.arXiv preprint arXiv:2506.13824. Binyuan Hui, Jian Yang, Zeyu Cui, Jia...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

StarCoder: may the source be with you!

Spoc: Search-based pseudocode to code.Ad- vances in Neural Information Processing Systems, 32. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, and 1 others. 2023. Starcoder: may the source be with you!arXiv preprint arXiv:2305.06161. Yujia Li, David Choi, Juny...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8647–8657

Debugbench: Evaluating debugging capability of large language models. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8647–8657. Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. No- funeval: Funny how code lms falter on require- ments beyond functional correctness.arX...

work page arXiv 2024
[5]

Yixuan Zhu, Zhitong Zeng, Zhaoxue Liu, Yixing Feng, Yuming Sun, Zhaoyang Chen, Yiling Liu, and Haoyu Wang

Vibe checker: Aligning code evaluation with human preference.arXiv preprint arXiv:2510.07315. Yixuan Zhu, Zhitong Zeng, Zhaoxue Liu, Yixing Feng, Yuming Sun, Zhaoyang Chen, Yiling Liu, and Haoyu Wang. 2024. Livecodebench: Holistic and contami- nation free evaluation of large language models for code. InProceedings of the 12th International Con- ference on...

work page arXiv 2024
[6]

demonstrates autonomous repair capabilities, and COAST (Yang et al., 2024b) enhances debug- ging through communicative agent-based data syn- thesis. These approaches utilize techniques ranging from zero-shot prompting to multi-turn conversa- tional agents (Chen et al., 2023; Fan and Xia, 2024; Xia and Zhang, 2023; Zhong et al., 2024; Islam et al., 2024). ...

work page 2023
[7]

examine automatic bug fixing performance on existing datasets. A common limitation of these benchmarks, however, is their reliance on a sim- ple, binary pass/fail metric on test cases (Zhang Model Precision Recall Unit (%) Claude-Sonnet-4.578.1±0.785.7±0.6 81.9±0.9 Gemini-2.5-Pro 77.9±0.787.5±0.683.8±0.8 Qwen3-Coder-480B 73.5±0.8 82.4±0.7 77.4±0.9 Kimi-K2...

work page 2024
[10]

config.read(config_file_path) 8

work page
[11]

project_dir = config.get('Project', 'directory') 10

work page
[13]

raise FileNotFoundError(f'Directory {project_dir} does not exist.') 13

work page
[14]

archive_file = f'{archieve_dir}/{project_dir}.z ip' A (Modify) 15

work page
[16]

shutil.make_archive(base_name=os.path.splitext(archive_fil e)[0], format='zip', root_dir=project_dir) 18

work page
[18]

Failed to create archive {archive_fi le}

raise Exception(f"Failed to create archive {archive_fi le}") 21

work page
[19]

return True Add: Line added to inject bug Delete: Line removed to inject bug Modify: Line changed to inject bug Bug Injection: gpt-5.1-codex Model Generated Solution

work page
[20]

def task_func(config_file_path, archieve_dir ='/home/user/arch ive'): 5

work page
[21]

if not os.path.isfile(config_file_path):

work page
[22]

Config file {config_file_pat h} does not exist

raise FileNotFoundError(f"Config file {config_file_pat h} does not exist.") 8

work page
[23]

config = configparser.ConfigParser()

work page
[24]

config.read(config_file_path) 11

work page
[25]

project_dir = config.get('Project', 'directory') 13

work page
[26]

if not os.path.isdir(project_dir):

work page
[27]

raise FileNotFoundError(f'Directory {project_dir} does not exist.') 16

work page
[28]

archive_file = f'{archieve_dir}/{os.path.basename(proj ect_dir)}.zip' A 18

work page
[29]

# Using shutil to create the zip archive

work page
[30]

shutil.make_archive(base_name=os.path.splitext(archive_fil e)[0], format='zip', root_dir=project_dir) 21

work page
[31]

if not os.path.isfile(archive_file):

work page
[32]

Failed to create archive {archive_fi le}

raise Exception(f"Failed to create archive {archive_fi le}") 24

work page
[33]

📋Task Description Draw a circle with a given radius on the polar chart 'ax' and set radial ticks

return True Fixed bug Extra editIntroduced bug Debug Model: gemini-2.5-pro Figure 11: Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don’t affect correctness. 📋Task Description Draw a circle with a given radius on the polar chart 'ax' and set radial ticks. This function manipulates plot data using matplotlib. Note: If the ...

work page
[37]

if not isinstance(ax, plt.PolarAxes):

work page
[41]

ax.set_rlabel_position(radius * 45)

work page
[42]

return ax Add: Line added to inject bug Delete: Line removed to inject bug Modify: Line changed to inject bug Bug Injection: claude-sonnet-4.5 Model Generated Solution

work page
[44]

def task_func(ax, radius): 4

work page
[45]

raise ValueError('Radius must be non-negative')

work page
[46]

if ax.name != 'polar':

work page
[47]

raise TypeError('ax must be a polar plot') 9

work page
[48]

theta = np.linspace(0, 2 * np.pi, 1000)

work page
[49]

ax.plot(theta, radius * np.ones_like(theta))

work page
[50]

ax.set_rlabel_position(45) — ax.draw()

work page
[51]

📋Task Description Organize files in a directory based on the first text that is not enclosed in square brackets

return ax Fixed bug Extra editIntroduced bug Debug Model: Kimi-K2-Thinking Figure 12: Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug. 📋Task Description Organize files in a directory based on the first text that is not enclosed in square brackets. Move the files to su...

work page
[56]

with open(os.path.join(directory, filename), 'r') as f ile:

work page
[58]

match = re.search('(.*?)\[.*?\]', content)

work page
[59]

subdirectory = match.group(1).strip() 15

work page
[61]

os.makedirs(os.path.join(directory, subdir ectory)) 18

work page
[62]

{filename.split('.')[0]}_{dat etime.now().strftime(DATE_FORMAT)}.{filename.split('.')[1]}

new_filename = f"{filename.split('.')[0]}_{dat etime.now().strftime(DATE_FORMAT)}.{filename.split('.')[1]}"

work page
[63]

shutil.move(os.path.join(directory, filename), os.path.join(directory, subdirectory, new_filename)) 21

work page
[66]

moved_files[subdirectory].append(new_filename) 25

work page
[67]

return directory.resolve(), moved_files A (Modify) Add: Line added to inject bug Delete: Line removed to inject bug Modify: Line changed to inject bug Bug Injection: claude-sonnet-4.5 Model Generated Solution

work page
[68]

from datetime import datetime

work page
[69]

def task_func(directory): 6

work page
[70]

DATE_FORMAT = '%Y%m%d%H%M%S'

work page
[71]

for filename in os.listdir(directory):

work page
[72]

filepath = os.path.join(directory, filename)

work page
[73]

if os.path.isfile(filepath):

work page
[74]

with open(filepath, 'r') as file:

work page
[75]

content = file.read()

work page
[76]

match = re.search(r'^(.*?)\[(.*?)\](.*)$', con tent, re.DOTALL)

work page
[77]

match = re.search(r'\[(.*?)\](.*)$', conte nt, re.DOTALL)

work page
[78]

before_brackets = content[:match.start ()]

work page
[79]

first_non_bracket_text = before_bracke ts.split()[-1] if before_brackets.split() else ""

work page
[80]

subdirectory = first_non_bracket_text

work page
[81]

before_first_bracket = match.group(1)

work page
[82]

parts = before_first_bracket.strip().split ()

work page
[83]

subdirectory = parts[-1] if parts else "" 27

work page
[84]

if not subdirectory:

work page
[85]

if not os.path.exists(os.path.join(directory, subdirectory)):

work page
[86]

os.makedirs(os.path.join(directory, subdir ectory)) 33

work page
[87]

name_part, ext_part = os.path.splitext(filenam e)

work page
[88]

{name_part}_{datetime.now().s trftime(DATE_FORMAT)}{ext_part}

new_filename = f"{name_part}_{datetime.now().s trftime(DATE_FORMAT)}{ext_part}"

work page
[89]

shutil.move(filepath, os.path.join(directory, subdirectory, new_filename)) 37

work page
[90]

if subdirectory not in moved_files:

work page
[91]

moved_files[subdirectory] = []

work page
[92]

moved_files[subdirectory].append(new_filename) 41

work page
[93]

return os.path.abspath(directory), moved_files A Fixed bug Extra editIntroduced bug Debug Model: Qwen3-Coder-480B-A35B-Instruct-FP8 Figure 13: Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes. 📋Task Description Unzip a list of objects and their 3D coordinates, run PCA to reduce the dimensiona...

if plot_path is not None == True: (bug A, Modify)
return coordinates_2d

Legend: Add: line added to inject bug; Delete: line removed to inject bug; Modify: line changed to inject bug. Bug Injection: claude-sonnet-4.5

Model Generated Solution

from sklearn.decomposition import PCA
def task_func(data, save_plot=False, plot_path=None):
items, x_values, y_values, z_values = zip(*data)
coordinates = np.array(list(zip(x_values, y_values, z_values)))
pca = PCA(n_components=2)
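The "complete rewrite" category in Figure 13 describes a model that regenerates the solution instead of making minimal targeted fixes. As a rough illustration of how that kind of over-editing could be quantified, the sketch below diffs a buggy program against a proposed fix with Python's `difflib` and reports the fraction of touched lines that coincide with known bug lines. This is not the paper's metric implementation: the helper names `changed_lines` and `line_edit_precision`, the line-level granularity, and the toy programs are all assumptions made for illustration.

```python
import difflib


def changed_lines(buggy, fixed):
    """Return 0-based indices of lines in `buggy` that the fix touched.

    Replaced and deleted lines are reported by their own index; a pure
    insertion is attributed to the index at which it was inserted.
    """
    buggy_lines = buggy.splitlines()
    fixed_lines = fixed.splitlines()
    matcher = difflib.SequenceMatcher(a=buggy_lines, b=fixed_lines, autojunk=False)
    touched = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            touched.add(i1)
        else:  # "replace" or "delete"
            touched.update(range(i1, i2))
    return touched


def line_edit_precision(buggy, fixed, bug_lines):
    """Fraction of touched lines that are known bug lines (hypothetical metric)."""
    touched = changed_lines(buggy, fixed)
    if not touched:
        return 0.0
    return len(touched & set(bug_lines)) / len(touched)


# Toy programs (illustrative only): the injected bug sits on line index 2.
BUGGY = (
    "def f(directory):\n"
    "    moved = {}\n"
    "    return directory.resolve(), moved\n"
)
MINIMAL_FIX = (
    "def f(directory):\n"
    "    moved = {}\n"
    "    return os.path.abspath(directory), moved\n"
)
REWRITE = (
    "def f(directory):\n"
    "    result = {}\n"
    "    return os.path.abspath(directory), result\n"
)

print(line_edit_precision(BUGGY, MINIMAL_FIX, {2}))  # 1.0
print(line_edit_precision(BUGGY, REWRITE, {2}))      # 0.5
```

Both fixes would pass the same unit tests, yet only the first confines its changes to the bug line, which is the distinction a pass/fail score cannot capture.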

Showing first 80 references.

[1] A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.13888. Robert L Glass. 2002. Facts and fallacies of software engineering. Addison-Wesley Professional. Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, and 1 others. 2024. Codeeditorbe...

[2] Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938. Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, and Libo Qin. 2025. Mldebugging: Towards benchmarking code debugging across multi-library scenarios. arXiv preprint arXiv:2506.13824. Binyuan Hui, Jian Yang, Zeyu Cui, Jia...

[3] Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, and 1 others. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161. Yujia Li, David Choi, Juny...

[4] Debugbench: Evaluating debugging capability of large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8647–8657. Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness. arX...

[5] Vibe checker: Aligning code evaluation with human preference. arXiv preprint arXiv:2510.07315. Yixuan Zhu, Zhitong Zeng, Zhaoxue Liu, Yixing Feng, Yuming Sun, Zhaoyang Chen, Yiling Liu, and Haoyu Wang. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 12th International Conference on...

[6] demonstrates autonomous repair capabilities, and COAST (Yang et al., 2024b) enhances debugging through communicative agent-based data synthesis. These approaches utilize techniques ranging from zero-shot prompting to multi-turn conversational agents (Chen et al., 2023; Fan and Xia, 2024; Xia and Zhang, 2023; Zhong et al., 2024; Islam et al., 2024). ...

[7] examine automatic bug fixing performance on existing datasets. A common limitation of these benchmarks, however, is their reliance on a simple, binary pass/fail metric on test cases (Zhang

Model              Precision   Recall     Unit (%)
Claude-Sonnet-4.5  78.1±0.7    85.7±0.6   81.9±0.9
Gemini-2.5-Pro     77.9±0.7    87.5±0.6   83.8±0.8
Qwen3-Coder-480B   73.5±0.8    82.4±0.7   77.4±0.9
Kimi-K2...

config.read(config_file_path)
project_dir = config.get('Project', 'directory')
raise FileNotFoundError(f'Directory {project_dir} does not exist.')
archive_file = f'{archieve_dir}/{project_dir}.zip' (bug A, Modify)
shutil.make_archive(base_name=os.path.splitext(archive_file)[0], format='zip', root_dir=project_dir)
raise Exception(f"Failed to create archive {archive_file}")
return True

Legend: Add: line added to inject bug; Delete: line removed to inject bug; Modify: line changed to inject bug. Bug Injection: gpt-5.1-codex

Model Generated Solution

def task_func(config_file_path, archieve_dir='/home/user/archive'):
if not os.path.isfile(config_file_path):
raise FileNotFoundError(f"Config file {config_file_path} does not exist.")
config = configparser.ConfigParser()
config.read(config_file_path)
project_dir = config.get('Project', 'directory')
if not os.path.isdir(project_dir):
raise FileNotFoundError(f'Directory {project_dir} does not exist.')
archive_file = f'{archieve_dir}/{os.path.basename(project_dir)}.zip' (bug A)
# Using shutil to create the zip archive
shutil.make_archive(base_name=os.path.splitext(archive_file)[0], format='zip', root_dir=project_dir)
if not os.path.isfile(archive_file):
raise Exception(f"Failed to create archive {archive_file}")
return True

Legend: Fixed bug; Extra edit; Introduced bug. Debug Model: gemini-2.5-pro

Figure 11: Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don’t affect correctness.

📋 Task Description: Draw a circle with a given radius on the polar chart 'ax' and set radial ticks. This function manipulates plot data using matplotlib. Note: If the ...

if not isinstance(ax, plt.PolarAxes):
ax.set_rlabel_position(radius * 45)
return ax

Legend: Add: line added to inject bug; Delete: line removed to inject bug; Modify: line changed to inject bug. Bug Injection: claude-sonnet-4.5

Model Generated Solution

def task_func(ax, radius):
raise ValueError('Radius must be non-negative')
if ax.name != 'polar':
raise TypeError('ax must be a polar plot')
theta = np.linspace(0, 2 * np.pi, 1000)
ax.plot(theta, radius * np.ones_like(theta))
ax.set_rlabel_position(45)
ax.draw()
return ax

Legend: Fixed bug; Extra edit; Introduced bug. Debug Model: Kimi-K2-Thinking

Figure 12: Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug.

📋 Task Description: Organize files in a directory based on the first text that is not enclosed in square brackets. Move the files to su...

with open(os.path.join(directory, filename), 'r') as file:
match = re.search('(.*?)\[.*?\]', content)
subdirectory = match.group(1).strip()
os.makedirs(os.path.join(directory, subdirectory))
new_filename = f"{filename.split('.')[0]}_{datetime.now().strftime(DATE_FORMAT)}.{filename.split('.')[1]}"
shutil.move(os.path.join(directory, filename), os.path.join(directory, subdirectory, new_filename))
moved_files[subdirectory].append(new_filename)
return directory.resolve(), moved_files (bug A, Modify)

Legend: Add: line added to inject bug; Delete: line removed to inject bug; Modify: line changed to inject bug. Bug Injection: claude-sonnet-4.5

Model Generated Solution

from datetime import datetime
def task_func(directory):
DATE_FORMAT = '%Y%m%d%H%M%S'
for filename in os.listdir(directory):
filepath = os.path.join(directory, filename)

if os.path.isfile(filepath):

work page

[54] [74]

with open(filepath, 'r') as file:

work page

[55] [75]

content = file.read()

work page

[56] [76]

match = re.search(r'^(.*?)\[(.*?)\](.*)$', con tent, re.DOTALL)

work page

[57] [77]

match = re.search(r'\[(.*?)\](.*)$', conte nt, re.DOTALL)

work page

[58] [78]

before_brackets = content[:match.start ()]

work page

[59] [79]

first_non_bracket_text = before_bracke ts.split()[-1] if before_brackets.split() else ""

work page

[60] [80]

subdirectory = first_non_bracket_text

work page

[61] [81]

before_first_bracket = match.group(1)

work page

[62] [82]

parts = before_first_bracket.strip().split ()

work page

[63] [83]

subdirectory = parts[-1] if parts else "" 27

work page

[64] [84]

if not subdirectory:

work page

[65] [85]

if not os.path.exists(os.path.join(directory, subdirectory)):

work page

[66] [86]

os.makedirs(os.path.join(directory, subdir ectory)) 33

work page

[67] [87]

name_part, ext_part = os.path.splitext(filenam e)

work page

[68] [88]

{name_part}_{datetime.now().s trftime(DATE_FORMAT)}{ext_part}

new_filename = f"{name_part}_{datetime.now().s trftime(DATE_FORMAT)}{ext_part}"

work page

[69] [89]

shutil.move(filepath, os.path.join(directory, subdirectory, new_filename)) 37

work page

[70] [90]

if subdirectory not in moved_files:

work page

[71] [91]

moved_files[subdirectory] = []

work page

[72] [92]

moved_files[subdirectory].append(new_filename) 41

work page

[73] [93]

return os.path.abspath(directory), moved_files A Fixed bug Extra editIntroduced bug Debug Model: Qwen3-Coder-480B-A35B-Instruct-FP8 Figure 13: Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes. 📋Task Description Unzip a list of objects and their 3D coordinates, run PCA to reduce the dimensiona...

work page

[74] [104]

if plot_path is not None == True: A (Modify)

work page

[75] [108]

return coordinates_2d Add: Line added to inject bug Delete: Line removed to inject bug Modify: Line changed to inject bug Bug Injection: claude-sonnet-4.5 Model Generated Solution

work page

[76] [109]

from sklearn.decomposition import PCA

work page

[77] [111]

def task_func(data, save_plot=False, plot_path=None): 5

work page

[78] [112]

items, x_values, y_values, z_values = zip(*data)

work page

[79] [113]

coordinates = np.array(list(zip(x_values, y_values, z_valu es))) 8

work page

[80] [114]

pca = PCA(n_components=2)

work page