Recognition: unknown
Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?
Pith reviewed 2026-05-10 06:02 UTC · model grok-4.3
The pith
Frontier LLMs pass unit tests on debugging tasks at rates above 76 percent yet edit with precision below 45 percent, often regenerating entire solutions instead of making targeted fixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Frontier models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking achieve unit-test pass rates above 76 percent but exhibit precision below 45 percent, even when explicitly instructed to perform minimal debugging. The Precise Debugging Benchmark framework automatically converts coding datasets into debugging benchmarks by synthesizing verified atomic bugs and composing them into multi-bug programs, then evaluates outputs with edit-level precision and bug-level recall metrics.
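The synthesis procedure itself is only summarized here. A rough sketch of the verification step, assuming a hypothetical passes_tests harness and LLM-proposed mutation functions (neither is the authors' implementation):

```python
from typing import Callable

def is_verified_atomic_bug(correct: str, mutated: str,
                           passes_tests: Callable[[str], bool]) -> bool:
    """Keep a candidate mutation only if the reference solution passes the
    suite, the mutation actually changes the code, and the change breaks
    at least one test (so the bug is observable)."""
    return (passes_tests(correct)
            and mutated != correct
            and not passes_tests(mutated))

def compose_multi_bug(correct: str,
                      mutations: list[Callable[[str], str]],
                      passes_tests: Callable[[str], bool]) -> str:
    """One plausible composition rule: apply atomic mutations in sequence,
    keeping each one only while the program still fails the tests."""
    program = correct
    for mutate in mutations:
        candidate = mutate(program)
        if candidate != program and not passes_tests(candidate):
            program = candidate
    return program
```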
What carries the argument
The Precise Debugging Benchmark (PDB) framework, which generates buggy programs from verified atomic bugs and evaluates models with edit-level precision (the fraction of the model's edits that are actually necessary) and bug-level recall (the fraction of injected bugs that are resolved).
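The paper's formal metric definitions are not reproduced on this page. A minimal line-level reading of the two metrics, built on Python's standard difflib (the set-of-edits formulation is an assumption, not the authors' exact definition):

```python
import difflib

def line_edits(before: str, after: str) -> set[tuple[str, str]]:
    """All line-level edit operations (op, line) that turn `before` into `after`."""
    edits = set()
    sm = difflib.SequenceMatcher(a=before.splitlines(), b=after.splitlines())
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        edits.update(("delete", line) for line in sm.a[i1:i2])
        edits.update(("insert", line) for line in sm.b[j1:j2])
    return edits

def edit_level_precision(buggy: str, model_output: str, minimal_fix: str) -> float:
    """Fraction of the model's edits that also occur in the minimal ground-truth fix."""
    model = line_edits(buggy, model_output)
    necessary = line_edits(buggy, minimal_fix)
    return len(model & necessary) / len(model) if model else 0.0

def bug_level_recall(bugs_resolved: int, bugs_injected: int) -> float:
    """Fraction of injected bugs that the model's output actually resolves."""
    return bugs_resolved / bugs_injected if bugs_injected else 0.0
```

On this reading, a full regeneration that happens to include the necessary lines still scores low precision, since its many extra edits dilute the intersection.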
If this is right
- Unit-test pass rates alone fail to distinguish precise debugging from full regeneration of a correct program (a toy illustration follows this list).
- Explicit instructions to minimize edits do not raise precision above 45 percent on either single-bug or multi-bug tasks.
- Iterative and agentic debugging workflows produce no meaningful improvement in either precision or recall.
- Post-training pipelines for coding models must be redesigned to favor minimal localized edits over complete rewrites.
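A toy illustration of the first point above: a one-character fix and a full rewrite both pass the lone test, so pass rate cannot separate them and only an edit-aware measure can. The add function and the similarity measure here are invented for illustration:

```python
import difflib

buggy       = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"
rewrite     = "def add(x, y):\n    result = x + y\n    return result\n"

# Both candidates pass the only unit test, so pass rate is identical.
for candidate in (minimal_fix, rewrite):
    namespace = {}
    exec(candidate, namespace)
    assert namespace["add"](2, 3) == 5

# An edit-aware measure separates them: fraction of the buggy source left intact.
for name, candidate in (("minimal fix", minimal_fix), ("rewrite", rewrite)):
    kept = difflib.SequenceMatcher(a=buggy, b=candidate).ratio()
    print(f"{name}: unchanged fraction {kept:.2f}")  # high for the fix, low for the rewrite
```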
Where Pith is reading between the lines
- Training objectives that optimize only for test outcomes may be systematically rewarding any passing code rather than rewarding small correct changes.
- Developers who rely on these models for bug repair will frequently receive large diffs containing many unnecessary modifications.
- Future model training could incorporate human minimal-edit traces or explicit edit-distance penalties to close the observed precision gap.
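The edit-distance-penalty idea in the last point can be made concrete with a reward-shaping sketch. The penalty weight lam and the character-level distance are illustrative choices, not anything from the paper:

```python
import difflib

def debug_reward(buggy: str, candidate: str, tests_pass: bool,
                 lam: float = 0.5) -> float:
    """Reward = test outcome minus an explicit edit-distance penalty.

    The penalty is the fraction of `buggy` that the candidate rewrites:
    0.0 for an untouched program, approaching 1.0 for a full regeneration.
    """
    rewritten = 1.0 - difflib.SequenceMatcher(a=buggy, b=candidate).ratio()
    return (1.0 if tests_pass else 0.0) - lam * rewritten
```

With lam near 1, a passing full rewrite earns almost nothing, so an objective of this shape stops rewarding regeneration.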
Load-bearing premise
The automatically synthesized verified atomic bugs and their compositions accurately represent real-world debugging scenarios, and the new edit-level precision and bug-level recall metrics validly measure precise debugging behavior.
What would settle it
A direct comparison of model behavior on the PDB benchmarks versus a large collection of human-written buggy programs that come with known minimal correct fixes, testing whether the low-precision pattern disappears on the human bugs.
Figures
Figure 11: Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don't affect correctness. Bug injection: gpt-5.1-codex; debug model: gemini-2.5-pro.
Figure 12: Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug. Task: draw a circle with a given radius on the polar chart 'ax' and set radial ticks. Bug injection: claude-sonnet-4.5; debug model: Kimi-K2-Thinking.
Figure 13: Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes. Task: organize files in a directory based on the first text that is not enclosed in square brackets. Bug injection: claude-sonnet-4.5; debug model: Qwen3-Coder-480B-A35B-Instruct-FP8.
Results table (fragment; precision / recall / unit-test pass rate, %):
Claude-Sonnet-4.5: 78.1±0.7 / 85.7±0.6 / 81.9±0.9
Gemini-2.5-Pro: 77.9±0.7 / 87.5±0.6 / 83.8±0.8
Qwen3-Coder-480B: 73.5±0.8 / 82.4±0.7 / 77.4±0.9
Original abstract
Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Precise Debugging Benchmark (PDB) framework that automatically converts coding datasets into debugging tasks by synthesizing verified atomic bugs and composing them into multi-bug programs. It defines edit-level precision and bug-level recall metrics to distinguish precise, minimal debugging from code regeneration. Two benchmarks (PDB-Single-Hard for single-line bugs and PDB-Multi for multi-line bugs) are released, and experiments show frontier models (e.g., GPT-5.1-Codex, DeepSeek-V3.2-Thinking) achieve >76% unit-test pass rates but <45% precision even when instructed to debug minimally; iterative and agentic strategies yield little improvement.
Significance. If the synthetic bugs and compositions are shown to be representative of real debugging edit distributions, the results would be significant for highlighting a systematic tendency in current coding LLMs to over-edit or regenerate rather than apply targeted fixes. The new precision-aware metrics and open benchmarks provide concrete tools for future evaluation and could inform post-training improvements in software engineering applications.
major comments (1)
- [PDB framework and benchmark generation] Benchmark construction (PDB synthesis process): The headline finding that high pass rates reflect regeneration rather than debugging is load-bearing on the assumption that the automatically synthesized atomic bugs and their compositions produce minimal-edit solutions whose statistical structure matches human debugging scenarios. No quantitative comparison (e.g., edit-distance histograms, bug-interaction graphs, or direct matching to a human-curated corpus) is reported, so the low precision (<45%) could be an artifact of localized, non-interacting synthetic bugs rather than a general model property.
minor comments (1)
- [Abstract] Abstract: The reported thresholds ('above 76%' pass rate, 'below 45%' precision) are given without the underlying sample sizes, exact model versions, or dataset sources, making it harder to assess robustness or reproduce the quantitative claims.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment on the PDB framework and benchmark generation below.
Point-by-point responses
- Referee: The headline finding that high pass rates reflect regeneration rather than debugging is load-bearing on the assumption that the automatically synthesized atomic bugs and their compositions produce minimal-edit solutions whose statistical structure matches human debugging scenarios. No quantitative comparison (e.g., edit-distance histograms, bug-interaction graphs, or direct matching to a human-curated corpus) is reported, so the low precision (<45%) could be an artifact of localized, non-interacting synthetic bugs rather than a general model property.
Authors: We acknowledge the referee's concern that the headline result relies on the synthetic bugs producing minimal-edit ground truths representative of real debugging. By construction, each atomic bug in PDB is the smallest change that causes test failure (verified by re-running tests after the edit), and multi-bug programs are formed by controlled composition of such atoms. This design provides explicit minimal-edit oracles for the precision metric. We did not report quantitative comparisons (edit-distance histograms or matches to human corpora) in the submitted manuscript. While we maintain that the observed low precision demonstrates a regeneration bias even when minimal fixes are feasible, we agree a direct distributional comparison would better address potential artifacts from synthetic localization. In the revision we will add a dedicated analysis subsection with edit-distance histograms drawn from PDB alongside those from publicly available human fix collections (e.g., Defects4J and ManyBugs) and will discuss observed differences in interaction complexity.
Revision: yes
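The distributional check promised here could look roughly like the following. A minimal sketch, assuming each corpus is loaded as (buggy, fixed) source pairs; the loader names in the comment are placeholders, and this is not the authors' planned analysis:

```python
import difflib
from collections import Counter

def fix_size(buggy: str, fixed: str) -> float:
    """Normalized fix size: 0.0 = identical programs, 1.0 = complete rewrite."""
    return 1.0 - difflib.SequenceMatcher(a=buggy, b=fixed).ratio()

def edit_distance_histogram(pairs: list[tuple[str, str]], bins: int = 10) -> Counter:
    """Bucket normalized fix sizes so two corpora can be compared side by side."""
    hist = Counter()
    for buggy, fixed in pairs:
        hist[min(int(fix_size(buggy, fixed) * bins), bins - 1)] += 1
    return hist

# e.g. compare synthetic PDB fixes against human fixes from Defects4J:
# edit_distance_histogram(pdb_pairs) vs. edit_distance_histogram(defects4j_pairs)
```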
Circularity Check
No circularity: new benchmark construction with external model evaluation
Full rationale
The paper's chain consists of (1) defining a synthesis procedure for atomic bugs and their compositions to create PDB-Single-Hard and PDB-Multi, (2) introducing edit-level precision and bug-level recall as new metrics, and (3) running empirical evaluations of external frontier models (GPT-5.1-Codex, DeepSeek-V3.2-Thinking, etc.) on unit-test pass rates versus the new metrics. None of these steps reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central observation (high pass-rate, low precision) is a direct empirical measurement on held-out models and does not presuppose its own conclusion. The mapping from synthetic bugs to real debugging is an external validity assumption, not a circular derivation internal to the paper's equations or citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: synthesized verified atomic bugs accurately simulate real debugging scenarios when composed into multi-bug programs.