
arxiv: 2604.17338 · v4 · pith:P3SW5TQDnew · submitted 2026-04-19 · 💻 cs.SE · cs.CL

Precise Debugging Benchmark: Is Your Model Debugging or Regenerating?

Pith reviewed 2026-05-20 23:57 UTC · model grok-4.3

classification 💻 cs.SE cs.CL
keywords LLM code debugging · precise code editing · bug synthesis benchmark · edit precision metric · fault localization · regeneration vs editing · software engineering evaluation

The pith

Frontier LLMs achieve high test-pass rates on debugging but low edit precision because they regenerate rather than minimally fix code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Precise Debugging Benchmark to test whether LLMs localize faults and apply only the necessary edits, or simply overwrite the program with new correct code. It synthesizes atomic bugs into single-line and multi-line faulty programs, then scores models on edit-level precision (what share of a model's edits were actually necessary) and bug-level recall (what share of the injected bugs it eliminates). Even when told to edit minimally, models such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking reach unit-test pass rates above 76 percent yet stay below 45 percent precision. Iterative and agentic prompting strategies also fail to raise precision. The result shows that high functional correctness can hide a lack of precise debugging behavior.

Core claim

Frontier models regenerate correct but over-edited solutions during debugging tasks; the PDB framework converts existing coding datasets into precision-aware benchmarks by injecting verified atomic bugs, and the resulting metrics reveal that unit-test success above 76 percent coincides with edit precision below 45 percent even under explicit minimal-edit instructions.

What carries the argument

PDB framework that synthesizes verified atomic bugs and composes them into multi-bug programs, paired with edit-level precision and bug-level recall metrics that count only the necessary changes.
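
As a rough illustration of how the two metrics can diverge, the sketch below treats edits as sets of changed line numbers. That representation, the function names `edit_precision` and `bug_recall`, and the rule that a bug counts as resolved once all of its lines are touched are assumptions made for illustration; the paper's actual edit-matching procedure is not reproduced on this page.

```python
from typing import Dict, Set


def edit_precision(model_edits: Set[int], necessary_edits: Set[int]) -> float:
    """Fraction of the model's edited lines that were actually necessary."""
    if not model_edits:
        return 0.0  # convention chosen here: no edits means no precise edits
    return len(model_edits & necessary_edits) / len(model_edits)


def bug_recall(model_edits: Set[int], bugs: Dict[str, Set[int]]) -> float:
    """Fraction of injected bugs whose lines were all touched by the model."""
    if not bugs:
        return 1.0
    resolved = sum(1 for lines in bugs.values() if lines <= model_edits)
    return resolved / len(bugs)


# Hypothetical 20-line program with two injected single-line bugs.
bugs = {"bug_a": {4}, "bug_b": {12}}
necessary = set().union(*bugs.values())

minimal_fix = {4, 12}           # touches only the buggy lines
rewrite = set(range(1, 21))     # regenerates every line

print(edit_precision(minimal_fix, necessary), bug_recall(minimal_fix, bugs))  # 1.0 1.0
print(edit_precision(rewrite, necessary), bug_recall(rewrite, bugs))          # 0.1 1.0
```

Both outputs resolve every bug, so a unit-test score could not tell them apart; only the precision figure separates the minimal fix from the full rewrite.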

If this is right

  • Post-training pipelines for coding models must be redesigned to reward localized, minimal edits rather than full regeneration.
  • Iterative or agentic loops do not automatically produce more precise debugging behavior.
  • Benchmarks that only measure final test-pass rates will continue to overestimate debugging skill.
  • Single-line and multi-line bug variants expose different precision gaps that unit tests alone miss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training objectives that penalize unnecessary token changes could directly improve the observed precision gap.
  • The same synthesis method could be applied to other languages or domains to test whether the regeneration pattern is universal.
  • Real bug-fix datasets from open-source repositories could serve as an external check on whether synthetic atomic bugs capture typical fault distributions.

Load-bearing premise

The atomic bugs created by the synthesis process behave like the faults that appear in actual developer workflows.

What would settle it

A direct measurement on real-world bug reports showing that models achieve edit precision above 60 percent when explicitly prompted for minimal changes.

Figures

Figures reproduced from arXiv: 2604.17338 by Honghua Dong, Miaosen Chai, Robin Jia, Shangshang Wang, Song Bian, Wang Bill Zhu, Willie Neiswanger, Yejia Liu.

Figure 1. Real example from GPT-5.2 debugging a binary search program, where the model rewrites the entire solution. Green lines mark precise edits; gray lines highlight over-edits.
    Adjacent text (truncated): […] debugging and maintenance (Glass, 2002). When applied to debugging tasks, we observe that frontier LLMs often default to regeneration, i.e., rewriting large portions, or even the entirety, of a program when presented with buggy code ([…]

Figure 2. PDB pipeline. Generation: LLMs first synthesize and verify single-line bugs from existing coding datasets, which are then composed into multi-bug programs. Evaluation: Automated debugging systems are evaluated on these programs using both unit-test accuracy and edit-level precision and bug-level recall.
    Adjacent text (truncated): […] as map, which pairs each E_i with the closest edits in Ê. For each bug i, we construct a pseudo-revisio…

Figure 3. Data distribution of PDB-SINGLE-HARD.
    Adjacent text (truncated): Bug composition. To create more challenging debugging scenarios, we compose multiple atomic bugs into a single program. For each (x, C_gt) pair and a target bug count k, we randomly sample k distinct block edits from the generated bugs. To encourage independence between bugs, we enforce a stride constraint, requiring any two selected edits to be at least s lines apart. …

Figure 4. Correlation between precision, recall, and unit-test score across bug counts. Results are shown on subsets […]

Figure 5. Both iterative and agentic setups on PDB-SINGLE-HARD improve unit-test pass rates and recall over single-shot debugging, indicating higher functional success. However, edit-level precision does not improve and sometimes degrades. Notably, even Claude-Code with access to unit-test and execution feedback exhibits only 50% precision.
    Adjacent table (truncated):
    Model              Precision  Recall  Unit (%)
    Claude-Sonnet-4.5  65.9       73.9    64.8
    Gemini-2.5-…

Figure 6. Comparison of model performance under minimal-debug and freeform prompting on a subset of […]

Figure 7. Recall distribution over bug categories.

Figure 8. Model breakdown performance on PDB-SINGLE-HARD rewriting with the same generator, or a different generator.
    Adjacent table (truncated):
    ODC Category  Sub-category                  Brief Description
    Assignment    Mutability Trap               Mutable default arguments cause unintended shared state across calls.
                  Late Binding in Closures      Loop variables captured by reference, yielding unexpected final values.
                  List Multiplication Surprise  List multiplication creates multip…

Figure 9. Model averaged performance on PDB-SINGLE-HARD over distribution of buggy code length. All metrics show a similar performance drop.
    Adjacent text (truncated): […] of large quantities of plausible faulty code with minimal surface changes. Such capability may be misused to degrade software reliability in collaborative development settings, increase the review burden on maintainers, or seed low-quality code into shared repositories. Anothe…
Figure 10. Correlation between precision, recall, and unit-test score across bug counts. Results are shown on subsets […]
Figure 11. Redundant guard checks (9.8%): The model adds unnecessary defensive checks that don’t affect correctness.
Figure 12. Additional modifications (66.8%): The model makes additional modifications to correct code blocks beyond what is required to fix the bug.
Figure 13. Complete rewrite (7.8%): The model completely regenerates the solution rather than making minimal targeted fixes.
Figure 14. Discovering bugs missed by ground-truth (1.9%): The model identifies and fixes bugs that were […]
Figure 15. Functionally correct but undetected (70% of recall […]
Figure 16. Multiple minimal fixes (20% of recall<1 cases): A single bug can have multiple minimal correct fixes, and the model chose a different valid fix than the ground-truth.
Figure 17. Bug composition issue (10% of recall<1 cases): Compounding bugs introduced during bug-composition stage where one injected bug changes program logic affecting other bugs.
Figure 18. Under-repair (31.4%): The model fixes some bugs without introducing unnecessary edits but fails to […]
Figure 19. Regressive repair (39.2%): The model fixes all original bugs (recall […]
Figure 20. Bug injection prompt for benchmark construction.
Figure 21. Minimal debugging prompt with problem description and buggy code.
Figure 22. Minimal debugging prompt with unit tests.
Figure 23. Minimal debugging prompt with execution feedback.
Figure 24. Minimal debugging prompt with unit tests and execution feedback.
Figure 25. Free-form debugging prompt without minimal edit constraint.
Figure 26. Free-form debugging prompt with unit tests.
Figure 27. Free-form debugging prompt with execution feedback.
Figure 28. Free-form debugging prompt with unit tests and execution feedback.
Figure 29. External API template for minimal debugging.
Figure 30. External API template for free-form debugging.
Figure 31. Solution rewriting prompt for benchmark construction.
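
The bug composition step quoted next to Figure 3 (sample k distinct edits, keeping any two at least s lines apart) can be sketched with simple rejection sampling. This is only one plausible reading: representing each candidate edit by a single line number, the function name `sample_with_stride`, and the retry budget are assumptions, and the paper may implement the constraint differently.

```python
import random
from typing import List, Optional


def sample_with_stride(
    edit_lines: List[int], k: int, s: int, rng: random.Random, max_tries: int = 1000
) -> Optional[List[int]]:
    """Pick k distinct edit positions whose pairwise distance is at least s lines."""
    if k > len(edit_lines):
        return None
    for _ in range(max_tries):
        chosen = sorted(rng.sample(edit_lines, k))
        # After sorting, checking neighbours is enough to cover every pair.
        if all(b - a >= s for a, b in zip(chosen, chosen[1:])):
            return chosen
    return None  # no valid combination found within the retry budget


# Hypothetical candidate bug positions (line numbers) for one program.
candidates = [2, 3, 7, 8, 15, 21]
picked = sample_with_stride(candidates, k=3, s=4, rng=random.Random(0))
print(picked)
```

Rejection sampling keeps the draw uniform over valid combinations, at the cost of possibly giving up when valid combinations are rare; a constructive sampler would avoid that but bias the choice.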
Original abstract

Unlike code completion, debugging requires localizing faults and applying targeted edits. We observe that frontier LLMs often regenerate correct but over-edited solutions during debugging. To evaluate how far LLMs are from precise debugging, we introduce the Precise Debugging Benchmark (PDB) framework, which automatically converts any coding dataset into a debugging benchmark with precision-aware evaluation. PDB generates buggy programs by synthesizing verified atomic bugs and composing them into multi-bug programs. We define two novel metrics, edit-level precision and bug-level recall, which measure how many necessary edits are made and how many bugs are resolved. We release two evaluation benchmarks: PDB-Single-Hard on single-line bugs, and PDB-Multi on multi-line bugs. Experiments show that frontier models, such as GPT-5.1-Codex and DeepSeek-V3.2-Thinking, achieve unit-test pass rates above 76% but exhibit precision below 45%, even when explicitly instructed to perform minimal debugging. Finally, we show that iterative and agentic debugging strategies do not substantially improve precision or recall, highlighting the need to rethink post-training pipelines for coding models.
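
To make "over-edited" concrete, one can diff the buggy program against a model's output and count how many original lines were touched. The sketch below uses Python's standard `difflib`; the helper name `changed_lines` and the toy binary-search programs are illustrative assumptions, not artifacts from the paper.

```python
import difflib
from typing import Set


def changed_lines(before: str, after: str) -> Set[int]:
    """Return 1-based line numbers of `before` that a revision replaces or deletes.

    A pure insertion is attributed to the line it is inserted in front of
    (one past the last line when text is appended at the end).
    """
    a, b = before.splitlines(), after.splitlines()
    touched: Set[int] = set()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if tag == "insert":
            touched.add(i1 + 1)
        else:  # "replace" or "delete"
            touched.update(range(i1 + 1, i2 + 1))
    return touched


# Toy buggy program: line 6 should read `lo = mid + 1`.
buggy = """def search(xs, t):
    lo, hi = 0, len(xs)
    while lo < hi:
        mid = (lo + hi) // 2
        if xs[mid] < t:
            lo = mid
        else:
            hi = mid
    return lo
"""

# A minimal repair changes only the faulty line.
minimal = buggy.replace("lo = mid\n", "lo = mid + 1\n")

# A regenerated solution is also correct but rewrites almost everything.
rewrite = """def search(xs, t):
    left = 0
    right = len(xs)
    while left < right:
        m = left + (right - left) // 2
        if xs[m] >= t:
            right = m
        else:
            left = m + 1
    return left
"""

print(changed_lines(buggy, minimal))       # {6}
print(len(changed_lines(buggy, rewrite)))  # many more lines touched
```

A line-level diff is a coarse proxy: it cannot tell a renamed variable from a logic change, which is one reason a benchmark needs an explicit notion of which edits are necessary.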

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces the Precise Debugging Benchmark (PDB) framework, which automatically converts coding datasets into debugging benchmarks by synthesizing verified atomic bugs and composing them into multi-bug programs. It defines edit-level precision (measuring necessary targeted edits) and bug-level recall metrics to distinguish precise debugging from over-editing or regeneration. Experiments on PDB-Single-Hard and PDB-Multi show frontier models (e.g., GPT-5.1-Codex, DeepSeek-V3.2-Thinking) achieving >76% unit-test pass rates but <45% precision, even under minimal-debugging instructions; iterative and agentic strategies yield no substantial gains.

Significance. If the synthetic bugs prove representative of real faults, the work identifies a key gap in LLM post-training for coding: models solve tests via regeneration rather than localized fixes. The release of two concrete benchmarks (PDB-Single-Hard, PDB-Multi) and the precision/recall metrics supplies a reproducible evaluation resource that the community can use to measure and improve targeted editing behavior.

major comments (3)
  1. [Abstract] Abstract (PDB generation paragraph): the claim that synthesized atomic bugs and their compositions serve as faithful proxies for real developer debugging workflows is load-bearing for interpreting low edit-level precision as evidence of 'regeneration' rather than debugging, yet the manuscript supplies no comparison against established real-world bug corpora (e.g., Defects4J or QuixBugs).
  2. [Abstract] Abstract (bug synthesis description): verification procedures for the atomic bugs (e.g., how each insertion is confirmed to be a genuine fault that requires a specific edit rather than an arbitrary change) are not described, which directly affects the soundness of the edit-level precision metric.
  3. [Abstract] Abstract (experimental results): the reported precision gap (<45%) lacks any mention of statistical significance testing, confidence intervals, or variance across runs, making it difficult to assess whether the difference from unit-test pass rates is robust.
minor comments (1)
  1. Clarify in the metric definitions whether edit-level precision counts only exact string matches to the inserted bug locations or allows semantically equivalent but differently located fixes.
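
The statistical reporting requested in major comment 3 is cheap to produce once per-run scores exist. The sketch below computes a mean, sample standard deviation, and a t-based 95% confidence interval for a small number of runs. The function name `mean_ci95`, the hard-coded table of Student's t critical values, and the example scores are assumptions for illustration; the scores are made up and are not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_975 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci95(scores: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, 95% CI half-width)."""
    n = len(scores)
    df = n - 1
    if df not in T_975:
        raise ValueError(f"no tabulated t value for {n} runs")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation (divides by n - 1)
    half_width = T_975[df] * std / math.sqrt(n)
    return mean, std, half_width


# Made-up precision scores (%) from five hypothetical runs, NOT paper data.
runs = [43.1, 44.0, 42.5, 44.6, 43.3]
mean, std, hw = mean_ci95(runs)
print(f"precision = {mean:.1f} ± {hw:.1f} (std {std:.2f})")
```

With only a handful of runs the t distribution gives noticeably wider intervals than a normal approximation, which matters when the claim is a gap between two percentages.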

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below in a point-by-point fashion and indicate the changes planned for the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract (PDB generation paragraph): the claim that synthesized atomic bugs and their compositions serve as faithful proxies for real developer debugging workflows is load-bearing for interpreting low edit-level precision as evidence of 'regeneration' rather than debugging, yet the manuscript supplies no comparison against established real-world bug corpora (e.g., Defects4J or QuixBugs).

    Authors: We agree that validating the representativeness of the synthetic bugs against real-world corpora strengthens the interpretation of the precision results. The synthesis approach was chosen to enable atomic, verifiable bugs that support the edit-level precision metric, which is difficult to obtain from existing corpora. In the revision we will add a dedicated limitations and validation subsection that compares bug characteristics (e.g., edit span, failure mode distribution) with Defects4J and QuixBugs using publicly available metadata, while noting that a full end-to-end evaluation on real bugs remains future work. revision: partial

  2. Referee: [Abstract] Abstract (bug synthesis description): verification procedures for the atomic bugs (e.g., how each insertion is confirmed to be a genuine fault that requires a specific edit rather than an arbitrary change) are not described, which directly affects the soundness of the edit-level precision metric.

    Authors: The verification procedure (inserting a candidate edit, confirming that the original program passes its tests and the modified program fails, and checking that the minimal edit restores passing behavior) is described in Section 3.2 of the full manuscript. To address the abstract-level concern we will add a concise sentence in the abstract summarizing the verification step. revision: yes

  3. Referee: [Abstract] Abstract (experimental results): the reported precision gap (<45%) lacks any mention of statistical significance testing, confidence intervals, or variance across runs, making it difficult to assess whether the difference from unit-test pass rates is robust.

    Authors: We concur that statistical details improve interpretability. The revised manuscript will report 95% confidence intervals and standard deviation across five independent runs for both unit-test pass rate and edit precision on the main results tables and will briefly note this in the abstract. revision: yes
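
The three-part check the authors describe in their second response (the original passes, the edited program fails, undoing the edit restores passing) can be sketched for single-line bugs as follows. Everything here is a hypothetical stand-in: the helper names, the `(args, expected)` test format, and the use of `exec` on trusted toy snippets are assumptions, and the sketch does not guard against non-terminating programs.

```python
from typing import Sequence, Tuple

TestCase = Tuple[tuple, object]  # (positional arguments, expected return value)


def passes(source: str, func_name: str, tests: Sequence[TestCase]) -> bool:
    """Run `func_name` from `source` on every test; any exception counts as failure."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # only acceptable for trusted toy snippets
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in tests)
    except Exception:
        return False


def apply_line_edit(source: str, line_no: int, new_line: str) -> str:
    """Replace one 1-based line of `source`."""
    lines = source.splitlines()
    lines[line_no - 1] = new_line
    return "\n".join(lines) + "\n"


def verify_atomic_bug(original: str, line_no: int, buggy_line: str,
                      func_name: str, tests: Sequence[TestCase]) -> bool:
    """Accept a candidate single-line bug only if all three checks hold."""
    buggy = apply_line_edit(original, line_no, buggy_line)
    original_line = original.splitlines()[line_no - 1]
    restored = apply_line_edit(buggy, line_no, original_line)
    return (
        passes(original, func_name, tests)        # reference solution passes
        and not passes(buggy, func_name, tests)   # injected edit breaks a test
        and passes(restored, func_name, tests)    # undoing the edit fixes it
    )


original = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
tests = [((5, 0, 10), 5), ((-3, 0, 10), 0), ((42, 0, 10), 10)]

# A genuine fault: the inner min becomes max, so in-range values are wrong.
print(verify_atomic_bug(original, 2, "    return max(lo, max(x, hi))", "clamp", tests))  # True
# An equivalent rewrite changes the text but not the behavior, so it is rejected.
print(verify_atomic_bug(original, 2, "    return min(hi, max(x, lo))", "clamp", tests))  # False
```

The second call shows why the "modified program fails" check matters for the precision metric: an edit that changes text without changing behavior would otherwise be counted as a bug that requires a fix.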

Circularity Check

0 steps flagged

No significant circularity in derivation or metrics

full rationale

The paper constructs PDB by synthesizing verified atomic bugs into programs and directly defines edit-level precision and bug-level recall from counts of matching edits and resolved bugs. These are definitional choices for the benchmark rather than fitted parameters or equations that reduce outputs to inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the abstract or described framework. Results are empirical measurements of model behavior on the self-generated benchmark, making the central claims self-contained without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the reliability of automatically synthesized and verified atomic bugs as proxies for real debugging tasks; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Synthesized atomic bugs can be verified as causing the observed failure and remain minimal.
    The PDB generation process depends on this property to create controlled debugging instances.
invented entities (1)
  • edit-level precision metric (no independent evidence)
    purpose: Quantifies the fraction of necessary edits performed by the model.
    Newly defined evaluation measure introduced to distinguish precise debugging from regeneration.

pith-pipeline@v0.9.0 · 5745 in / 1220 out tokens · 28950 ms · 2026-05-20T23:57:20.423767+00:00 · methodology



Reference graph

Works this paper leans on

171 extracted references · 171 canonical work pages · 2 internal anchors

  1. [1]

    Robert L Glass

    A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.13888. Robert L Glass. 2002. Facts and fallacies of software engineering. Addison-Wesley Professional. Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi Li, Ruibo Liu, Yue Wang, and 1 others. 2024. Codeeditorbe...

  2. [2]

    Measuring Coding Challenge Competence With APPS

    Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938. Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, and Libo Qin. 2025. Mldebugging: Towards benchmarking code debugging across multi-library scenarios. arXiv preprint arXiv:2506.13824. Binyuan Hui, Jian Yang, Zeyu Cui, Jia...

  3. [3]

    StarCoder: may the source be with you!

    Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, and 1 others. 2023. Starcoder: may the source be with you! arXiv preprint arXiv:2305.06161. Yujia Li, David Choi, Juny...

  4. [4]

    In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8647–8657

    Debugbench: Evaluating debugging capability of large language models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 8647–8657. Manav Singhal, Tushar Aggarwal, Abhijeet Awasthi, Nagarajan Natarajan, and Aditya Kanade. 2024. Nofuneval: Funny how code lms falter on requirements beyond functional correctness. arX...

  5. [5]

    Yixuan Zhu, Zhitong Zeng, Zhaoxue Liu, Yixing Feng, Yuming Sun, Zhaoyang Chen, Yiling Liu, and Haoyu Wang

    Vibe checker: Aligning code evaluation with human preference. arXiv preprint arXiv:2510.07315. Yixuan Zhu, Zhitong Zeng, Zhaoxue Liu, Yixing Feng, Yuming Sun, Zhaoyang Chen, Yiling Liu, and Haoyu Wang. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 12th International Conference on...

  6. [6]

    demonstrates autonomous repair capabilities, and COAST (Yang et al., 2024b) enhances debugging through communicative agent-based data synthesis. These approaches utilize techniques ranging from zero-shot prompting to multi-turn conversational agents (Chen et al., 2023; Fan and Xia, 2024; Xia and Zhang, 2023; Zhong et al., 2024; Islam et al., 2024). ...

  7. [7]

    examine automatic bug fixing performance on existing datasets. A common limitation of these benchmarks, however, is their reliance on a simple, binary pass/fail metric on test cases (Zhang […]

    Model              Precision  Recall    Unit (%)
    Claude-Sonnet-4.5  78.1±0.7   85.7±0.6  81.9±0.9
    Gemini-2.5-Pro     77.9±0.7   87.5±0.6  83.8±0.8
    Qwen3-Coder-480B   73.5±0.8   82.4±0.7  77.4±0.9
    Kimi-K2...
