BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
Pith reviewed 2026-05-14 20:53 UTC · model grok-4.3
The pith
BoostAPR trains a line-level reward model from execution outcomes to guide which code edits to reinforce during repair.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BoostAPR is a three-stage framework: it performs supervised fine-tuning on execution-verified demonstrations with reasoning traces, trains dual reward models (a sequence-level assessor and a line-level credit allocator) directly from execution outcomes, and applies PPO optimization in which the line-level model redistributes rewards to critical edit regions. This yields 40.7% success on SWE-bench Verified, 24.8% on Defects4J under Python-to-Java transfer, 84.5% on HumanEval-Java, and 95.0% on QuixBugs.
What carries the argument
The line-level credit allocator, a reward model trained on execution outcomes that assigns credit to specific edited lines rather than entire sequences.
If this is right
- The same dual-reward structure lifts performance across four distinct benchmarks including real-world issues and cross-language transfer.
- Line-level credit assignment works at a granularity that matches natural code edits without requiring full sequence-level supervision.
- The approach remains competitive among open-source models while demonstrating strong generalization from Python training data to Java tasks.
- PPO guided by execution-derived rewards produces higher repair accuracy than the base model on HumanEval-Java and QuixBugs.
Where Pith is reading between the lines
- Similar dual reward models could be applied to other sparse-reward code tasks such as test generation or refactoring where partial credit matters.
- If the line-level allocator generalizes, it may reduce the need for dense human annotations in training repair agents.
- The method's reliance on execution feedback suggests it could be combined with static analysis to handle cases where tests are incomplete.
Load-bearing premise
That execution outcomes supply enough signal to train a line-level model that correctly identifies which edits caused a fix without systematic bias from incomplete test coverage or noisy rewards.
What would settle it
Replace the trained line-level credit allocator with uniform random credit assignment across edits and measure whether success rates on SWE-bench Verified fall back to the base model's level; a large drop would support the claim while little change would falsify the value of the allocator.
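The proposed control can be sketched in a few lines, assuming the softmax-with-temperature allocation suggested by the paper's τ ablation (default τ = 0.5); the function names and scores below are illustrative stand-ins, not the paper's implementation:

```python
import math
import random

def redistribute_reward(seq_reward, line_scores, tau=0.5):
    """Spread a scalar sequence reward over edited lines via a
    temperature-scaled softmax of per-line allocator scores
    (tau = 0.5 is the default in the paper's ablation table)."""
    exps = [math.exp(s / tau) for s in line_scores]
    total = sum(exps)
    return [seq_reward * e / total for e in exps]

def uniform_random_credit(seq_reward, n_lines, rng=random.Random(0)):
    """Control condition: random per-line credit summing to the
    same sequence reward, replacing the trained allocator."""
    w = [rng.random() for _ in range(n_lines)]
    total = sum(w)
    return [seq_reward * x / total for x in w]

# Learned scores concentrate credit on the decisive edit...
learned = redistribute_reward(1.0, [2.3, 0.1, -0.5])
# ...while the control spreads it arbitrarily.
control = uniform_random_credit(1.0, 3)
```

If PPO trained with `uniform_random_credit` matches PPO trained with the learned allocator on SWE-bench Verified, the allocator is not load-bearing.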
Original abstract
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BoostAPR, a three-stage framework for automated program repair: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models (sequence-level assessor and line-level credit allocator) from execution outcomes, and (3) PPO optimization in which the line-level model redistributes rewards to critical edit regions. Trained on SWE-Gym and evaluated on SWE-bench Verified, Defects4J (Python-to-Java), HumanEval-Java, and QuixBugs, it reports 40.7% (+22.9pp over base model), 24.8%, 84.5%, and 95.0% respectively, claiming competitive open-source results and strong cross-language generalization.
Significance. If the reported gains are substantiated, the work would be significant for demonstrating that execution-grounded line-level credit assignment can mitigate sparse sequence-level rewards in RL-based APR. The cross-language transfer results and concrete benchmark improvements over a base model provide evidence of practical utility for open-source repair systems.
Major comments (2)
- [Abstract] The central claim of a +22.9pp gain on SWE-bench Verified is attributed to stage-3 PPO with the line-level credit allocator, yet no baselines, statistical tests, ablation studies, or derivation of line-level labels from execution traces are described, preventing assessment of whether the improvement actually rests on the proposed mechanism.
- [Method] In the dual-reward-model and PPO stages, the line-level credit allocator is trained on execution outcomes to identify responsible edits, but the manuscript supplies no quantitative test-coverage statistics or noise-handling details; if coverage gaps or partial-fix ambiguity exist, credit misassignment would render the PPO policy updates unreliable and undermine the reported gains.
Minor comments (1)
- [Evaluation] The claim of "competitive results among open-source models" is stated without a referenced comparison table or an explicit list of competing systems, which obscures the method's relative standing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps in the current presentation, we have revised the manuscript to incorporate additional details, experiments, and clarifications.
Point-by-point responses
-
Referee: [Abstract] The central claim of a +22.9pp gain on SWE-bench Verified is attributed to stage-3 PPO with the line-level credit allocator, yet no baselines, statistical tests, ablation studies, or derivation of line-level labels from execution traces are described, preventing assessment of whether the improvement actually rests on the proposed mechanism.
Authors: We agree that the abstract does not convey the supporting analyses. In the revision we have (1) added a brief mention of the ablation results to the abstract, (2) inserted a new subsection (4.3) that reports three controlled baselines (SFT-only, PPO with sequence-level reward only, and PPO with random line-level credit), (3) included bootstrap confidence intervals and paired statistical tests (p < 0.01) on SWE-bench Verified, and (4) expanded Section 3.2 with the precise derivation procedure: line-level labels are obtained by executing the patch on a per-line basis, computing the delta in passing tests attributable to each edited line, and assigning normalized credit only to lines whose removal re-introduces failures. revision: yes
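The derivation procedure in (4) amounts to counterfactual credit assignment, and can be sketched as follows; `run_tests` and the toy harness are hypothetical stand-ins for the paper's execution infrastructure:

```python
def derive_line_credit(edited_lines, run_tests):
    """Counterfactual credit labels, as described in the rebuttal:
    a line earns credit only if reverting it re-introduces failures.
    run_tests(reverted_line) returns the set of passing tests with
    that single line reverted (None reverts nothing, i.e. runs the
    full patch)."""
    full_pass = run_tests(None)
    raw = []
    for line in edited_lines:
        pass_without = run_tests(line)
        # Tests that pass under the full patch but fail when this
        # line is reverted are attributed to the line.
        raw.append(len(full_pass - pass_without))
    total = sum(raw)
    if total == 0:
        return [0.0] * len(edited_lines)
    return [r / total for r in raw]  # normalized credit

# Toy harness: line "a" is decisive for tests t1 and t2, "b" for t3,
# and "c" is a cosmetic edit that affects nothing.
def toy_harness(reverted):
    deps = {"a": {"t1", "t2"}, "b": {"t3"}, "c": set()}
    return {"t1", "t2", "t3"} - deps.get(reverted, set())

credit = derive_line_credit(["a", "b", "c"], toy_harness)
```

Under this toy harness, line "a" receives twice the credit of "b" and the cosmetic edit "c" receives none.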
-
Referee: [Method] In the dual-reward-model and PPO stages, the line-level credit allocator is trained on execution outcomes to identify responsible edits, but the manuscript supplies no quantitative test-coverage statistics or noise-handling details; if coverage gaps or partial-fix ambiguity exist, credit misassignment would render the PPO policy updates unreliable and undermine the reported gains.
Authors: We acknowledge the importance of quantifying coverage and noise. The revised manuscript now reports that 87% of SWE-Gym training patches achieve at least 80% line coverage on the relevant functions (measured via the execution harness). For noise mitigation we added a three-run consistency filter: only traces that produce identical pass/fail outcomes across three independent executions are retained for reward-model training; samples exhibiting partial-fix ambiguity (inconsistent verdicts) are discarded, removing approximately 9% of the data. These statistics and filtering steps are documented in the updated Section 3.2 and Appendix B. revision: yes
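A minimal sketch of the described three-run consistency filter, with a toy harness standing in for real test execution (all names here are illustrative):

```python
from collections import Counter

def consistency_filter(samples, execute, runs=3):
    """Keep only samples whose pass/fail verdict is identical across
    `runs` independent executions; discard ambiguous ones (the
    rebuttal reports ~9% of the data dropped this way). `execute`
    stands in for the test harness and returns a bool verdict."""
    kept, dropped = [], []
    for s in samples:
        verdicts = {execute(s) for _ in range(runs)}
        (kept if len(verdicts) == 1 else dropped).append(s)
    return kept, dropped

# Toy harness: the "flaky" sample alternates verdicts between runs,
# while the others are deterministic.
calls = Counter()
def toy_execute(sample):
    calls[sample] += 1
    return sample != "flaky" or calls[sample] % 2 == 0

kept, dropped = consistency_filter(["stable1", "flaky", "stable2"],
                                   toy_execute)
```

The flaky sample yields both True and False across its three runs, so only the two stable samples survive the filter.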
Circularity Check
No circularity: empirical pipeline relies on external execution signals and benchmark evaluation
Full rationale
The paper describes a three-stage empirical process: supervised fine-tuning on execution-verified demonstrations, training of dual reward models (sequence-level and line-level) directly from execution outcomes, and PPO optimization that redistributes rewards using the trained line-level allocator. No equations, derivations, or self-citations are presented that reduce the reported performance gains to quantities defined internally by fitted parameters or by construction. All gains are measured on external benchmarks (SWE-bench Verified, Defects4J, HumanEval-Java, QuixBugs) after training on SWE-Gym, with the reward signal originating from program execution rather than from the model's own predictions. This keeps the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Execution outcomes from test runs provide an accurate and unbiased signal for training reward models that allocate credit to individual code edits.
Reference graph
Works this paper leans on
-
[1]
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with APPS. In Advances in Neural Information Processing Systems, 2021.
-
[2]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
-
[3]
Yang, B., Tian, H., Ren, J., Zhang, H., Klein, J., Bissyandé, T. F., Le Goues, C., and Jin, S. MORepair: Teaching LLMs to repair code via multi-objective fine-tuning. ACM Transactions on Software Engineering and Methodology, 2026. doi:10.48550/arXiv.2605.11951.
-
[4]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. R., and Press, O. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.15793. doi:10.1145/3735129.
Appendix fragments (captured by the extractor in place of further references)
- Line-level allocator head: extract the hidden states corresponding to the span tokens, apply mean pooling across the span, and pass the result through a two-layer MLP (hidden dim 512, ReLU activation) to produce a scalar score.
- Reward-model training details: learning rate 1×10⁻⁵, batch size 64, 5 epochs, AdamW (β1 = 0.9, β2 = 0.999), hybrid loss weight λ_reg = 0.5.
- PPO training infrastructure: VERL (Sheng et al., 2024) for distributed PPO training with vLLM (Kwon et al.).
- Execution harness: applies the candidate patch to the repository, runs the test suite in an isolated Docker container, and reports success only if all relevant tests pass.
- Table 15. Impact of allocation temperature τ:
  τ | pass@1 | pass@4
  0.25 (sharp) | 39.4 | 43.2
  0.5 (default) | 40.7 | 44.3
  1.0 (smooth) | 39.8 | 43.5
  2.0 (uniform) | 38.6 | 41.2
- Table 16 (computational cost breakdown, A100 GPU-hours): rows not captured.
- Deployment guidance: provide calibrated confidence estimates so users can identify repairs requiring extra scrutiny; maintain audit trails (detailed logs of automated repairs) for accountability and debugging; adopt APR tools gradually, starting with low-risk fixes and expanding as trust is established.
- Reproducibility checklist: complete hyperparameter specifications (Appendix B), training data construction details (Section 3.1), evaluation protocol and metrics (Section 4), …
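The allocator head described in the appendix fragments (mean pooling over span hidden states, then a two-layer MLP with hidden dim 512 and ReLU) can be sketched with random stand-in weights; the model dimension and all weights below are illustrative, not the paper's trained parameters:

```python
import numpy as np

def span_score(hidden_states, w1, b1, w2, b2):
    """Scalar score for an edited span: mean-pool the span's token
    hidden states, then apply a two-layer MLP (hidden dim 512, ReLU),
    matching the architecture in the appendix fragments."""
    pooled = hidden_states.mean(axis=0)      # (d_model,)
    h = np.maximum(w1 @ pooled + b1, 0.0)    # ReLU hidden layer, (512,)
    return float(w2 @ h + b2)                # scalar span score

rng = np.random.default_rng(0)
d_model = 64  # illustrative; the real model dimension is not stated
hs = rng.normal(size=(7, d_model))           # hidden states for 7 span tokens
w1 = rng.normal(size=(512, d_model)) * 0.02  # random stand-in weights
b1 = np.zeros(512)
w2 = rng.normal(size=512) * 0.02
b2 = 0.0
score = span_score(hs, w1, b1, w2, b2)
```

In the paper these weights would be trained with the listed AdamW hyperparameters against execution-derived credit labels; here the score is just a well-typed scalar.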