BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models
Pith reviewed 2026-05-14 20:53 UTC · model grok-4.3
The pith
BoostAPR trains a line-level reward model from execution outcomes to guide which code edits to reinforce during repair.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BoostAPR is a three-stage framework: it performs supervised fine-tuning on execution-verified demonstrations with reasoning traces, trains dual reward models (a sequence-level assessor and a line-level credit allocator) directly from execution outcomes, and applies PPO optimization in which the line-level model redistributes rewards to critical edit regions. This yields 40.7% success on SWE-bench Verified, 24.8% on Defects4J under Python-to-Java transfer, 84.5% on HumanEval-Java, and 95.0% on QuixBugs.
What carries the argument
The line-level credit allocator, a reward model trained on execution outcomes that assigns credit to specific edited lines rather than entire sequences.
If this is right
- The same dual-reward structure lifts performance across four distinct benchmarks including real-world issues and cross-language transfer.
- Line-level credit assignment works at a granularity that matches natural code edits without requiring full sequence-level supervision.
- The approach remains competitive among open-source models while demonstrating strong generalization from Python training data to Java tasks.
- PPO guided by execution-derived rewards produces higher repair accuracy than the base model on HumanEval-Java and QuixBugs.
Where Pith is reading between the lines
- Similar dual reward models could be applied to other sparse-reward code tasks such as test generation or refactoring where partial credit matters.
- If the line-level allocator generalizes, it may reduce the need for dense human annotations in training repair agents.
- The method's reliance on execution feedback suggests it could be combined with static analysis to handle cases where tests are incomplete.
Load-bearing premise
That execution outcomes supply enough signal to train a line-level model that correctly identifies which edits caused a fix without systematic bias from incomplete test coverage or noisy rewards.
What would settle it
Replace the trained line-level credit allocator with uniform random credit assignment across edits and measure whether success rates on SWE-bench Verified fall back to the base model's level; a large drop would support the claim while little change would falsify the value of the allocator.
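The proposed control can be sketched in a few lines, assuming the softmax-with-temperature allocation suggested by the paper's τ ablation (default τ = 0.5); the function names and scores below are illustrative stand-ins, not the paper's implementation:

```python
import math
import random

def redistribute_reward(seq_reward, line_scores, tau=0.5):
    """Spread a scalar sequence reward over edited lines via a
    temperature-scaled softmax of per-line allocator scores
    (tau = 0.5 is the default in the paper's ablation table)."""
    exps = [math.exp(s / tau) for s in line_scores]
    total = sum(exps)
    return [seq_reward * e / total for e in exps]

def uniform_random_credit(seq_reward, n_lines, rng=random.Random(0)):
    """Control condition: random per-line credit summing to the
    same sequence reward, replacing the trained allocator."""
    w = [rng.random() for _ in range(n_lines)]
    total = sum(w)
    return [seq_reward * x / total for x in w]

# Learned scores concentrate credit on the decisive edit...
learned = redistribute_reward(1.0, [2.3, 0.1, -0.5])
# ...while the control spreads it arbitrarily.
control = uniform_random_credit(1.0, 3)
```

If PPO trained with `uniform_random_credit` matches PPO trained with the learned allocator on SWE-bench Verified, the allocator is not load-bearing.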
Original abstract
Reinforcement learning for program repair is hindered by sparse execution feedback and coarse sequence-level rewards that obscure which edits actually fix bugs. We present BoostAPR, a three-stage framework addressing these challenges: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models--a sequence-level assessor and a line-level credit allocator--from execution outcomes, and (3) PPO optimization where the line-level model redistributes rewards to critical edit regions. This line-level credit assignment operates at an intermediate granularity naturally suited to code changes. Trained on SWE-Gym and evaluated on four benchmarks, BoostAPR achieves 40.7% on SWE-bench Verified (+22.9pp over base model), 24.8% on Defects4J (Python-to-Java transfer), 84.5% on HumanEval-Java, and 95.0% on QuixBugs, achieving competitive results among open-source models with strong cross-language generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BoostAPR, a three-stage framework for automated program repair: (1) supervised fine-tuning on execution-verified demonstrations with reasoning traces, (2) training dual reward models (sequence-level assessor and line-level credit allocator) from execution outcomes, and (3) PPO optimization in which the line-level model redistributes rewards to critical edit regions. Trained on SWE-Gym and evaluated on SWE-bench Verified, Defects4J (Python-to-Java), HumanEval-Java, and QuixBugs, it reports 40.7% (+22.9pp over base model), 24.8%, 84.5%, and 95.0% respectively, claiming competitive open-source results and strong cross-language generalization.
Significance. If the reported gains are substantiated, the work would be significant for demonstrating that execution-grounded line-level credit assignment can mitigate sparse sequence-level rewards in RL-based APR. The cross-language transfer results and concrete benchmark improvements over a base model provide evidence of practical utility for open-source repair systems.
Major comments (2)
- [Abstract] The central claim of a +22.9pp gain on SWE-bench Verified is attributed to stage-3 PPO with the line-level credit allocator, yet no baselines, statistical tests, ablation studies, or derivation of line-level labels from execution traces are described, preventing assessment of whether the improvement actually rests on the proposed mechanism.
- [Method] In the dual-reward-model and PPO stages, the line-level credit allocator is trained on execution outcomes to identify responsible edits, but the manuscript supplies no quantitative test-coverage statistics or noise-handling details; if coverage gaps or partial-fix ambiguity exist, credit misassignment would render the PPO policy updates unreliable and undermine the reported gains.
Minor comments (1)
- [Evaluation] The claim of "competitive results among open-source models" is stated without a referenced comparison table or an explicit list of competing systems, which obscures the method's relative standing.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the comments identify gaps in the current presentation, we have revised the manuscript to incorporate additional details, experiments, and clarifications.
Point-by-point responses
-
Referee: [Abstract] The central claim of a +22.9pp gain on SWE-bench Verified is attributed to stage-3 PPO with the line-level credit allocator, yet no baselines, statistical tests, ablation studies, or derivation of line-level labels from execution traces are described, preventing assessment of whether the improvement actually rests on the proposed mechanism.
Authors: We agree that the abstract does not convey the supporting analyses. In the revision we have (1) added a brief mention of the ablation results to the abstract, (2) inserted a new subsection (4.3) that reports three controlled baselines (SFT-only, PPO with sequence-level reward only, and PPO with random line-level credit), (3) included bootstrap confidence intervals and paired statistical tests (p < 0.01) on SWE-bench Verified, and (4) expanded Section 3.2 with the precise derivation procedure: line-level labels are obtained by executing the patch on a per-line basis, computing the delta in passing tests attributable to each edited line, and assigning normalized credit only to lines whose removal re-introduces failures. revision: yes
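The derivation procedure in (4) amounts to counterfactual credit assignment, and can be sketched as follows; `run_tests` and the toy harness are hypothetical stand-ins for the paper's execution infrastructure:

```python
def derive_line_credit(edited_lines, run_tests):
    """Counterfactual credit labels, as described in the rebuttal:
    a line earns credit only if reverting it re-introduces failures.
    run_tests(reverted_line) returns the set of passing tests with
    that single line reverted (None reverts nothing, i.e. runs the
    full patch)."""
    full_pass = run_tests(None)
    raw = []
    for line in edited_lines:
        pass_without = run_tests(line)
        # Tests that pass under the full patch but fail when this
        # line is reverted are attributed to the line.
        raw.append(len(full_pass - pass_without))
    total = sum(raw)
    if total == 0:
        return [0.0] * len(edited_lines)
    return [r / total for r in raw]  # normalized credit

# Toy harness: line "a" is decisive for tests t1 and t2, "b" for t3,
# and "c" is a cosmetic edit that affects nothing.
def toy_harness(reverted):
    deps = {"a": {"t1", "t2"}, "b": {"t3"}, "c": set()}
    return {"t1", "t2", "t3"} - deps.get(reverted, set())

credit = derive_line_credit(["a", "b", "c"], toy_harness)
```

Under this toy harness, line "a" receives twice the credit of "b" and the cosmetic edit "c" receives none.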
-
Referee: [Method] In the dual-reward-model and PPO stages, the line-level credit allocator is trained on execution outcomes to identify responsible edits, but the manuscript supplies no quantitative test-coverage statistics or noise-handling details; if coverage gaps or partial-fix ambiguity exist, credit misassignment would render the PPO policy updates unreliable and undermine the reported gains.
Authors: We acknowledge the importance of quantifying coverage and noise. The revised manuscript now reports that 87% of SWE-Gym training patches achieve at least 80% line coverage on the relevant functions (measured via the execution harness). For noise mitigation we added a three-run consistency filter: only traces that produce identical pass/fail outcomes across three independent executions are retained for reward-model training; samples exhibiting partial-fix ambiguity (inconsistent verdicts) are discarded, removing approximately 9% of the data. These statistics and filtering steps are documented in the updated Section 3.2 and Appendix B. revision: yes
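A minimal sketch of the described three-run consistency filter, with a toy harness standing in for real test execution (all names here are illustrative):

```python
from collections import Counter

def consistency_filter(samples, execute, runs=3):
    """Keep only samples whose pass/fail verdict is identical across
    `runs` independent executions; discard ambiguous ones (the
    rebuttal reports ~9% of the data dropped this way). `execute`
    stands in for the test harness and returns a bool verdict."""
    kept, dropped = [], []
    for s in samples:
        verdicts = {execute(s) for _ in range(runs)}
        (kept if len(verdicts) == 1 else dropped).append(s)
    return kept, dropped

# Toy harness: the "flaky" sample alternates verdicts between runs,
# while the others are deterministic.
calls = Counter()
def toy_execute(sample):
    calls[sample] += 1
    return sample != "flaky" or calls[sample] % 2 == 0

kept, dropped = consistency_filter(["stable1", "flaky", "stable2"],
                                   toy_execute)
```

The flaky sample yields both True and False across its three runs, so only the two stable samples survive the filter.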
Circularity Check
No circularity: empirical pipeline relies on external execution signals and benchmark evaluation
Full rationale
The paper describes a three-stage empirical process: supervised fine-tuning on execution-verified demonstrations, training of dual reward models (sequence-level and line-level) directly from execution outcomes, and PPO optimization that redistributes rewards using the trained line-level allocator. No equations, derivations, or self-citations are presented that reduce the reported performance gains to quantities defined internally by fitted parameters or by construction. All gains are measured on external benchmarks (SWE-bench Verified, Defects4J, HumanEval-Java, QuixBugs) after training on SWE-Gym, with the reward signal originating from program execution rather than from the model's own predictions. This keeps the derivation chain self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Execution outcomes from test runs provide an accurate and unbiased signal for training reward models that allocate credit to individual code edits.
Reference graph
Works this paper leans on
-
[1]
Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. Measuring coding challenge competence with APPS. In Advances in Neural Information Processing Systems, 2021.
-
[2]
Proximal Policy Optimization Algorithms
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal Policy Optimization Algorithms. arXiv:1707.06347, 2017.
-
[3]
Yang, B., Tian, H., Ren, J., Zhang, H., Klein, J., Bissyandé, T. F., Le Goues, C., and Jin, S. MORepair: Teaching LLMs to repair code via multi-objective fine-tuning. ACM Transactions on Software Engineering and Methodology, 2026. doi:10.48550/arXiv.2605.11951.
-
[4]
SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
Yang, J., Jimenez, C. E., Wettig, A., Lieret, K., Yao, S., Narasimhan, K. R., and Press, O. SWE-agent: Agent-computer interfaces enable automated software engineering. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.15793. doi:10.1145/3735129.
Appendix fragments (captured by the extractor in place of further references)
- Line-level allocator head: extract the hidden states corresponding to the span tokens, apply mean pooling across the span, and pass the result through a two-layer MLP (hidden dim 512, ReLU activation) to produce a scalar score.
- Reward-model training details: learning rate 1×10⁻⁵, batch size 64, 5 epochs, AdamW (β1 = 0.9, β2 = 0.999), hybrid loss weight λ_reg = 0.5.
- PPO training infrastructure: VERL (Sheng et al., 2024) for distributed PPO training with vLLM (Kwon et al.).
- Execution harness: applies the candidate patch to the repository, runs the test suite in an isolated Docker container, and reports success only if all relevant tests pass.
- Table 15. Impact of allocation temperature τ:
  τ | pass@1 | pass@4
  0.25 (sharp) | 39.4 | 43.2
  0.5 (default) | 40.7 | 44.3
  1.0 (smooth) | 39.8 | 43.5
  2.0 (uniform) | 38.6 | 41.2
- Table 16 (computational cost breakdown, A100 GPU-hours): rows not captured.
- Deployment guidance: provide calibrated confidence estimates so users can identify repairs requiring extra scrutiny; maintain audit trails (detailed logs of automated repairs) for accountability and debugging; adopt APR tools gradually, starting with low-risk fixes and expanding as trust is established.
- Reproducibility checklist: complete hyperparameter specifications (Appendix B), training data construction details (Section 3.1), evaluation protocol and metrics (Section 4), …
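The allocator head described in the appendix fragments (mean pooling over span hidden states, then a two-layer MLP with hidden dim 512 and ReLU) can be sketched with random stand-in weights; the model dimension and all weights below are illustrative, not the paper's trained parameters:

```python
import numpy as np

def span_score(hidden_states, w1, b1, w2, b2):
    """Scalar score for an edited span: mean-pool the span's token
    hidden states, then apply a two-layer MLP (hidden dim 512, ReLU),
    matching the architecture in the appendix fragments."""
    pooled = hidden_states.mean(axis=0)      # (d_model,)
    h = np.maximum(w1 @ pooled + b1, 0.0)    # ReLU hidden layer, (512,)
    return float(w2 @ h + b2)                # scalar span score

rng = np.random.default_rng(0)
d_model = 64  # illustrative; the real model dimension is not stated
hs = rng.normal(size=(7, d_model))           # hidden states for 7 span tokens
w1 = rng.normal(size=(512, d_model)) * 0.02  # random stand-in weights
b1 = np.zeros(512)
w2 = rng.normal(size=512) * 0.02
b2 = 0.0
score = span_score(hs, w1, b1, w2, b2)
```

In the paper these weights would be trained with the listed AdamW hyperparameters against execution-derived credit labels; here the score is just a well-typed scalar.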