SWE-TRACE: Optimizing Long-Horizon SWE Agents Through Rubric Process Reward Models and Heuristic Test-Time Scaling

Hao Han; Hongkai Chen; Jin Xie; Qingwen Ye; Weiquan Zhu; Xuehao Ma; ZhiLiang Long; Ziyao Zhang

arxiv: 2604.14820 · v1 · submitted 2026-04-16 · 💻 cs.SE

Pith reviewed 2026-05-10 10:12 UTC · model grok-4.3

keywords: software engineering agents · process reward models · test-time scaling · trajectory distillation · agentic reinforcement learning · long-horizon reasoning · heuristic pruning · SWE benchmarks

The pith

SWE-TRACE optimizes software engineering agents in three ways: it distills shortest-path trajectories, applies rubric process rewards for dense guidance, and reuses the reward model for efficient test-time pruning, raising resolution rates while cutting cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Software engineering agents must reason over long sequences of code edits and tests to resolve real issues, yet current systems waste tokens on inefficient paths, receive only weak end-of-task signals, and slow down when they try many options at once. The paper claims these problems share a common fix: curate training data around only the shortest verified successes, train with a rubric-based process reward model that scores every step through an auxiliary agent, and then apply that same model at inference time to drop poor candidate actions early. The result is presented as higher success on standard benchmarks together with large drops in tokens consumed and latency incurred. If the approach holds, agents could tackle larger codebases without the current explosion in compute per task.

Core claim

SWE-TRACE unifies data curation, reinforcement learning, and inference through three linked pieces. First, an LLM cascading procedure distills a 60K-instance SFT corpus of only token-efficient shortest-path trajectories via stepwise oracle verification. Second, a Memory-Augmented Agentic RL stage trains with a Rubric-Based Process Reward Model whose auxiliary Rubric-Agent supplies dense heuristic scores on intermediate steps. Third, the same PRM is reused directly to evaluate and prune action candidates during heuristic-guided test-time scaling. The claimed result is higher resolution rates at lower token and latency cost.
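The shortest-path filtering idea can be illustrated with a minimal sketch. It assumes several candidate trajectories exist per issue and that an oracle check has already marked each one as verified or not. The `Trajectory` type, its fields, and `select_shortest_verified` are hypothetical names chosen for illustration; the paper's actual cascading procedure is not described in enough detail here to reproduce.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one agent rollout (illustrative fields only)."""
    instance_id: str
    steps: Tuple[str, ...]
    token_count: int
    verified: bool  # assumed outcome of a stepwise oracle check


def select_shortest_verified(trajectories: Iterable[Trajectory]) -> Dict[str, Trajectory]:
    """Keep, per instance, the verified trajectory that uses the fewest tokens."""
    best: Dict[str, Trajectory] = {}
    for traj in trajectories:
        if not traj.verified:
            continue  # unverified rollouts never enter the corpus
        current = best.get(traj.instance_id)
        if current is None or traj.token_count < current.token_count:
            best[traj.instance_id] = traj
    return best


# Made-up candidates, purely for illustration.
candidates = [
    Trajectory("issue-1", ("read", "edit", "test"), 900, True),
    Trajectory("issue-1", ("read", "grep", "edit", "test", "test"), 2400, True),
    Trajectory("issue-1", ("edit",), 300, False),  # short but not verified
    Trajectory("issue-2", ("read", "edit"), 700, False),
]
corpus = select_shortest_verified(candidates)
```

In this toy run the 300-token rollout is rejected despite being shortest, because it was not verified, and `issue-2` contributes nothing to the corpus.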

What carries the argument

The Rubric-Based Process Reward Model carries the argument. Its auxiliary Rubric-Agent supplies dense, fine-grained heuristic scores on intermediate steps, which are meant to stabilize long-horizon reinforcement learning and to enable early pruning of weak actions at test time.

If this is right

Agents reach higher rates of resolving issues on standard SWE benchmarks.
Inference consumes substantially fewer tokens per task.
Latency falls because poor action candidates are pruned at each step instead of being sampled in parallel.
Training remains stable over long trajectories because dense process rewards replace sparse outcome signals.
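The difference between sparse outcome signals and dense process rewards can be made concrete with a small sketch. It assumes per-step rubric scores in [0, 1] and a simple weighted blend with a terminal outcome reward; the blend, the weights, and the function names are assumptions for illustration, not the paper's stated reward formulation.

```python
from typing import List, Sequence


def step_rewards(process_scores: Sequence[float], resolved: bool,
                 outcome_weight: float = 1.0,
                 process_weight: float = 0.1) -> List[float]:
    """Blend per-step rubric scores with a terminal outcome reward (assumed form)."""
    rewards = [process_weight * s for s in process_scores]
    if rewards:
        # The outcome reward arrives only on the final step.
        rewards[-1] += outcome_weight * (1.0 if resolved else 0.0)
    return rewards


def returns_to_go(rewards: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards at each step."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


# A failed three-step trajectory with made-up rubric scores.
scores = [0.8, 0.2, 0.9]
sparse = step_rewards(scores, resolved=False, process_weight=0.0)
dense = step_rewards(scores, resolved=False)
sparse_returns = returns_to_go(sparse)
dense_returns = returns_to_go(dense)
```

With outcome-only rewards, the failed trajectory yields zero signal at every step. With the process term switched on, the steps still receive distinct returns, which is the kind of intermediate guidance the claim relies on.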

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation-plus-reuse pattern could be applied to other long-sequence agent domains such as multi-step planning or scientific workflow agents.
Sharing one reward model between training and inference may reduce the usual gap between learned policy and deployed behavior.
If the oracle verification step works reliably, similar shortest-path filtering could shrink data needs in other sequential decision tasks.

Load-bearing premise

The rubric-based process reward model must deliver accurate dense feedback on every step without letting the agent exploit scoring flaws or shifting the training distribution away from useful behaviors.

What would settle it

A head-to-head run on SWE-bench would falsify the central optimization claim if the complete SWE-TRACE pipeline resolved no larger a fraction of issues, and used no fewer tokens and no less latency on average, than strong baselines that lack the rubric model or the distillation step.
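Read literally, this test is a conjunction of three comparisons, which the sketch below spells out. The `RunSummary` type, its field names, and the numbers are placeholders invented for illustration; they are not results from the paper, and a real comparison would also need significance testing across runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical aggregate of one benchmark run."""
    resolve_rate: float    # fraction of issues resolved
    avg_tokens: float      # mean tokens per issue
    avg_latency_s: float   # mean wall-clock seconds per issue


def claim_falsified(full: RunSummary, baseline: RunSummary) -> bool:
    """True when the full pipeline shows no gain on any of the three axes."""
    no_accuracy_gain = full.resolve_rate <= baseline.resolve_rate
    no_token_saving = full.avg_tokens >= baseline.avg_tokens
    no_latency_saving = full.avg_latency_s >= baseline.avg_latency_s
    return no_accuracy_gain and no_token_saving and no_latency_saving


# Placeholder numbers only.
baseline = RunSummary(resolve_rate=0.40, avg_tokens=30000.0, avg_latency_s=600.0)
no_better = RunSummary(resolve_rate=0.40, avg_tokens=31000.0, avg_latency_s=640.0)
cheaper = RunSummary(resolve_rate=0.40, avg_tokens=24000.0, avg_latency_s=640.0)
```

Under this reading, a run that only saves tokens (`cheaper`) does not falsify the claim, while a run that improves nothing (`no_better`) does.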

Figures

Figures reproduced from arXiv: 2604.14820 by Hao Han, Hongkai Chen, Jin Xie, Qingwen Ye, Weiquan Zhu, Xuehao Ma, ZhiLiang Long, Ziyao Zhang.

**Figure 1.** Overview of the Rubric-Based PRM and GRPO training with the Rubric-Based PRM. Body text from Section 4.1 (Memory-Augmented Long-Horizon Architecture) captured with the caption: Let the agent interact with a SWE environment for an instance x = (I, C, U), where I is the issue, C is the repository state, and U is the test suite. At step t, the raw interaction history is h_t = (I, o_0, a_0, o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t), where a_j is an action and o_j is the correspond…

**Figure 2.** RL training dynamics on SWE-BENCH VERIFIED. Rubric-conditioned RL converges to higher final performance with lower variance than execution-only RL, especially on the 30B backbone. Shaded regions indicate run-to-run variation. [Plot residue from the adjacent panel "Token Efficiency During RL Training": x-axis RL Training Step; y-axis Avg. Token Usage / Issue (k); series 4B Execution-only RL, 4B Rubric-PRM RL, 30B Execution-only RL, 30B Ru…]

**Figure 3.** Token-efficiency dynamics during RL training. Rubric-conditioned RL reduces average …

**Figure 4.** Latency–performance scaling for test-time inference under increasing rollout budgets.

**Figure 5.** Resolve rate by trajectory token-budget bin. The largest gains of SWE-TRACE appear …

**Figure 6.** Representative long-horizon case study on …

Original abstract

Resolving real-world software engineering (SWE) issues with autonomous agents requires complex, long-horizon reasoning. Current pipelines are bottlenecked by unoptimized demonstration data, sparse execution rewards, and computationally prohibitive inference scaling, which collectively exacerbate token bloat, reward hacking, and policy degradation. We present SWE-TRACE (Trajectory Reduction and Agentic Criteria Evaluation), a unified framework optimizing the SWE agent lifecycle across data curation, reinforcement learning (RL), and test-time inference. First, we introduce an LLM multi-task cascading method, utilizing stepwise oracle verification to distill a 60K-instance Supervised Fine-Tuning (SFT) corpus strictly biased toward token-efficient, shortest-path trajectories. Second, to overcome the instability of sparse outcome rewards, we design a Memory-Augmented Agentic RL pipeline featuring a Rubric-Based Process Reward Model (PRM). An auxiliary Rubric-Agent provides dense, fine-grained heuristic feedback on intermediate steps, guiding the model through long-horizon tasks. Finally, we bridge training and inference by repurposing the PRM for heuristic-guided Test-Time Scaling (TTS). By dynamically evaluating and pruning action candidates at each step, SWE-TRACE achieves superior search efficiency without the latency overhead of standard parallel sampling. Extensive experiments on standard SWE benchmarks demonstrate that SWE-TRACE significantly advances the state-of-the-art, maximizing resolution rates while drastically reducing both token consumption and inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SWE-TRACE puts together trajectory distillation, a rubric PRM for RL, and PRM-based pruning at test time into one pipeline for long-horizon SWE agents, with the main open question being how well the rubric actually works.

The paper's main contribution is SWE-TRACE, a three-part framework that first distills a 60K SFT corpus of shortest-path trajectories via multi-task cascading and oracle verification, then runs memory-augmented agentic RL with a rubric-based process reward model for dense step feedback, and finally reuses that PRM to prune action candidates during heuristic test-time scaling. The unified lifecycle view and the reuse of the PRM for inference pruning are the clearest new pieces; each draws on existing ideas but the specific combination for SWE agents is not a direct copy of prior work cited in the abstract. The experiments reportedly show higher resolution rates on standard benchmarks alongside large reductions in tokens and latency, which would be practically useful if the numbers hold up under scrutiny.

The data curation step looks reasonably grounded because it uses stepwise oracle checks to bias toward efficiency. The stress-test concern about rubric accuracy is fair and worth checking in the full text. If the paper only reports end-to-end gains without ablations on rubric construction, PRM validation against ground-truth step quality, or tests for reward hacking and distribution shift from the shortest-path data, then the central claims rest on an unverified assumption. The shortest-path bias could also reduce trajectory diversity and hurt robustness on harder or more varied tasks. The citation pattern follows the usual PRM and TTS references without obvious gaps.

This work is aimed at researchers building autonomous coding agents who care about token cost and long-horizon stability. Readers working on agentic RL or test-time methods will find the concrete pipeline worth examining. It has enough structure and benchmark claims to deserve peer review, though referees will need to press on the PRM validation details and statistical robustness of the reported gains.

Referee Report

3 major / 2 minor

Summary. The paper introduces SWE-TRACE, a unified framework for optimizing long-horizon SWE agents across data curation, RL, and inference. It distills a 60K-instance SFT corpus of token-efficient shortest-path trajectories via LLM multi-task cascading and oracle verification; employs a Memory-Augmented Agentic RL pipeline with a Rubric-Based Process Reward Model (PRM) where an auxiliary Rubric-Agent supplies dense heuristic feedback on intermediate steps; and repurposes the PRM for heuristic-guided Test-Time Scaling (TTS) to prune action candidates dynamically. The authors claim that experiments on standard SWE benchmarks show SWE-TRACE advances the state-of-the-art by maximizing resolution rates while drastically reducing token consumption and inference latency.

Significance. If the empirical claims hold, this work would be significant for autonomous software engineering agents by addressing data inefficiency, sparse outcome rewards, and prohibitive inference costs in long-horizon reasoning. The integration of oracle-distilled efficient trajectories, rubric-based dense process rewards, and their reuse for heuristic TTS offers a coherent pipeline that could improve both policy learning and deployment efficiency. The structured approach to biasing toward shortest paths and providing fine-grained feedback is a notable strength that merits further exploration if supported by rigorous validation.

major comments (3)

Abstract: The assertion of 'superior benchmark results' and 'significantly advances the state-of-the-art' is unsupported by any quantitative numbers, baseline comparisons, ablation studies, or error analysis. This is load-bearing for the central performance claim regarding resolution rates, token reduction, and latency.
Memory-Augmented Agentic RL pipeline (Rubric-Based Process Reward Model subsection): No details are given on rubric construction, the PRM training objective, validation against ground-truth step quality, or explicit checks for reward hacking and distribution shift from the biased shortest-path SFT corpus. This is critical because the central claim depends on the PRM supplying accurate, unbiased dense feedback across long trajectories; without it, the RL policy and subsequent TTS may degrade rather than improve performance.
Heuristic-guided Test-Time Scaling subsection: The mechanism for dynamically evaluating and pruning action candidates via the PRM is described only at a high level, with no specifics on pruning criteria or reported metrics on search efficiency versus standard parallel sampling. This undermines the claim of achieving superior search efficiency without latency overhead.

minor comments (2)

The abstract refers to 'standard SWE benchmarks' without naming them (e.g., SWE-bench), which would improve immediate clarity for readers.
A high-level diagram of the overall SWE-TRACE pipeline (data curation to RL to TTS) would enhance readability of the unified framework.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating revisions made to strengthen the paper.

Referee: Abstract: The assertion of 'superior benchmark results' and 'significantly advances the state-of-the-art' is unsupported by any quantitative numbers, baseline comparisons, ablation studies, or error analysis. This is load-bearing for the central performance claim regarding resolution rates, token reduction, and latency.

Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In the revised manuscript, we have updated the abstract to reference key empirical outcomes from our experiments, including specific improvements in resolution rates, token consumption, and latency relative to baselines, with explicit pointers to the relevant tables, figures, and ablation studies in the main text. This addresses the load-bearing nature of the claims without altering the abstract's brevity.
revision: yes

Referee: Memory-Augmented Agentic RL pipeline (Rubric-Based Process Reward Model subsection): No details are given on rubric construction, the PRM training objective, validation against ground-truth step quality, or explicit checks for reward hacking and distribution shift from the biased shortest-path SFT corpus. This is critical because the central claim depends on the PRM supplying accurate, unbiased dense feedback across long trajectories; without it, the RL policy and subsequent TTS may degrade rather than improve performance.

Authors: We acknowledge that additional methodological details are necessary to substantiate the PRM's role. We have expanded the Rubric-Based Process Reward Model subsection to include: the rubric construction methodology using SWE-specific heuristics derived from oracle trajectories; the PRM training objective as a supervised binary classification task on step-level quality labels; validation results demonstrating correlation with ground-truth step annotations on held-out data; and explicit analyses for reward hacking and distribution shift, including performance comparisons between the SFT-biased corpus and broader trajectory distributions. These revisions clarify how the PRM provides reliable dense feedback.
revision: yes

Referee: Heuristic-guided Test-Time Scaling subsection: The mechanism for dynamically evaluating and pruning action candidates via the PRM is described only at a high level, with no specifics on pruning criteria or reported metrics on search efficiency versus standard parallel sampling. This undermines the claim of achieving superior search efficiency without latency overhead.

Authors: We agree that the TTS mechanism requires more operational detail. In the revised Heuristic-guided Test-Time Scaling subsection, we now specify the pruning criteria (including PRM score thresholds and dynamic top-k selection rules) and report quantitative metrics on search efficiency, such as average candidates pruned per step, token usage per resolved task, and direct comparisons to parallel sampling baselines. These are supported by additional experimental results demonstrating the efficiency gains without increased latency.
revision: yes
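
The pruning criteria named in the last response, a PRM score threshold combined with top-k selection, can be sketched as follows. This is a minimal illustration under assumptions: the threshold and `top_k` are fixed rather than dynamic, the PRM is stood in for by a plain scoring callable, and `prune_candidates` and the action names are hypothetical, not taken from the paper.

```python
from typing import Callable, List, Sequence


def prune_candidates(candidates: Sequence[str],
                     score_fn: Callable[[str], float],
                     threshold: float = 0.5,
                     top_k: int = 2) -> List[str]:
    """Drop candidates scoring below `threshold`, then keep the `top_k` best."""
    if not candidates:
        return []
    scored = [(score_fn(c), c) for c in candidates]
    kept = [(s, c) for s, c in scored if s >= threshold]
    if not kept:
        # Fall back to the single best candidate so the agent can still act.
        kept = [max(scored, key=lambda sc: sc[0])]
    kept.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in kept[:top_k]]


# Stand-in for PRM scores on one step's action candidates (made-up values).
toy_scores = {
    "run_tests": 0.9,
    "edit_unrelated_file": 0.3,
    "open_failing_module": 0.7,
    "rewrite_whole_repo": 0.1,
}
survivors = prune_candidates(list(toy_scores), toy_scores.__getitem__)
strict = prune_candidates(list(toy_scores), toy_scores.__getitem__, threshold=0.95)
```

Only the surviving candidates would be expanded further, which is where any token and latency saving over sampling every branch in parallel would come from.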

Circularity Check

0 steps flagged

No circularity in SWE-TRACE derivation chain

full rationale

The paper presents an empirical framework involving data distillation via oracle verification, RL training with a rubric-based PRM, and heuristic TTS pruning. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided abstract or description that reduce any central claim to its own inputs by construction. The claims rest on benchmark experiments, which are independent of internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; rubric design, memory augmentation specifics, and scaling heuristics are likely sources of free parameters but remain unspecified.

pith-pipeline@v0.9.0 · 5576 in / 1092 out tokens · 37907 ms · 2026-05-10T10:12:11.065985+00:00 · methodology
