Polar: Agentic RL on Any Harness at Scale

Binfeng Xu; Hao Zhang; Shaokun Zhang; Songyang Han; Mingjie Liu; Jian Hu; Shizhe Diao; Zhenghui Jin; Yunheng Zou; Michael Demoret; Jan Kautz; Yi Dong

arxiv: 2605.24220 · v1 · pith:SZ5BGYO7new · submitted 2026-05-22 · 💻 cs.DC

Pith reviewed 2026-06-30 14:35 UTC · model grok-4.3

classification 💻 cs.DC

keywords: agentic RL · rollout framework · black-box harness · trajectory reconstruction · SWE-Bench · asynchronous training · GRPO · language agents

The pith

Polar lets any agent harness run scalable RL by proxying LLM calls and rebuilding token-faithful trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Polar introduces a rollout framework that decouples agent harnesses from RL training so existing harnesses can be used without porting or modification. The system proxies LLM API calls inside the harness, records the interactions at token level, and reconstructs complete trajectories that trainers can consume asynchronously. This design improves compute utilization on long-running agent workloads while remaining agnostic to the specific harness, infrastructure, or RL algorithm. On software-engineering tasks, the approach yields measurable gains on SWE-Bench Verified when training a 4B model with simple GRPO across four different coding harnesses.

Core claim

Polar treats the agent harness as a black box: it proxies LLM API calls, records token-level model interactions, and reconstructs token-faithful trajectories for training. Each rollout node handles runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous service endpoints that independent trainers can consume at scale. Using this mechanism with GRPO, the framework improves Qwen3.5-4B by 22.6, 4.8, 0.6 and 6.2 points on SWE-Bench Verified with the Codex, Claude Code, Qwen Code and Pi harnesses respectively.
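
The core claim leans on GRPO as the training algorithm. As a minimal sketch of the group-relative advantage usually associated with GRPO, assuming one scalar outcome reward per rollout and one group of rollouts per task: the function name `group_relative_advantages` and the `eps` constant are illustrative choices, not taken from the paper, and implementations differ on details such as whether to divide by the standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts of the same task.

    rewards: outcome rewards for one group (e.g. 1.0 if the patch passed, else 0.0).
    eps: small constant (an assumption here) that avoids division by zero
         when every rollout in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of four rollouts on one task: two passed, two failed.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

A group in which every rollout gets the same reward yields all-zero advantages, so it contributes no gradient signal; this is one reason per-harness pass rates matter for how much a model can learn.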

What carries the argument

Black-box proxy of LLM API calls plus token-level trajectory reconstruction that turns an arbitrary harness into a source of training data without internal changes.
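
To make the proxy idea concrete, here is a deliberately simplified, in-process sketch. It assumes the backend can return token ids and log-probabilities for each call; `RecordingProxy`, `CallRecord`, and `fake_backend` are hypothetical names for illustration. The system described in the paper sits at a provider-compatible API boundary rather than a Python function call, so this only shows the record-then-forward pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical backend signature: prompt token ids -> (response ids, logprobs).
Backend = Callable[[List[int]], Tuple[List[int], List[float]]]

@dataclass
class CallRecord:
    prompt_ids: List[int]
    response_ids: List[int]
    logprobs: List[float]

@dataclass
class RecordingProxy:
    backend: Backend
    log: Dict[str, List[CallRecord]] = field(default_factory=dict)

    def complete(self, session_id: str, prompt_ids: List[int]) -> List[int]:
        # Forward the call unchanged, then store exactly what the model saw and produced.
        response_ids, logprobs = self.backend(prompt_ids)
        record = CallRecord(list(prompt_ids), list(response_ids), list(logprobs))
        self.log.setdefault(session_id, []).append(record)
        return response_ids

def fake_backend(prompt_ids: List[int]) -> Tuple[List[int], List[float]]:
    # Stand-in for an inference server: emits one token with a dummy logprob.
    return [prompt_ids[-1] + 1], [-0.1]

proxy = RecordingProxy(fake_backend)
out = proxy.complete("session-1", [1, 2, 3])
```

The harness only ever sees the return value of `complete`, so it needs no changes; the training data comes from `proxy.log`.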

If this is right

Existing coding harnesses can be used for RL without rewriting them as RL environments.
Rollout nodes can be scaled independently of trainers for better utilization on long-running tasks.
The same framework supports both online RL and offline data generation over custom harnesses.
Simple algorithms like GRPO become viable across multiple harnesses without harness-specific tuning.
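
The decoupling in the list above can be sketched with a queue between rollout workers and a trainer. This is an assumption-laden toy using `asyncio`: the worker and trainer functions, the batch rule, and the placeholder reward are all illustrative and do not reflect Polar's actual service endpoints.

```python
import asyncio

async def rollout_worker(tasks, out_queue, group_size):
    # Each task yields one group of rollouts; the sleep stands in for a long agent run.
    for task in tasks:
        group = []
        for _ in range(group_size):
            await asyncio.sleep(0)
            group.append({"task": task, "reward": 0.0})  # placeholder reward
        await out_queue.put(group)

async def trainer(out_queue, batch_groups, total_groups):
    # The trainer only steps once it holds a full batch of evaluated groups.
    steps, batch = [], []
    for _ in range(total_groups):
        batch.append(await out_queue.get())
        if len(batch) == batch_groups:
            steps.append(len(batch))  # placeholder for one policy update
            batch = []
    return steps

async def run(num_workers=2, tasks_per_worker=4, group_size=4, batch_groups=2):
    queue = asyncio.Queue()
    workers = [
        asyncio.create_task(
            rollout_worker([f"task-{i}-{j}" for j in range(tasks_per_worker)],
                           queue, group_size))
        for i in range(num_workers)
    ]
    steps = await trainer(queue, batch_groups, num_workers * tasks_per_worker)
    await asyncio.gather(*workers)
    return steps
```

Calling `asyncio.run(run())` returns one entry per trainer step. Changing `num_workers` alters how quickly groups arrive without touching the trainer, which is the property the list describes.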

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could let researchers train agents on closed-source or proprietary harnesses where source access is unavailable.
If reconstruction fidelity holds, the same proxy pattern might apply to non-coding agent domains that already have mature harnesses.
Decoupling could reduce duplication of effort in the community by letting one harness serve both evaluation and RL training pipelines.

Load-bearing premise

Proxying the LLM calls inside an arbitrary harness and reconstructing trajectories from those calls preserves every training signal required for effective RL.

What would settle it

A side-by-side comparison on the same harness and model where native RL integration produces higher final performance or faster convergence than Polar-mediated training.

Figures

Figures reproduced from arXiv: 2605.24220 by Binfeng Xu, Hao Zhang, Jan Kautz, Jian Hu, Michael Demoret, Mingjie Liu, Shaokun Zhang, Shizhe Diao, Songyang Han, Yi Dong, Yunheng Zou, Zhenghui Jin.

**Figure 1.** Polar architecture overview. Polar runs an existing agent harness inside an isolated runtime and places a model API proxy between the harness and the inference server. The proxy forwards model calls, records token-level request and response data, and reconstructs RL trajectories, while rollout gateways asynchronously handle runtime prewarming, harness execution, evaluation, and trainer callbacks. This deco…

**Figure 2.** Polar uses the model API proxy as the rollout boundary. Traditional rollout frameworks usually require the agent or harness logic to be rewritten behind a framework-owned environment API. This makes the trainer depend on harness-specific integration code and can miss details of the native execution path. Polar instead keeps the harness unchanged and places a provider-compatible proxy at the LLM API boundar…

**Figure 3.** Gateway-level asynchronous staging in Polar. A gateway separates runtime initialization, ready buffering, harness execution, and post-run trajectory and evaluation work into isolated worker pools. Runtime preparation and evaluator prewarm proceed off the critical path, so CPU-heavy runtime setup and long-tail evaluation do not block active GPU-bound agent runs.

**Figure 4.** Trajectory reconstruction example. The visualized session contains a three-turn main agent that undergoes one harness-level context compaction and spawns one subagent. The per-request builder keeps each captured model call as an independent trace. Prefix merging instead recovers append-only conversation chains where valid, while compaction and subagent boundaries naturally form separate chains. Within each…

**Figure 5.** Polar improves GPU utilization across the rollout-training boundary. (a) An asynchronous RL pipeline enabled by Polar services. The rollout server keeps running inference with the existing policy, while the trainer steps only once it has received a full batch of evaluated trajectory groups. (b) A span of 3 training steps under the same workload and topology; prefix merging emits fewer trainer updates than per-request …

**Figure 6.** SWE-Gym GRPO training curves. Each panel shows the per-step outcome reward, equivalent to rollout pass@1, for one of four evaluated coding harnesses. RL improves reward across harnesses, with clear gains on execution paths involving complex prompting, orchestration, or unfamiliar tool schemas.
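
The prefix-merging strategy named in the Figure 4 caption can be illustrated with a small sketch. It assumes each captured call is a `(prompt_ids, response_ids)` pair in arrival order, and that a call belongs to an existing chain only if that chain's full token sequence is an exact prefix of the call's prompt. The `Chain` class, the loss-mask convention (1 for model-generated tokens, 0 otherwise), and the longest-prefix tie-break are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

@dataclass
class Chain:
    tokens: List[int] = field(default_factory=list)
    loss_mask: List[int] = field(default_factory=list)

    def extend(self, context: Sequence[int], response: Sequence[int]) -> None:
        # Context (prompts, tool output) is not trained on; responses are.
        self.tokens += list(context) + list(response)
        self.loss_mask += [0] * len(context) + [1] * len(response)

def prefix_merge(calls: Sequence[Tuple[List[int], List[int]]]) -> List[Chain]:
    chains: List[Chain] = []
    for prompt, response in calls:
        best = None
        for chain in chains:
            n = len(chain.tokens)
            is_prefix = n <= len(prompt) and prompt[:n] == chain.tokens
            if is_prefix and (best is None or n > len(best.tokens)):
                best = chain
        if best is None:
            # No chain matches: e.g. a compacted context or a subagent's fresh history.
            best = Chain()
            chains.append(best)
        best.extend(prompt[len(best.tokens):], response)
    return chains

calls = [
    ([1, 2], [3]),                 # main agent, turn 1
    ([1, 2, 3, 4], [5]),           # turn 2 appends tool output 4
    ([9, 9], [7]),                 # unrelated prefix: starts a new chain
    ([1, 2, 3, 4, 5, 6], [8]),     # turn 3 continues the first chain
]
chains = prefix_merge(calls)
```

Here four calls collapse into two chains, whereas a per-request builder would emit four separate traces. The exact-prefix test only works on the token ids the proxy recorded; comparing re-tokenized text could silently break the match.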

Original abstract

Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, porting these harnesses into RL environment interfaces remains difficult and often loses important training signals. We bridge this gap with Polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. Polar treats the agent harness as a black box: it proxies LLM API calls, records token-level model interactions, and reconstructs token-faithful trajectories for training. Each rollout node efficiently manages runtime prewarming, agent execution, trajectory reconstruction, and evaluation in parallel, exposing asynchronous service endpoints that can be consumed by independent trainers at scale. This decoupled design makes Polar agnostic to agent harnesses, training infrastructure, and RL algorithms while improving compute utilization for long-running agent workloads. We validate Polar by training agents on software-engineering tasks with popular coding harnesses. Using simple GRPO, Polar improves Qwen3.5-4B by 22.6, 4.8, 0.6 and 6.2 points on SWE-Bench Verified with the Codex, Claude Code, Qwen Code and Pi harnesses, respectively. We further demonstrate Polar for offline data generation over custom harnesses and ablate trajectory reconstruction strategies. Polar is a rewrite of its predecessor, ProRL Agent, and has been registered as one of the NeMo Gym environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Polar gives a practical black-box way to run RL on top of existing agent harnesses by proxying calls and reconstructing trajectories, with uneven but positive gains on SWE-Bench.

Polar treats the harness as a black box: it proxies LLM API calls, records the interactions, and rebuilds token-faithful trajectories for training. The decoupled rollout nodes handle prewarming, execution, reconstruction, and evaluation in parallel, then expose async endpoints for separate trainers. This setup is meant to work with any harness and any RL algorithm.

What is new is the explicit black-box proxy plus reconstruction pattern that avoids rewriting the harness itself. They test it on four coding harnesses (Codex, Claude Code, Qwen Code, Pi) using simple GRPO on Qwen3.5-4B and report lifts of 22.6, 4.8, 0.6, and 6.2 points on SWE-Bench Verified. They also show offline data generation over custom harnesses and ablate reconstruction strategies. The registration as a NeMo Gym environment is a small practical signal.

The results are the strongest part. Gains are positive on all four harnesses, though they range from 0.6 to 22.6 points, which supports the claim that the proxying preserves enough signal for GRPO to work. The parallel node design directly targets the scaling issues with long-running agent workloads.

The gains vary sharply by harness, so the benefit is not uniform. This work rewrites the authors' earlier ProRL Agent system, so the advance is incremental rather than foundational. The abstract gives limited experimental detail, though it states that the full paper ablates trajectory reconstruction strategies.

This paper is for engineers and researchers who already have complex agent harnesses and want to add RL without major rewrites. A reader focused on practical scaling of agent training would get concrete design ideas and benchmark numbers. It deserves a serious referee because it addresses a real integration bottleneck with working results on standard tasks.

I would send it to peer review.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Polar, a rollout framework for scalable asynchronous RL over arbitrary agent harnesses. It treats harnesses as black boxes by proxying LLM API calls, recording token-level interactions, and reconstructing token-faithful trajectories for training. Using GRPO, it reports improvements of 22.6, 4.8, 0.6, and 6.2 points on SWE-Bench Verified for Qwen3.5-4B across the Codex, Claude Code, Qwen Code, and Pi harnesses, respectively, along with demonstrations of offline data generation and ablations of reconstruction strategies.

Significance. If the empirical results hold under rigorous evaluation, Polar would meaningfully lower the barrier to applying RL to complex, long-running agent systems by decoupling harness logic from training infrastructure. The multi-harness validation and explicit ablations directly test the viability of the black-box proxy approach, which is a practical strength.

major comments (1)

[Results] Results section (and abstract): the reported point improvements on SWE-Bench Verified are presented without error bars, number of runs, statistical significance tests, baseline details, or controls for harness-specific variance. These omissions are load-bearing because the gains constitute the primary evidence that proxying and trajectory reconstruction preserve sufficient training signal for effective GRPO.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of Polar to lower barriers for RL on complex agent systems. We address the major comment on results presentation below and commit to revisions that strengthen the empirical evidence.

Referee: [Results] Results section (and abstract): the reported point improvements on SWE-Bench Verified are presented without error bars, number of runs, statistical significance tests, baseline details, or controls for harness-specific variance. These omissions are load-bearing because the gains constitute the primary evidence that proxying and trajectory reconstruction preserve sufficient training signal for effective GRPO.

Authors: We agree that the absence of these details weakens the primary claims. In the revised manuscript we will: (1) report the number of independent training runs (minimum three seeds per harness), (2) add error bars (standard deviation across runs) to all reported point improvements, (3) include statistical significance tests (e.g., paired t-tests or Wilcoxon) comparing Polar-trained agents against the corresponding baselines, (4) expand baseline descriptions to clarify the exact harness configurations and evaluation protocols used, and (5) add explicit controls or discussion of harness-specific variance (e.g., by reporting per-harness variance and any normalization steps). These changes will appear in both the results section and the abstract.

Revision: yes
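
For readers who want to see what items (2) and (3) would involve, here is a small sketch of a mean and standard deviation summary and a paired t-statistic, assuming scores can be paired per seed between a baseline and a trained model. The helper names and the numbers in the example are toy placeholders chosen for easy arithmetic; they are not results from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return (mean, sample standard deviation) for a list of per-seed scores."""
    return mean(scores), stdev(scores)

def paired_t_statistic(before, after):
    """Return (t, degrees_of_freedom) for paired per-seed scores.

    Raises ValueError on mismatched lengths; a zero standard deviation of the
    differences would raise ZeroDivisionError and needs separate handling.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

# Toy placeholder scores for three seeds (NOT from the paper).
toy_before = [10.0, 12.0, 11.0]
toy_after = [13.0, 14.0, 15.0]
t_value, dof = paired_t_statistic(toy_before, toy_after)
```

With only three seeds the test has two degrees of freedom, so even a large t-statistic should be read cautiously. Turning the statistic into a p-value requires a t-distribution table or a statistics library.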

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents an engineering framework (Polar) for asynchronous RL rollouts over arbitrary agent harnesses, validated empirically via performance gains on the external SWE-Bench Verified benchmark across four distinct harnesses plus ablations of reconstruction strategies. No mathematical derivations, first-principles predictions, or fitted parameters are claimed; the central results are direct measurements against independent external benchmarks rather than quantities defined internally by the framework. The incidental note that Polar rewrites prior work (ProRL Agent) is not load-bearing for any claim and does not invoke uniqueness theorems or ansatzes from self-citations. The work is therefore self-contained against external benchmarks with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that black-box proxying and trajectory reconstruction are sufficient to capture training signals; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: Trajectory reconstruction from proxied LLM API calls inside an arbitrary harness is faithful and lossless for RL purposes.
This premise is required for the black-box design to deliver usable training data.
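
One way to probe this assumption in practice is a consistency check between a reconstructed chain and the raw calls that produced it. The sketch below is a hypothetical test, assuming append-only chains and calls given as `(prompt_ids, response_ids)` pairs; `is_token_faithful` is an illustrative name, and passing it is necessary rather than sufficient evidence of lossless reconstruction.

```python
from typing import List, Sequence, Tuple

def is_token_faithful(chain_tokens: List[int],
                      calls: Sequence[Tuple[List[int], List[int]]]) -> bool:
    """Check that every recorded call is reproduced exactly by the chain.

    Each call's prompt + response must equal the chain's leading tokens, and the
    longest call must account for the whole chain (no invented trailing tokens).
    """
    longest = 0
    for prompt, response in calls:
        seen = list(prompt) + list(response)
        if chain_tokens[:len(seen)] != seen:
            return False
        longest = max(longest, len(seen))
    return longest == len(chain_tokens)

good_calls = [([1, 2], [3]), ([1, 2, 3, 4], [5])]
ok = is_token_faithful([1, 2, 3, 4, 5], good_calls)

# A call whose history was re-tokenized differently (token 7 instead of 3) fails.
drifted_calls = [([1, 2], [3]), ([1, 2, 7, 4], [5])]
drifted = is_token_faithful([1, 2, 3, 4, 5], drifted_calls)
```

A check like this catches token drift between turns, but it cannot detect signals that never pass through the LLM API boundary in the first place.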

pith-pipeline@v0.9.1-grok · 5817 in / 1422 out tokens · 57120 ms · 2026-06-30T14:35:09.899657+00:00 · methodology

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages · 3 internal anchors

[1] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv preprint arXiv:2501.12948, 2025.

[2] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402..., 2024.

[3] SGLang: Efficient Execution of Structured Language Model Programs. https://arxiv.org/abs/2312.07104, 2023.
