Support-Constrained RL Enables Real-World Policy Improvement without Real-World Experience

Abhishek Gupta; Anusha Nagabandi; Mustafa Mukadam; Raymond Yu; William Huey

arxiv: 2606.27475 · v1 · pith:7NH2UPSXnew · submitted 2026-06-25 · 💻 cs.RO · cs.LG


Pith reviewed 2026-06-29 01:54 UTC · model grok-4.3


keywords: support-constrained RL, real-to-sim-to-real, dexterous manipulation, flow steering, policy improvement, multi-fingered robots, simulation constraints, robotic hands


The pith

Support-constrained RL in simulation improves real-world robot policies without further real-world experience.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SCORE, a framework that performs reinforcement learning entirely in simulation to refine policies first trained on real robot data. It constrains the simulated actions to those the real-data policy can already generate, using flow steering to avoid unsafe behaviors from simulation mismatches. On eight dexterous multi-fingered manipulation tasks, this raises average success from 37.8 percent to 89.9 percent and shortens the steps needed for success. A sympathetic reader would care because it offers a low-cost path to better robot skills after the initial real-world data collection, without needing more hardware time or distillation.

Core claim

By constraining reinforcement learning in simulation to the support of a generative policy pretrained on real data, implemented through flow steering, the optimized policies transfer to hardware and deliver higher success rates plus faster task completion across eight real-world dexterous manipulation tasks, all without real-world RL or changes to the base policy.

What carries the argument

The support constraint via flow steering, which restricts actions during simulated RL to the support of the generative policy trained on real data, that is, to actions the base policy can already produce.
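To make the idea of steering inside the base policy's support concrete, here is a minimal toy sketch. It is not the authors' implementation: `base_velocity`, `flow_action`, and `steer` are hypothetical names, the frozen "base policy" is a hand-written two-mode flow rather than a network trained on real data, and random search over the noise stands in for the RL algorithm. The only point it illustrates is that when the optimizer chooses the flow's input noise `z` and the flow itself stays frozen, every resulting action is one the base policy could already have produced.

```python
import numpy as np


def base_velocity(a, t, obs):
    """Hypothetical frozen velocity field of a pretrained flow policy.

    Each action dimension is pulled toward the nearer of two modes,
    obs - 0.5 or obs + 0.5, standing in for behaviors seen in real data.
    """
    target = obs + 0.5 * np.sign(a - obs + 1e-12)
    return (target - a) / max(1.0 - t, 1e-3)


def flow_action(z, obs, n_steps=10):
    """Map a noise vector z to an action by Euler-integrating the frozen flow."""
    a = np.array(z, dtype=float)
    for k in range(n_steps):
        t = k / n_steps
        a = a + base_velocity(a, t, obs) / n_steps
    return a


def steer(obs, reward_fn, n_candidates=64, seed=0):
    """Pick the noise whose decoded action scores best under a simulated reward.

    Random search is an assumption made for brevity; it replaces the RL
    algorithm. The flow is never modified, only its input noise is chosen.
    """
    rng = np.random.default_rng(seed)
    zs = rng.standard_normal((n_candidates, obs.shape[0]))
    actions = np.stack([flow_action(z, obs) for z in zs])
    rewards = np.array([reward_fn(a) for a in actions])
    best = int(np.argmax(rewards))
    return zs[best], actions[best]


# The simulated reward prefers an action far outside the base policy's modes.
obs = np.array([0.1, -0.2])
goal = obs + np.array([2.0, -2.0])
z, action = steer(obs, lambda act: -np.sum((act - goal) ** 2))

# The steered action is the in-support mode closest to the goal, not the goal.
print("steered action:", action)
```

In this toy, the simulated reward would favor an action that the base policy never produces, yet the steered action stays on one of the base policy's modes. That is the behavior the support constraint is meant to guarantee.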

If this is right

Policy improvement after initial real data collection can occur entirely in simulation.
The process works with sparse rewards and requires no distillation step.
The base policy stays unchanged while a separate improved policy is learned.
Simulation becomes usable for safe real-to-sim-to-real transfer on manipulation tasks when actions are limited to the real policy support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same support constraint might apply to other robot skills where simulation gaps cause problems, such as assembly or locomotion.
The method could lower the total real-world data needed across repeated policy refinements.
Combining the constraint with different base policy training approaches might show how broadly the gains hold.

Load-bearing premise

That actions kept inside the real policy's support cannot exploit simulation inaccuracies badly enough to block transfer, while still leaving room for useful policy gains.

What would settle it

Running unconstrained RL in simulation on the same tasks and finding that its policies achieve comparable or higher real-world success rates than the constrained SCORE versions would show the support constraint is not required.

Figures

Figures reproduced from arXiv: 2606.27475 by Abhishek Gupta, Anusha Nagabandi, Mustafa Mukadam, Raymond Yu, William Huey.

**Figure 1.** SCORE framework. SCORE starts from any real-world flow matching policy, which may have been trained on successes, play data, failures, and retry behaviors. The flow policy is brought into simulation, where SCORE learns to improve the policy using flow steering, a support-constrained RL algorithm. Finally, our training framework enables direct deployment of the steering policy in the real world, preserving …

**Figure 2.** Toy Example. The real-world base policy avoids barriers, but performs roundabout trajectories that sometimes miss the goal. In simulation, unconstrained RL exploits dynamics mismatch to move directly towards the goal, but this fails in the real world. As shown by the red arrows, distributional regularization allows for small deviations from the base policy, refining imprecisions but preserving slow motion …

**Figure 3.** Real-world tasks. We evaluate on eight contact-rich dexterous manipulation tasks spanning grasping, pouring, pushing, reorientation, and object placement. SCORE-DSRL and SCORE, respectively. DSRL performs pure latent steering: it optimizes only the flow noise z, so every action lies within the model-induced set Abase(o) above, imposing a hard model-induced support constraint. RFS additionally adds a small …

**Figure 4.** Average real-world success rate across all 8 tasks. SCORE and SCORE-DSRL outperform all baselines, while FPO and RialTo learn dangerous actions, and Residual-RL is constrained to suboptimal behaviors.

**Figure 5.** Speed improvement of SCORE and Residual-RL over Base, averaged across 8 tasks. Motivated by our discussion in Section 3.2, we now empirically investigate how support constraints overcome the limitations of unconstrained and distributionally constrained optimization. Does unconstrained optimization in simulation result in dangerous behavior? To test our hypothesis about unconstrained optimization, we optim…

**Figure 6.** Distributional Constraints Introduce a Tradeoff Between Improvement and Transferability. The left plot shows the simulated performance (circles) and real world performance (diamonds) of RialTo policies trained with 5 different levels of BC regularization during BC-PPO. A value of 10 leads to collapse in simulation, while a larger value of 100 learns a dangerous strategy far from the base policy distributio…

**Figure 7.** Data-size ablation. More data enables stronger support-constrained improvement. Can more demonstrations improve steering? We train base policies on the Cube Pinch task with varying numbers of demonstrations and apply SCORE on each

**Figure 10.** Adaptation experiment. SCORE adapts the base policy to shifted settings only when its support contains compatible behavior. (Left) Steering the bottle-grasp prior toward carrot grasping improves real-world success from 22% to 67% by reusing compatible pinches already inside the prior, while the cup-grasp prior fails, as it lacks the behavior. (Right) With distractor cubes added, SCORE improves over the br…

**Figure 8.** Cube Pinch retry data. Retry data leaves the base policy unchanged, but lets SCORE improve from 40% to 100% success after simulation steering.

**Figure 9.** Play data ablation. With only right-side coverage, Base and SCORE fail on the left; adding left play data lets SCORE improve. Can SCORE adapt to unseen objects and distractors? Our previous experiments test the ability of SCORE to improve policies in fixed environments, but real world tasks are constantly changing. Below, we train SCORE in a simulation environment unseen by the base policy, then deploy th…

**Figure 11.** Asymmetric actor-critic ablation. Using an asymmetric critic improves sample efficiency and final simulated success while keeping the actor observation and deployment policy unchanged. Evaluation is performed over 4096 environments every 40M steps. … when the post-training environment is significantly out of distribution. This suggests that pretraining should aim not just for the strongest base policy, but …

**Figure 12.** Percent improvement in time to completion of SCORE and Residual-RL over Base. SCORE improves substantially over the base policy and beats Residual-RL, the nearest baseline, on all tasks.

**Figure 13.** Real-world Robot Setup.

**Figure 14.** Retry Data Collection. For the Cube Pinch and Bottle Grasp tasks, drops and misses followed by retries are captured in the dataset to encourage learning retry behavior. (B.3 Data Collection) We collect real-world demonstrations using an Apple Vision Pro teleoperation interface. The system tracks the operator’s hand motion and end-effector motion using keypoints, and retargets these motions to the Franka arm…

**Figure 15.** FPO failure modes. FPO can exploit simulator-specific dynamics and drift outside the real-world policy support, producing unsafe or non-transferable behaviors.

**Figure 16.** RialTo failure modes. BC regularization can limit the amount of drift from the support of the real-world policy, but can also limit improvement and retain imprecise base-policy behaviors.

**Figure 17.** SCORE successful rollouts. SCORE improves task performance while maintaining real-world-feasible behaviors within the base policy support.

**Figure 18.** SCORE failure modes. SCORE failures are primarily caused by contact sensitivity and task precision demands, rather than unsafe support drift.

**Figure 19.** Multi-task base policy failure modes. Although multi-task training expands the behavior support of the base policy, direct deployment can suffer from interference between task-specific behaviors. The multi-task base policy sometimes applies behavior modes from the wrong task, such as Credit Card Pick-like high grasps during object grasping or Bottle Grasp-like motions during Credit Card Pick, leading to u…

**Figure 20.** Multi-task SCORE enables cross-task behavior reuse. After steering a shared multi-task prior, SCORE can select task-appropriate behaviors while also reusing useful strategies across tasks. The top two rows show successful Bottle Grasp and Credit Card Pick executions. The bottom rows show Cube Pinch under broader object placements, where multi-task SCORE can reuse Credit Card Pick-like sliding behavior and…

**Figure 21.** Visual representation of Proposition E.1. While πreal successfully completes the task, πsim exploits a transition that quickly leads to the goal in simulation, but causes the policy to get stuck when deployed in the real world. Proof Sketch: We proceed by constructing a discrete MDP with 5 states, as shown in …

**Figure 22.** Visual representation of Proposition E.2. π adds a residual of ϵ to πreal, but this is not sufficient to recover the optimal policy. For a small enough ϵ and large enough δ, distributional constraints ensure realizability. In practice, however, too much regularization prevents meaningful improvement, while too little allows the policy to exploit the dynamics gap. In many settings, there is no level of r…

**Figure 23.** Visual representation of how SCORE addresses the limitations of distributional constraints shown in …

Original abstract

Robots trained on real world data tend to be imprecise, slow, and brittle to perturbations. Improving these policies with reinforcement learning (RL) is an appealing alternative, but this process often requires expensive training in the real world. Performing policy improvement in simulation instead provides a far cheaper alternative, but unconstrained RL in simulation can exploit contact and dynamics mismatches, resulting in unsafe behaviors that do not transfer to hardware. Common forms of regularization can furthermore limit improvement by overconstraining to an imperfect behavior prior. In this work, we propose Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that constrains RL in simulation to the support of a generative policy pretrained on real data. We instantiate this constraint through flow steering, restricting SCORE to actions the base policy can already produce, which ensures transferable behaviors while maximizing policy improvement. Improving a policy with SCORE requires minimal effort: it learns from sparse rewards, avoids distillation, and leaves the base policy untouched. Across eight real-world dexterous multi-fingered robotic manipulation tasks, SCORE improves average success rate from 37.8% to 89.9%, compared to 59.5% for the best baseline, and reaches success in 36.8% fewer steps than the base policy. Ultimately, through extensive experiments and ablations, we show that simulation can substantially improve real-world manipulation policies when policy optimization is appropriately constrained, introducing a new paradigm for real-to-sim-to-real policy improvement. Videos and code are available at https://weirdlabuw.github.io/score/.
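The headline numbers in the abstract can be unpacked with a few lines of arithmetic. The sketch below uses only the three average success rates and the step reduction quoted above; the derived quantities (percentage-point gains, reduction in failure rate, and equivalent speedup) are our own restatements, not figures reported by the authors.

```python
# Average real-world success rates (%) quoted in the abstract.
base, score, best_baseline = 37.8, 89.9, 59.5
fewer_steps = 0.368  # SCORE reaches success in 36.8% fewer steps than the base policy

gain_over_base = score - base               # percentage-point gain over the base policy
gain_over_baseline = score - best_baseline  # percentage-point gain over the best baseline
failure_reduction = 1 - (100 - score) / (100 - base)  # share of base failures removed
speedup = 1 / (1 - fewer_steps)             # same step reduction expressed as a speedup

print(f"gain over base policy:   {gain_over_base:.1f} points")
print(f"gain over best baseline: {gain_over_baseline:.1f} points")
print(f"failure-rate reduction:  {failure_reduction:.1%}")
print(f"equivalent speedup:      {speedup:.2f}x")
```

Read this way, SCORE gains 52.1 points over the base policy and 30.4 points over the best baseline, and it removes most of the base policy's failures.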

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SCORE uses flow steering to keep sim RL inside the support of a real-data policy and reports large transferable gains on eight dexterous tasks.


The main takeaway is that SCORE constrains RL in simulation to the support of a generative policy trained on real data, using flow steering, and this produces big improvements that transfer back to hardware. They show average success rising from 37.8% to 89.9% across eight real multi-fingered manipulation tasks, beating the best baseline at 59.5% and cutting steps by 36.8%.

What stands out as new is the concrete instantiation of the support constraint via flow steering in a real-to-sim-to-real loop for dexterous work. The paper does well by keeping the method lightweight: sparse rewards only, no distillation, and the base policy left untouched. The empirical coverage on eight tasks gives the central claim some weight, and releasing code and videos helps.

The soft spots are mostly about missing detail rather than outright flaws. The abstract gives no error bars or ablation numbers, so the reliability of those deltas is not yet visible. The load-bearing assumption, that staying in support blocks sim exploitation while still allowing real improvement, needs clear evidence in the full methods and results; other constraint choices could be compared to test whether flow steering is essential. Nothing in the description suggests circularity or internal contradiction.

This paper is aimed at robotics researchers who already have some real data and want a low-effort way to refine policies in simulation without the usual transfer failures. A reader working on manipulation RL would get practical ideas if the experiments check out.

It deserves a serious referee because the reported gains are large, the problem is real, and the approach is simple enough to test. I would send it for peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces Support-Constrained Off-Domain REinforcement (SCORE), a real-to-sim-to-real framework that performs RL in simulation while constraining actions via flow steering to the support of a generative policy pretrained on real data. This is claimed to enable substantial policy improvement on real hardware without real-world RL experience, without distillation, and without altering the base policy. The central empirical result is that across eight real-world dexterous multi-fingered manipulation tasks, SCORE raises average success rate from 37.8% (base policy) to 89.9%, outperforming the best baseline at 59.5%, while also reaching success in 36.8% fewer steps; the work states that extensive experiments and ablations support the approach.

Significance. If the reported results and ablations hold, the work is significant for robotics because it offers a concrete, low-effort method to leverage simulation for real policy improvement while mitigating sim-reality exploitation. The provision of code and videos is a positive factor for reproducibility and verification of the claimed gains.

minor comments (2)

[Abstract] The quantitative claims (e.g., 89.9% success, 36.8% fewer steps) would be strengthened by a brief indication of trial counts, error bars, or statistical testing even at the abstract level.
The manuscript states that the base policy is left untouched and only sparse rewards are used; a minor clarification on how the final deployed policy is obtained (e.g., whether it is the improved sim policy or a combination) would aid clarity.
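As an illustration of the kind of error bar the first comment asks for, the sketch below computes a Wilson score interval for a binomial success rate. The trial counts are hypothetical, since neither the abstract nor this review states how many rollouts were run per task; `wilson_interval` is our own helper, not something taken from the paper's code.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


# Hypothetical counts: 18 successes in 20 real-world rollouts of a single task.
low, high = wilson_interval(18, 20)
print(f"observed 90.0%, 95% interval roughly {low:.1%} to {high:.1%}")
```

With only a few dozen rollouts per task the interval is wide, which is why reporting trial counts next to the success rates matters.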

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the work, recognition of its potential significance for robotics, and recommendation for minor revision. We are pleased that the reproducibility elements (code and videos) were noted favorably.

Circularity Check

0 steps flagged

No significant circularity


The paper introduces an empirical real-to-sim-to-real RL method (SCORE) that constrains simulation rollouts to the support of a real-data generative policy via flow steering. All load-bearing claims consist of reported success-rate deltas and step-count reductions measured on eight physical tasks; these rest on external experimental outcomes rather than any derivation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No equations, ansatzes, or uniqueness theorems appear in the provided text that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The core idea of support constraint and flow steering is presented as a modeling choice rather than a derived quantity.




[1] [1]

P. Yin, T. Westenbroek, Z. Zhang, J. Tran, I. Dagnino, E. Shilamkar, N. Mbiziwo-Tiapo, S. Bagaria, X. Liu, G. Mullins, A. Kolobov, and A. Gupta. Emergent dexterity via diverse resets and large-scale reinforcement learning, 2026. URL https://arxiv.org/abs/2603.15789

work page arXiv 2026

[2] [2]

Aljalbout, J

E. Aljalbout, J. Xing, A. Romero, I. Akinola, C. R. Garrett, E. Heiden, A. Gupta, T. Hermans, Y . Narang, D. Fox, D. Scaramuzza, and F. Ramos. The reality gap in robotics: Challenges, solutions, and best practices, 2025. URLhttps://arxiv.org/abs/2510.20808

work page arXiv 2025

[3] [3]

Torne, A

M. Torne, A. Simeonov, Z. Li, A. Chan, T. Chen, A. Gupta, and P. Agrawal. Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation, 2024. URL https://arxiv.org/abs/2403.03949

work page arXiv 2024

[4] [4]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models, 2020. URL https: //arxiv.org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2020

[5] [5]

C. Chi, Z. Xu, S. Feng, E. Cousineau, Y . Du, B. Burchfiel, R. Tedrake, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion, 2024. URL https://arxiv.org/ abs/2303.04137

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

G. Yan, J. Zhu, Y . Deng, S. Yang, R.-Z. Qiu, X. Cheng, M. Memmel, R. Krishna, A. Goyal, X. Wang, and D. Fox. Maniflow: A general robot manipulation policy via consistency flow training, 2025. URLhttps://arxiv.org/abs/2509.01819

work page arXiv 2025

[7]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. V...

[8]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization, 2024. URL https://arxiv.org/abs/2409.00588

[9]

D. McAllister, S. Ge, B. Yi, C. M. Kim, E. Weber, H. Choi, H. Feng, and A. Kanazawa. Flow matching policy gradients, 2025. URL https://arxiv.org/abs/2507.21053

[10]

S. Park, Q. Li, and S. Levine. Flow q-learning, 2025. URL https://arxiv.org/abs/2502.02538

[11]

A. Wagenmaker, M. Nakamoto, Y. Zhang, S. Park, W. Yagoub, A. Nagabandi, A. Gupta, and S. Levine. Steering your diffusion policy with latent space reinforcement learning, 2025. URL https://arxiv.org/abs/2506.15799

[12]

E. Su, T. Westenbroek, A. Nagabandi, and A. Gupta. Rfs: Reinforcement learning with residual flow steering for dexterous manipulation, 2026. URL https://arxiv.org/abs/2602.01789

[13]

M. M. Hong, J. Zhang, A. Nagabandi, and A. Gupta. Tmrl: Diffusion timestep-modulated pretraining enables exploration for efficient policy finetuning, 2026. URL https://arxiv.org/abs/2605.12236

[14]

B. Yi, H. Choi, H. G. Singh, X. Huang, T. E. Truong, C. Sferrazza, Y. Ma, R. Duan, P. Abbeel, G. Shi, K. Liu, and A. Kanazawa. Flow policy gradients for robot control, 2026. URL https://arxiv.org/abs/2602.02481

[15]

Z.-H. Yin, C. Wang, L. Pineda, F. Hogan, K. Bodduluri, A. Sharma, P. Lancaster, I. Prasad, M. Kalakrishnan, J. Malik, M. Lambeta, T. Wu, P. Abbeel, and M. Mukadam. Dexteritygen: Foundation controller for unprecedented dexterity, 2025. URL https://arxiv.org/abs/2502.04307

[16]

M. Memmel, A. Wagenmaker, C. Zhu, P. Yin, D. Fox, and A. Gupta. Asid: Active exploration for system identification in robotic manipulation, 2024. URL https://arxiv.org/abs/2404.12308

[17]

A. Kumar, Z. Fu, D. Pathak, and J. Malik. Rma: Rapid motor adaptation for legged robots, 2021. URL https://arxiv.org/abs/2107.04034

[19]

X. Liu, H. Wang, and L. Yi. Dexndm: Closing the reality gap for dexterous in-hand rotation via joint-wise neural dynamics model, 2025. URL https://arxiv.org/abs/2510.08556

[20]

Z. Mandi, Y. Hou, D. Fox, Y. Narang, A. Mandlekar, and S. Song. Dexmachina: Functional retargeting for bimanual dexterous manipulation, 2025. URL https://arxiv.org/abs/2505.24853

[21]

Z. Chen, S. Chen, E. Arlaud, I. Laptev, and C. Schmid. Vividex: Learning vision-based dexterous manipulation from human videos, 2025. URL https://arxiv.org/abs/2404.15709

[22]

Y. Qin, Y.-H. Wu, S. Liu, H. Jiang, R. Yang, Y. Fu, and X. Wang. Dexmv: Imitation learning for dexterous manipulation from human videos, 2022. URL https://arxiv.org/abs/2108.05877

[23]

Y. Qin, B. Huang, Z.-H. Yin, H. Su, and X. Wang. Dexpoint: Generalizable point cloud reinforcement learning for sim-to-real dexterous manipulation, 2022. URL https://arxiv.org/abs/2211.09423

[24]

K. Kedia, T. G. W. Lum, J. Bohg, and C. K. Liu. Simtoolreal: An object-centric policy for zero-shot dexterous tool manipulation, 2026. URL https://arxiv.org/abs/2602.16863

[25]

Z. Xu, R. Gong, M. V. Minniti, A. S. Gundogdu, E. Rosen, K. Sivakumar, R. Yan, Z. Wang, D. Deng, P. Stone, X. Zhang, and K. Schmeckpeper. Expertgen: Scalable sim-to-real expert policy learning from imperfect behavior priors, 2026. URL https://arxiv.org/abs/2603.15956

[26]

B. Eysenbach, S. Asawa, S. Chaudhari, S. Levine, and R. Salakhutdinov. Off-dynamics reinforcement learning: Training for transfer with domain classifiers, 2021. URL https://arxiv.org/abs/2006.13916

[27]

X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):1–20, July 2021. ISSN 1557-7368. doi:10.1145/3450626.3459670. URL http://dx.doi.org/10.1145/3450626.3459670

[28]

P. Dan, K. Kedia, A. Chao, E. W. Duan, M. A. Pace, W.-C. Ma, and S. Choudhury. X-sim: Cross-embodiment learning via real-to-sim-to-real, 2025. URL https://arxiv.org/abs/2505.07096

[29]

H. Niu, S. Sharma, Y. Qiu, M. Li, G. Zhou, J. Hu, and X. Zhan. When to trust your simulator: Dynamics-aware hybrid offline-and-online reinforcement learning. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=zXE8iFOZKw

[30]

Y. Wu, G. Tucker, and O. Nachum. Behavior regularized offline reinforcement learning, 2019. URL https://arxiv.org/abs/1911.11361

[31]

A. Kumar, A. Zhou, G. Tucker, and S. Levine. Conservative q-learning for offline reinforcement learning, 2020. URL https://arxiv.org/abs/2006.04779

[32]

A. Singh, A. Kumar, Q. Vuong, Y. Chebotar, and S. Levine. Offline rl with realistic datasets: Heteroskedasticity and support constraints, 2022. URL https://arxiv.org/abs/2211.01052

[33]

Y. Mao, H. Zhang, C. Chen, Y. Xu, and X. Ji. Supported trust region optimization for offline reinforcement learning, 2023. URL https://arxiv.org/abs/2311.08935

[34]

S. Zhang, O. So, H. M. S. Ahmad, E. Y. Yu, M. Cleaveland, M. Black, and C. Fan. Reform: Reflected flows for on-support offline rl via noise manipulation, 2026. URL https://arxiv.org/abs/2602.05051

[35]

D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016

[36]

J. Skalse, N. H. R. Howe, D. Krasheninnikov, and D. Krueger. Defining and characterizing reward hacking. arXiv preprint arXiv:2209.13085, 2022

[37]

J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. In Robotics: Science and Systems, 2018

[38]

Z. Wu, W. Lian, V. V. Unhelkar, M. Tomizuka, and S. Schaal. Learning dense rewards for contact-rich manipulation tasks. In 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021

[39]

W. Huey, H. Wang, A. Wu, Y. Artzi, and S. Choudhury. Imitation learning from a single temporally misaligned video, 2025. URL https://arxiv.org/abs/2502.05397

[40]

L. Ke, S. Choudhury, M. Barnes, W. Sun, G. Lee, and S. Srinivasa. Imitation learning as f-divergence minimization, 2020. URL https://arxiv.org/abs/1905.12888

[41]

J. Ho and S. Ermon. Generative adversarial imitation learning, 2016. URL https://arxiv.org/abs/1606.03476

[42]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747

[43]

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017. URL https://arxiv.org/abs/1707.06347

[44]

K. Shaw, A. Agarwal, and D. Pathak. Leap hand: Low-cost, efficient, and anthropomorphic hand for robot learning, 2023. URL https://arxiv.org/abs/2309.06440

[45]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

A. Khazatsky, K. Pertsch, S. Nair, A. Balakrishna, S. Dasari, S. Karamcheti, S. Nasiriany, M. K. Srirama, L. Y. Chen, K. Ellis, P. D. Fagan, J. Hejna, M. Itkina, M. Lepert, Y. J. Ma, P. T. Miller, J. Wu, S. Belkhale, S. Dass, H. Ha, A. Jain, A. Lee, Y. Lee, M. Memmel, S. Park, I. Radosavovic, K. Wang, A. Zhan, K. Black, C. Chi, K. B. Hatch, S. Lin, J. ...

[46]

Y. Lin, A. S. Wang, G. Sutanto, A. Rai, and F. Meier. Polymetis. https://facebookresearch.github.io/fairo/polymetis/, 2021

[47]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation, 2017. URL https://arxiv.org/abs/1612.00593

[48]

A. Jain, M. Zhang, K. Arora, W. Chen, M. Torne, M. Z. Irshad, S. Zakharov, Y. Wang, S. Levine, C. Finn, W.-C. Ma, D. Shah, A. Gupta, and K. Pertsch. Polaris: Scalable real-to-sim evaluations for generalist robot policies, 2025. URL https://arxiv.org/abs/2512.16881