Beyond Action Residuals: Real-World Robot Policy Steering via Bottleneck Latent Reinforcement Learning
Pith reviewed 2026-05-20 05:08 UTC · model grok-4.3
The pith
Perturbing a compact latent bottleneck steers pretrained robot policies more effectively than adding residuals to actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a plug-and-play variational information bottleneck module extracts a compact, task-aligned latent representation from observation embeddings. During online finetuning, reinforcement learning applies residual perturbations only to this latent while the pretrained base policy and action generator stay frozen; the perturbed latent is decoded to condition actions. This interface improves adaptation without updating policy weights and yields smoother behaviors than direct action residuals.
What carries the argument
A plug-and-play variational information bottleneck module that extracts a compact task-relevant latent from observation embeddings, allowing RL to apply residual perturbations that condition the frozen action generator.
If this is right
- Across eight simulation manipulation tasks, ZPRL improves sample efficiency and final performance relative to strong post-training baselines.
- On four real-world tasks, ZPRL raises average success rate by 33.7 percent over imitation base policies.
- Exploration behavior remains smoother than that produced by direct action-residual counterparts.
- Adaptation occurs without any weight updates to the pretrained base policy.
Where Pith is reading between the lines
- The same latent-interface approach could be tested on pretrained policies that use architectures other than flow matching.
- Focusing perturbations on a compact task-relevant latent may reduce the data needed for online adaptation compared with full action-space methods.
- Smoother exploration from latent perturbations could lower the risk of unsafe motions during real-robot fine-tuning.
Load-bearing premise
The variational information bottleneck produces a latent representation that remains sufficiently informative and stable for reinforcement learning perturbations without any updates to the frozen base policy weights or action generator.
What would settle it
If online reinforcement learning with ZPRL on the four real-world tasks fails to raise success rates above the imitation baseline or produces less smooth exploration than an action-residual method, the central claim would be falsified.
Figures
read the original abstract
Pretrained imitation policies have become a strong foundation for robot manipulation, but they often require online improvement to overcome execution errors, limited dataset coverage, and deployment mismatch. A central question is therefore how reinforcement learning (RL) should adapt policies after offline pretraining. Existing lightweight methods commonly apply residual corrections directly in action space, but this often leads to noisy and poorly structured exploration. In this work, we propose Z-Perturbation Reinforcement Learning (ZPRL), an approach that steers pretrained policies through a compact bottleneck latent rather than through policy weights or output actions. During offline training, we augment the policy with a plug-and-play variational information bottleneck (VIB) module to extract a task-relevant latent interface from observation embeddings. During online finetuning, the base policy is frozen and RL learns only a residual perturbation on this latent, whose decoded representation conditions the frozen action generator. We instantiate ZPRL on flow-matching policies and evaluate it on eight simulation tasks and four real-world tasks. Across diverse manipulation settings, ZPRL improves both sample efficiency and final performance over strong post-training baselines. In the real world, ZPRL improves the average success rate on four tasks by 33.7% over imitation base policies while producing smoother exploration behaviors than an action residual counterpart. These results suggest that a compact, task-aligned bottleneck latent provides an effective interface for online RL adaptation. More videos can be found at https://manutdmoon.github.io/ZPRL/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Z-Perturbation Reinforcement Learning (ZPRL), which augments a pretrained imitation policy with a plug-and-play variational information bottleneck (VIB) module during offline training to extract a compact task-relevant latent z. During online RL finetuning the base policy and flow-matching action generator remain frozen while RL learns only a residual perturbation in latent space; the perturbed z is decoded to condition the generator. The method is evaluated on eight simulation tasks and four real-world manipulation tasks, claiming improved sample efficiency and final performance over post-training baselines, including a 33.7% average success-rate gain over imitation policies in the real world and smoother exploration than direct action-residual methods.
Significance. If the central empirical claims hold after addressing the validation gaps, ZPRL would demonstrate that a frozen, offline-trained bottleneck latent can serve as a stable and effective interface for structured online adaptation of pretrained robot policies. This would be a practical contribution for real-world deployment where full policy updates are undesirable, and the smoother exploration behavior relative to action residuals could reduce wear and improve safety in physical settings.
major comments (2)
- [Abstract] Abstract: the reported 33.7% average success-rate improvement on four real-world tasks provides no information on the number of evaluation trials per task, standard deviation across runs, or any statistical significance test. Without these details the quantitative central claim cannot be properly assessed.
- [Method] Method description of online finetuning and VIB module: no quantitative check (KL divergence, reconstruction error, or mutual information) is reported comparing the distribution of latents produced by the final RL policy against the original imitation training distribution. Because both the VIB encoder and the action decoder remain frozen, any RL-induced shift outside the original support could silently degrade decoding fidelity; the absence of such a diagnostic leaves open the possibility that observed gains arise only from limited exploration that stays inside the training support rather than from a robust latent interface.
minor comments (1)
- [Experiments] The description of baseline implementations (action residual counterpart and other post-training methods) would benefit from explicit hyperparameter matching details to ensure fair comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details and diagnostics as suggested.
read point-by-point responses
-
Referee: [Abstract] Abstract: the reported 33.7% average success-rate improvement on four real-world tasks provides no information on the number of evaluation trials per task, standard deviation across runs, or any statistical significance test. Without these details the quantitative central claim cannot be properly assessed.
Authors: We agree that these statistical details are necessary for proper assessment of the central claim. In the revised manuscript, we will update the abstract and add a table in the experiments section reporting the number of evaluation trials per task (20 trials across 5 random seeds), standard deviations, and results from paired t-tests showing statistical significance of the reported improvements. revision: yes
-
Referee: [Method] Method description of online finetuning and VIB module: no quantitative check (KL divergence, reconstruction error, or mutual information) is reported comparing the distribution of latents produced by the final RL policy against the original imitation training distribution. Because both the VIB encoder and the action decoder remain frozen, any RL-induced shift outside the original support could silently degrade decoding fidelity; the absence of such a diagnostic leaves open the possibility that observed gains arise only from limited exploration that stays inside the training support rather than from a robust latent interface.
Authors: This concern is valid and highlights a potential gap in validating the latent interface. While our empirical results show performance gains and smoother exploration, we will add in the revision quantitative diagnostics including KL divergence values between the final RL latent distribution and the original imitation distribution, as well as reconstruction error metrics on held-out samples. These will confirm that perturbations remain within the supported range and support the robustness of the approach. revision: yes
Circularity Check
Empirical method with no load-bearing derivations or self-referential predictions
full rationale
The paper presents ZPRL as an engineering pipeline: offline VIB training on imitation data to produce a latent interface, followed by online RL that perturbs only that latent while keeping the base policy and action generator frozen. No equations, uniqueness theorems, or first-principles results are claimed that reduce to fitted quantities or prior self-citations by construction. Performance improvements are reported via direct empirical comparison on simulation and real-world tasks rather than any derived identity. The approach is therefore self-contained against external benchmarks and exhibits no circularity patterns.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We adopt a compact bottleneck latent on top of the observation embedding... min −I(z;a) + β I(z;c)
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
the bottleneck latent provides a more efficient control interface... dim(z) typically 16 or 32
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A careful examination of large behavior models for multitask dexterous manipulation,
T. L. Team, J. Barreiros, A. Beaulieuet al., “A careful examination of large behavior models for multitask dexterous manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2507.05331
-
[2]
π 0: A Vision-Language-Action Flow Model for General Robot Control,
K. Black, N. Brown, D. Driesset al., “π 0: A Vision-Language-Action Flow Model for General Robot Control,” inProc. Robot. Sci. Syst., LosAngeles, CA, USA, June 2025
work page 2025
-
[3]
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success,
M. J. Kim, C. Finn, and P. Liang, “Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success,” inProc. Robot. Sci. Syst., LosAngeles, CA, USA, June 2025
work page 2025
-
[4]
Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,
T. Z. Zhao, V . Kumar, S. Levineet al., “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware,” inProc. Robot. Sci. Syst., Daegu, Republic of Korea, July 2023
work page 2023
-
[5]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inProc. Adv. Neural Inf. Process. Syst., H. Larochelle, M. Ranzato, R. Hadsellet al., Eds., vol. 33. Curran Associates, Inc., 2020, pp. 6840–6851
work page 2020
-
[6]
Flow straight and fast: Learning to generate and transfer data with rectified flow,
X. Liu, C. Gong, and Q. Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” inProc. Int. Conf. Learn. Representations, 2023. [Online]. Available: https: //openreview.net/forum?id=XVjTT1nw5z
work page 2023
-
[7]
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,
C. Chi, S. Feng, Y . Duet al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion,” inProc. Robot. Sci. Syst., Daegu, Republic of Korea, July 2023
work page 2023
-
[8]
S. Park, Q. Li, and S. Levine, “Flow q-learning,” inProc. Int. Conf. Mach. Learn., ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsuet al., Eds., vol. 267. PMLR, 13–19 Jul 2025, pp. 48 104–48 127. [Online]. Available: https://proceedings.mlr.press/v267/park25f.html
work page 2025
-
[9]
H 3dp: Triply- hierarchical diffusion policy for visuomotor learning,
Y . Lu, Y . Tian, Z. Yuanet al., “H 3dp: Triply- hierarchical diffusion policy for visuomotor learning,” inProc. Int. Conf. Learn. Representations, 2026. [Online]. Available: https://openreview.net/forum?id=Q1CP0iAmOb
work page 2026
-
[10]
Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning,
J. Luo, C. Xu, J. Wuet al., “Precise and dexterous robotic manipulation via human-in-the-loop reinforcement learning,”Sci. Robot., 2025
work page 2025
-
[11]
Sime: Enhancing policy self-improvement with modal-level exploration,
Y . Jin, J. Lv, W. Yuet al., “Sime: Enhancing policy self-improvement with modal-level exploration,” inProc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2025, pp. 9792–9799
work page 2025
-
[12]
Soe: Sample-efficient robot policy self- improvement via on-manifold exploration,
Y . Jin, J. Lv, H. Xueet al., “Soe: Sample-efficient robot policy self- improvement via on-manifold exploration,” 2025. [Online]. Available: https://arxiv.org/abs/2509.19292
-
[13]
Diffusion policy policy optimization,
A. Z. Ren, J. Lidard, L. L. Ankileet al., “Diffusion policy policy optimization,” inProc. Int. Conf. Learn. Representations, 2025. [Online]. Available: https://openreview.net/forum?id=mEpqHvbD2h
work page 2025
-
[14]
Reinflow: Fine-tuning flow matching policy with online reinforcement learning,
T. Zhang, C. Yu, S. Suet al., “Reinflow: Fine-tuning flow matching policy with online reinforcement learning,” inProc. Adv. Neural Inf. Process. Syst., 2025. [Online]. Available: https: //openreview.net/forum?id=ACagRwCCqu
work page 2025
-
[15]
Z. Yuan, T. Wei, L. Guet al., “Hermes: Human-to-robot embodied learning from multi-source motion data for mobile dexterous manipulation,” 2025. [Online]. Available: https://arxiv.org/abs/2508. 20085
work page 2025
-
[16]
Rl-100: Performant robotic manipulation with real-world reinforcement learning,
K. Lei, H. Li, D. Yuet al., “Rl-100: Performant robotic manipulation with real-world reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2510.14830
-
[17]
π RL: Online rl fine-tuning for flow-based vision-language-action models,
K. Chen, Z. Liu, T. Zhanget al., “π RL: Online rl fine-tuning for flow-based vision-language-action models,” 2026. [Online]. Available: https://arxiv.org/abs/2510.25889
-
[18]
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
P. Intelligence, A. Amin, R. Anicetoet al., “π ∗ 0.6: a vla that learns from experience,” 2025. [Online]. Available: https://arxiv.org/abs/2511.14759
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[19]
Behavior Transform- ers: Cloning k modes with one stone,
N. M. Shafiullah, Z. Cui, A. A. Altanzayaet al., “Behavior Transform- ers: Cloning k modes with one stone,” inProc. Adv. Neural Inf. Process. Syst., S. Koyejo, S. Mohamed, A. Agarwalet al., Eds., vol. 35. Curran Associates, Inc., 2022, pp. 22 955–22 968
work page 2022
-
[20]
Policy decorator: Model- agnostic online refinement for large policy model,
X. Yuan, T. Mu, S. Taoet al., “Policy decorator: Model- agnostic online refinement for large policy model,” inProc. Int. Conf. Learn. Representations, 2025. [Online]. Available: https: //openreview.net/forum?id=e5jGTEiJMT
work page 2025
-
[21]
From imitation to refinement - residual rl for precise assembly,
L. Ankile, A. Simeonov, I. Shenfeldet al., “From imitation to refinement - residual rl for precise assembly,” inProc. IEEE Int. Conf. Robot. Autom., 2025, pp. 01–08
work page 2025
-
[22]
Residual off-policy rl for finetuning behavior cloning policies,
L. Ankile, Z. Jiang, R. Duanet al., “Residual off-policy rl for finetuning behavior cloning policies,” 2025. [Online]. Available: https://arxiv.org/abs/2509.19301
-
[23]
Residual reinforcement learning for robot control,
T. Johannink, S. Bahl, A. Nairet al., “Residual reinforcement learning for robot control,” inProc. IEEE Int. Conf. Robot. Autom., 2019, pp. 6023–6029
work page 2019
-
[24]
Steering Your Diffusion Policy with Latent Space Reinforcement Learning,
A. Wagenmaker, Y . Zhang, M. Nakamotoet al., “Steering Your Diffusion Policy with Latent Space Reinforcement Learning,” inProc. Conf. Robot Learn.PMLR, 2025, pp. 258–282
work page 2025
-
[25]
Deep Variational Information Bottleneck,
A. A. Alemi, I. Fischer, J. V . Dillonet al., “Deep Variational Information Bottleneck,” inProc. Int. Conf. Learn. Representations,
-
[26]
Available: https://openreview.net/forum?id=HyxQzBceg
[Online]. Available: https://openreview.net/forum?id=HyxQzBceg
-
[27]
Dynamical movement primitives: Learning attractor models for motor behaviors,
A. J. Ijspeert, J. Nakanishi, H. Hoffmannet al., “Dynamical movement primitives: Learning attractor models for motor behaviors,”Neural Comput., vol. 25, no. 2, pp. 328–373, 02 2013. [Online]. Available: https://doi.org/10.1162/NECO a 00393
-
[28]
Probabilistic movement primitives,
A. Paraschos, C. Daniel, J. R. Peterset al., “Probabilistic movement primitives,” inProc. Adv. Neural Inf. Process. Syst., C. Burges, L. Bottou, M. Wellinget al., Eds., vol. 26. Curran Associates, Inc., 2013. [Online]. Available: https://proceedings.neurips.cc/paper files/paper/2013/file/e53a0a2978c28872a4505bdb51db06dc-Paper.pdf
work page 2013
-
[29]
Da-mmp: Learning coordinated and accurate throwing with dynamics-aware motion manifold primitives,
C. Chu and H. Xu, “Da-mmp: Learning coordinated and accurate throwing with dynamics-aware motion manifold primitives,” 2026. [Online]. Available: https://arxiv.org/abs/2509.23721
-
[30]
RT-1: Robotics Transformer for Real-World Control at Scale
A. Brohan, N. Brown, J. Carbajalet al., “Rt-1: Robotics transformer for real-world control at scale,” 2022. [Online]. Available: https: //arxiv.org/abs/2212.06817
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations,
Y . Ze, G. Zhang, K. Zhanget al., “3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations,” inProc. Robot. Sci. Syst., Delft, Netherlands, July 2024
work page 2024
-
[32]
Planning with diffusion for flexible behavior synthesis,
M. Janner, Y . Du, J. Tenenbaumet al., “Planning with diffusion for flexible behavior synthesis,” inProc. Int. Conf. Mach. Learn.PMLR, 2022, pp. 9902–9915
work page 2022
-
[33]
Vitas: Visual tactile soft fusion contrastive learning for visuomotor learning,
Y . Tian, S. Cheng, T. Weiet al., “Vitas: Visual tactile soft fusion contrastive learning for visuomotor learning,” 2026. [Online]. Available: https://arxiv.org/abs/2602.11643
-
[34]
robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
Y . Zhu, J. Wong, A. Mandlekaret al., “robosuite: A modular simulation framework and benchmark for robot learning,” 2020. [Online]. Available: https://arxiv.org/abs/2009.12293
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[35]
What matters in learning from offline human demonstrations for robot manipulation,
A. Mandlekar, D. Xu, J. Wonget al., “What matters in learning from offline human demonstrations for robot manipulation,” in Proc. Conf. Robot Learn., ser. Proceedings of Machine Learning Research, A. Faust, D. Hsu, and G. Neumann, Eds., vol. 164. PMLR, 08–11 Nov 2022, pp. 1678–1690. [Online]. Available: https://proceedings.mlr.press/v164/mandlekar22a.html
work page 2022
-
[36]
DROID: A Large-Scale In- The-Wild Robot Manipulation Dataset,
A. Khazatsky, K. Pertsch, S. Nairet al., “DROID: A Large-Scale In- The-Wild Robot Manipulation Dataset,” inProc. Robot. Sci. Syst., Delft, Netherlands, July 2024
work page 2024
-
[37]
Open x-embodiment: Robotic learning datasets and RT-x models,
Q. Vuong, S. Levine, H. R. Walkeet al., “Open x-embodiment: Robotic learning datasets and RT-x models,” inTowards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL2023, 2023. [Online]. Available: https://openreview.net/forum?id=zraBtFgxT0
work page 2023
-
[38]
Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation,
C. Li, R. Zhang, J. Wonget al., “Behavior-1k: A benchmark for embodied ai with 1,000 everyday activities and realistic simulation,” inProc. Conf. Robot Learn.PMLR, 2023, pp. 80–93
work page 2023
-
[39]
Dynaguide: Steering diffusion polices with active dynamic guidance,
M. Du and S. Song, “Dynaguide: Steering diffusion polices with active dynamic guidance,” inProc. Adv. Neural Inf. Process. Syst., 2025. [Online]. Available: https://openreview.net/forum?id=XOw7Yf8qN3
work page 2025
-
[40]
arXiv preprint arXiv:2512.02834 , year=
S. Yang, Y . Zhang, H. Heet al., “Steering vision-language-action models as anti-exploration: A test-time scaling approach,” 2025. [Online]. Available: https://arxiv.org/abs/2512.02834
-
[41]
R. S. Sutton, A. G. Bartoet al.,Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1
work page 1998
-
[42]
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
S. Levine, A. Kumar, G. Tuckeret al., “Offline reinforcement learning: Tutorial, review, and perspectives on open problems,” 2020. [Online]. Available: https://arxiv.org/abs/2005.01643
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[43]
K. Lei, Z. He, C. Luet al., “Uni-o4: Unifying online and offline deep reinforcement learning with multi-step on-policy optimization,” inProc. Int. Conf. Learn. Representations, 2024. [Online]. Available: https://openreview.net/forum?id=tbFBh3LMKi
work page 2024
-
[44]
H. Li, K. Lei, S. Zanget al., “Failure-Aware RL: Reliable Offline- to-Online Reinforcement Learning with Self-Recovery for Real-World Manipulation,”arXiv e-prints, p. arXiv:2601.07821, Jan. 2026
-
[45]
Jump-start reinforcement learning,
I. Uchendu, T. Xiao, Y . Luet al., “Jump-start reinforcement learning,” inProc. Int. Conf. Mach. Learn.PMLR, 2023, pp. 34 556–34 583
work page 2023
-
[46]
Efficient online reinforcement learning fine-tuning need not retain offline data,
Z. Zhou, A. Peng, Q. Liet al., “Efficient online reinforcement learning fine-tuning need not retain offline data,” inProc. Int. Conf. Learn. Representations, 2025. [Online]. Available: https: //openreview.net/forum?id=HN0CYZbAPw
work page 2025
-
[47]
Efficient online reinforcement learning with offline data,
P. J. Ball, L. Smith, I. Kostrikovet al., “Efficient online reinforcement learning with offline data,” inProc. Int. Conf. Mach. Learn.PMLR, 2023, pp. 1577–1594
work page 2023
-
[48]
T. Silver, K. Allen, J. Tenenbaumet al., “Residual policy learning,”
-
[49]
[Online]. Available: https://arxiv.org/abs/1812.06298
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
Residual Learning From Demonstration: Adapting DMPs for Contact-Rich Manipulation,
T. Davchev, K. S. Luck, M. Burkeet al., “Residual Learning From Demonstration: Adapting DMPs for Contact-Rich Manipulation,”IEEE Robot. Autom. Lett., vol. 7, no. 2, pp. 4488–4495, 2022
work page 2022
-
[51]
From prior to pro: Efficient skill mastery via distribution contractive rl finetuning,
Z. Sun and S. Song, “From prior to pro: Efficient skill mastery via distribution contractive rl finetuning,” 2026. [Online]. Available: https://arxiv.org/abs/2603.10263
-
[52]
Reinforcement learning with action chunking,
Q. Li, Z. Zhou, and S. Levine, “Reinforcement learning with action chunking,” inProc. Adv. Neural Inf. Process. Syst., 2025. [Online]. Available: https://openreview.net/forum?id=XUks1Y96NR
work page 2025
-
[53]
Prior-guided diffusion planning for offline reinforcement learning,
D. Ki, J. Oh, S.-W. Shimet al., “Prior-guided diffusion planning for offline reinforcement learning,” inProc. Adv. Neural Inf. Process. Syst., 2025. [Online]. Available: https://openreview.net/forum? id=lC4WKmTScD
work page 2025
-
[54]
Denoising diffusion implicit models,
J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” inProc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https://openreview.net/forum?id=St1giarCHLP
work page 2021
-
[55]
Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative,
C. He, X. Liu, G. M. S. Campset al., “Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative,” inProc. Int. Conf. Learn. Representations, 2026. [Online]. Available: https://openreview.net/forum?id=PL0tJOfm7I
work page 2026
-
[56]
Soft actor-critic algorithms and applications,
T. Haarnoja, A. Zhou, K. Hartikainenet al., “Soft actor-critic algorithms and applications,” 2019. [Online]. Available: https://arxiv.org/abs/1812. 05905
work page 2019
-
[57]
Randomized ensembled double q-learning: Learning fast without a model,
X. Chen, C. Wang, Z. Zhouet al., “Randomized ensembled double q-learning: Learning fast without a model,” inProc. Int. Conf. Learn. Representations, 2021. [Online]. Available: https: //openreview.net/forum?id=AY8zfZm0tDd
work page 2021
-
[58]
Manipulators and manipulation in high dimensional spaces,
V . Kumar, “Manipulators and manipulation in high dimensional spaces,” Ph.D. dissertation, University of Washington, Seattle, 2016. [Online]. Available: https://digital.lib.washington.edu/researchworks/handle/1773/ 38104
work page 2016
-
[59]
Meta- world+: An improved, standardized, RL benchmark,
R. McLean, E. Chatzaroulas, L. McCutcheonet al., “Meta- world+: An improved, standardized, RL benchmark,” inProc. Adv. Neural Inf. Process. Syst., 2025. [Online]. Available: https: //openreview.net/forum?id=1de3azE606
work page 2025
-
[60]
Drm: Mastering visual reinforcement learning through dormant ratio minimization,
G. Xu, R. Zheng, Y . Lianget al., “Drm: Mastering visual reinforcement learning through dormant ratio minimization,” in Proc. Int. Conf. Learn. Representations, 2024. [Online]. Available: https://openreview.net/forum?id=MSe8YFbhUE
work page 2024
-
[61]
Accelerating reinforcement learning with learned skill priors,
K. Pertsch, Y . Lee, and J. Lim, “Accelerating reinforcement learning with learned skill priors,” inProc. Conf. Robot Learn., ser. Proceedings of Machine Learning Research, J. Kober, F. Ramos, and C. Tomlin, Eds., vol. 155. PMLR, 16–18 Nov 2021, pp. 188–204. [Online]. Available: https://proceedings.mlr.press/v155/pertsch21a.html
work page 2021
-
[62]
RL Token: Bootstrapping Online RL with Vision-Language-Action Models,
C. Xu, J. T. Springenberg, M. Equiet al., “RL Token: Bootstrapping Online RL with Vision-Language-Action Models,” 2026. [Online]. Available: https://www.pi.website/research/rlt
work page 2026
-
[63]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
L. McInnes, J. Healy, and J. Melville, “Umap: Uniform manifold approximation and projection for dimension reduction,” 2020. [Online]. Available: https://arxiv.org/abs/1802.03426
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[64]
A well-conditioned estimator for large- dimensional covariance matrices,
O. Ledoit and M. Wolf, “A well-conditioned estimator for large- dimensional covariance matrices,”J. Multivar. Anal., vol. 88, no. 2, pp. 365–411, 2004. [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0047259X03000964
work page 2004
-
[65]
On the generalised distance in statistics,
P. C. Mahalanobis, “On the generalised distance in statistics,” inProc. Natl. Inst. Sci. India, vol. 12, 1936, pp. 49–55
work page 1936
-
[66]
Scikit-learn: Machine learning in Python,
F. Pedregosa, G. Varoquaux, A. Gramfortet al., “Scikit-learn: Machine learning in Python,”J. Mach. Learn. Res., vol. 12, pp. 2825–2830, 2011
work page 2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.