
arxiv: 2606.18594 · v1 · pith:5PTBAZZGnew · submitted 2026-06-17 · 💻 cs.RO · cs.AI

Benchmarking Action Spaces in Reinforcement Learning for Vision-based Robotic Manipulation

Pith reviewed 2026-06-26 21:21 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords reinforcement learning · action spaces · robotic manipulation · sim-to-real transfer · vision-based control · picking · pushing · motion smoothness

The pith

Joint velocity action spaces outperform pose increments, pose velocity, and joint position increments for vision-based robotic picking and pushing when policies transfer from simulation to real robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests four action space options in reinforcement learning policies for two standard manipulation tasks. All policies receive camera images as input and are first trained in simulation before direct deployment on physical hardware. Performance is measured by both final task success and the smoothness of the resulting motions. Joint velocity produces the highest success rates alongside the smoothest trajectories. The results supply direct guidance on which representation to select when moving from simulated training to real-world execution.

Core claim

Among the four representations tested, joint velocity yields the best combination of task completion rates and motion smoothness for both object picking and object pushing once policies move from simulation to the physical robot.

What carries the argument

Benchmark comparison of four action space formulations (pose increment, pose velocity, joint position increment, joint velocity) on vision-based picking and pushing with sim-to-real transfer.
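
To make the four formulations concrete, the sketch below shows one way each could turn a normalized policy action into a joint displacement over a single control step. It is an illustration under stated assumptions, not the paper's implementation: the toy planar 2-link arm, the link lengths, the control period `DT`, the `scale` parameter, the position-only "pose", and the Jacobian pseudoinverse mapping are all hypothetical choices made here.

```python
import numpy as np

DT = 0.05             # control period in seconds (assumed value)
LINKS = (0.30, 0.25)  # link lengths in metres of a toy planar arm (assumed)


def fk(q):
    """End-effector (x, y) position of the toy 2-link planar arm."""
    l1, l2 = LINKS
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])


def jacobian(q):
    """2x2 positional Jacobian d(fk)/dq of the toy arm."""
    l1, l2 = LINKS
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])


def joint_step(q, action, space, scale):
    """Joint displacement over one control step for a normalized action."""
    # Clip to [-1, 1], then rescale to the physical units of the chosen space.
    a = np.clip(np.asarray(action, dtype=float), -1.0, 1.0) * scale
    if space == "joint_position_increment":   # scale in rad per step
        return a
    if space == "joint_velocity":             # scale in rad/s
        return a * DT
    if space == "pose_increment":             # scale in m per step
        return np.linalg.pinv(jacobian(q)) @ a
    if space == "pose_velocity":              # scale in m/s
        return np.linalg.pinv(jacobian(q)) @ (a * DT)
    raise ValueError(f"unknown action space: {space}")
```

In this simplified view the pose-based spaces differ from the joint-based ones by a first-order inverse-kinematics step, and the velocity spaces differ from the increment spaces by a factor of `DT`. On real hardware the spaces also differ in which low-level controller consumes the command, which this sketch does not model.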

If this is right

  • Joint velocity produces smoother real-robot trajectories than the other three spaces.
  • Policies using joint velocity reach higher final success rates after sim-to-real transfer.
  • Action space choice measurably changes the reliability of sim-to-real transfer for vision-based tasks.
  • Practical selection of joint velocity can reduce the need for extra safety filtering or post-processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If joint velocity remains superior on additional tasks with similar dynamics, it could become a default choice for many vision-based manipulation problems.
  • The ranking might shift on robots whose kinematic properties differ substantially from the one used here.
  • Extending the same benchmark to tasks requiring precise force control could expose limits of velocity-based representations.

Load-bearing premise

The two chosen tasks together with the specific sim-to-real protocol are representative of vision-based manipulation problems in general.

What would settle it

Repeating the identical training and transfer protocol on a third task such as stacking blocks, or on a robot with a different joint configuration, and checking whether joint velocity still ranks first in success and smoothness.

Figures

Figures reproduced from arXiv: 2606.18594 by Abhishek Naik, A. Rupam Mahmood, Colin Bellinger, Homayoon Farrahi, Seyed Alireza Azimi.

Figure 1: MuJoCo state transition. action is the action produced by the policy. q′_t and v′_t are the target joint positions and velocities, respectively. q_t and v_t are the current joint positions and velocities. τ_t is the applied torque and a_t is the current acceleration. q_{t+1} and v_{t+1} are the updated joint positions and velocities. Actuator. In our work, we consider two actuation models that are described as follow…
Figure 3: PandaPickCuboid simulation training results consisting of episodic return, success rate, and episodic length. 10 independent runs are shown with …
Figure 4: Real-world setup and wrist-mounted camera.
Figure 6: Simulation training metrics: floor collisions and jerk per step for picking and pushing tasks. 10 independent runs are displayed, and the median …
Original abstract

In real-world reinforcement learning (RL), the choice of action space can play a key role in shaping motion smoothness, safety, and overall task performance. In this study, we evaluate pose increment, pose velocity, joint position increment, and joint velocity across two vision-based manipulation tasks: object picking and pushing. We train policies in simulation and deploy them to the real world using sim-to-real transfer. We find that action-space representation indeed significantly affects sim-to-real performance. In particular, we find that the joint velocity action space is best for the vision-based picking and pushing tasks in terms of smoothness and final task performance. We also provide practical guidance for RL practitioners in choosing action spaces for both simulation and real-world experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript empirically benchmarks four action spaces (pose increment, pose velocity, joint position increment, joint velocity) for vision-based RL policies on two manipulation tasks (object picking and pushing). Policies are trained in simulation and transferred to the real robot; the central claim is that joint velocity yields the best combination of motion smoothness and task performance, with additional practical guidance offered to practitioners.

Significance. If the reported ranking is supported by detailed quantitative results, this work would supply actionable, task-specific evidence on action-space choice for sim-to-real robotic RL, a topic of direct relevance to motion quality and deployment safety.

major comments (1)
  1. [Abstract] The claim that joint velocity is best for smoothness and final task performance is stated without any quantitative metrics, statistical tests, number of trials, or description of how smoothness and success were measured. This prevents verification that the data support the ranking.
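
As a rough illustration of why the number of trials matters here, the sketch below computes a Wilson score interval for a success rate. The rebuttal on this page mentions roughly 10 real-world rollouts per action space; the success count used in the usage note is hypothetical and is not taken from the paper, and the choice of the Wilson interval is an assumption of this sketch rather than the authors' analysis.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

For a hypothetical 8 successes out of 10 rollouts, `wilson_interval(8, 10)` gives an interval of roughly 0.49 to 0.94, so two action spaces that differ by one or two successes at this sample size may not be distinguishable without further trials or a formal test.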

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding the abstract. We address it directly below.

  1. Referee: [Abstract] The claim that joint velocity is best for smoothness and final task performance is stated without any quantitative metrics, statistical tests, number of trials, or description of how smoothness and success were measured. This prevents verification that the data support the ranking.

    Authors: We agree that the abstract in its current form is too concise and does not include supporting quantitative details. The main body of the manuscript (Sections 4 and 5) already reports success rates, smoothness metrics (e.g., mean squared jerk), number of trials (typically 10 real-world rollouts per action space), and the exact measurement protocols for both simulation and real-robot evaluation. In the revised manuscript we will expand the abstract to include the key quantitative results and a brief statement of how smoothness and success were quantified, while preserving the word limit. revision: yes
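
The mean squared jerk mentioned in the response could be computed along the following lines. This is a minimal sketch under assumptions: the paper's exact definition is not given on this page (for example whether jerk is taken in joint space or end-effector space, or how it is normalized), so the function name, the use of third-order finite differences on sampled positions, and the summation over joints are choices made here for illustration.

```python
import numpy as np


def mean_squared_jerk(positions, dt):
    """Mean over time of the squared jerk norm, from positions sampled every dt.

    positions: array of shape (T,) or (T, n_joints), with T >= 4.
    """
    positions = np.asarray(positions, dtype=float)
    if positions.ndim == 1:
        positions = positions[:, None]
    if positions.shape[0] < 4:
        raise ValueError("need at least 4 samples for a third difference")
    # Third-order finite difference approximates the third time derivative.
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(np.sum(jerk**2, axis=1)))
```

As a sanity check, a trajectory q(t) = t^3 has constant jerk 6, so the function returns approximately 36, while a constant-acceleration trajectory q(t) = t^2 returns approximately 0. Lower values indicate smoother motion under this definition.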

Circularity Check

0 steps flagged

No significant circularity


This is a purely empirical benchmarking study that trains RL policies in simulation for four action spaces, evaluates them on two tasks, and reports sim-to-real transfer results. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. The central claim is an observed ranking from direct experiments, with no load-bearing step that reduces to its own inputs by construction. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical benchmarking study; no mathematical derivations or new entities introduced. Relies on standard RL training assumptions and sim-to-real transfer validity.

axioms (1)
  • domain assumption: Standard assumptions in RL training and sim-to-real transfer hold across the tested action spaces.
    The study reports performance differences after transfer without detailing how transfer success was ensured or controlled for each action space.


