
arxiv: 2511.11218 · v3 · submitted 2025-11-14 · 💻 cs.RO

Humanoid Whole-Body Badminton via Multi-Stage Reinforcement Learning

Pith reviewed 2026-05-17 22:39 UTC · model grok-4.3

classification 💻 cs.RO
keywords humanoid robots · reinforcement learning · whole-body control · badminton · dynamic interactions · curriculum learning · sim-to-real transfer

The pith

A multi-stage reinforcement learning pipeline produces a unified whole-body controller for humanoid badminton without motion priors or expert demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how reinforcement learning can train a single controller that makes a humanoid robot move its feet and swing its arms together to hit a shuttlecock. This matters because most prior humanoid work handled either walking or arm tasks separately, leaving fast dynamic interactions like sports out of reach. A three-stage curriculum first teaches footwork, then adds precise swinging, and finally refines the full hitting behavior so both body parts serve the same goal. The resulting system sustains rallies in simulation and returns shuttles at high speed on real hardware, including a version that works without explicit trajectory prediction.

Core claim

The authors develop a reinforcement-learning training pipeline that yields a unified whole-body controller for humanoid badminton, coordinating footwork and striking without motion priors or expert demonstrations. Training follows a three-stage curriculum (footwork acquisition, precision-guided swing generation, and task-focused refinement) so legs and arms jointly serve the hitting objective. For deployment, an Extended Kalman Filter (EKF) estimates and predicts shuttlecock trajectories, while a prediction-free variant removes the EKF and explicit prediction. In simulation, two robots sustain a rally of 21 consecutive hits; in real-world tests, the robot reaches outgoing shuttle speeds up to 19.1 m/s.
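The deployment step depends on predicting where the shuttle can be hit. The sketch below covers only the forward-prediction part that such a filter could use, not the full EKF (the measurement update is omitted). It assumes a point-mass model with quadratic drag, dv/dt = g - (|v|/L) v, in the spirit of the aerodynamic-length formulation in [30]. The function name `predict_intercept`, the value of `aero_length`, and the integration scheme are illustrative assumptions, not details taken from the paper.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def predict_intercept(pos, vel, z_hit, aero_length=4.6, dt=1e-3, t_max=5.0):
    """Integrate dv/dt = g - (|v| / L) * v until the shuttle descends through z_hit.

    pos, vel: (x, y, z) in metres and metres per second.
    aero_length: assumed aerodynamic length L in metres (illustrative value).
    Returns (x, y, z_hit, t) at the crossing, or None if it is not reached by t_max.
    """
    x, y, z = pos
    vx, vy, vz = vel
    t = 0.0
    while t < t_max:
        speed = math.sqrt(vx * vx + vy * vy + vz * vz)
        k = speed / aero_length  # drag rate |v| / L
        # Semi-implicit Euler: update velocity first, then position.
        vx -= k * vx * dt
        vy -= k * vy * dt
        vz -= (G + k * vz) * dt
        nx, ny, nz = x + vx * dt, y + vy * dt, z + vz * dt
        if z >= z_hit > nz:
            # Downward crossing of the hitting plane: interpolate linearly.
            frac = (z - z_hit) / (z - nz)
            return (x + frac * (nx - x), y + frac * (ny - y), z_hit, t + frac * dt)
        x, y, z = nx, ny, nz
        t += dt
    return None


# Hypothetical launch state, chosen only to exercise the function.
hit = predict_intercept(pos=(0.0, 0.0, 2.0), vel=(0.0, 8.0, 4.0), z_hit=1.55)
```

In a filter of this kind, the same dynamics would typically also drive the state-propagation step, with the predicted intercept passed to the policy as the hitting target.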

What carries the argument

The three-stage curriculum, which progressively builds footwork, then precision swings, then task-focused refinement, so that locomotion and manipulation jointly optimize the hitting objective.
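One way to picture a staged curriculum that trains a single policy is as a schedule that switches reward weights while the policy parameters carry over. The sketch below is a guess at that structure. The stage names follow the paper, but the iteration counts, reward-term names, weights, and the iteration-based switching rule are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    iterations: int        # hypothetical stage length in training iterations
    reward_weights: dict   # hypothetical reward term -> weight


# Stage names come from the paper; every number below is a placeholder.
CURRICULUM = (
    Stage("footwork_acquisition", 1000, {"footwork": 1.0, "swing": 0.0, "hit": 0.0}),
    Stage("precision_guided_swing", 1000, {"footwork": 0.5, "swing": 1.0, "hit": 0.0}),
    Stage("task_focused_refinement", 1000, {"footwork": 0.2, "swing": 0.5, "hit": 1.0}),
)


def stage_at(iteration, curriculum=CURRICULUM):
    """Return the stage active at a given training iteration."""
    start = 0
    for stage in curriculum:
        if iteration < start + stage.iterations:
            return stage
        start += stage.iterations
    return curriculum[-1]  # stay in the final stage once the schedule is exhausted


def total_reward(terms, stage):
    """Weighted sum of per-term rewards under the active stage's weights."""
    return sum(stage.reward_weights.get(name, 0.0) * value
               for name, value in terms.items())
```

Because only the weights change between stages, the same policy network is optimized throughout, which is what distinguishes this from training separate leg and arm controllers.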

If this is right

  • Legs and arms can be trained to serve a shared dynamic objective rather than being optimized in isolation.
  • Both prediction-using and prediction-free policies achieve comparable hitting performance on hardware.
  • The same pipeline supports both machine-fed shuttles and human-robot rallies with mean return distances around 4 m.
  • Whole-body coordination learned this way extends the range of feasible fast-moving object interactions for humanoids.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar curricula could be applied to other dynamic sports requiring simultaneous locomotion and striking, such as tennis or volleyball.
  • The success of the prediction-free variant suggests that many dynamic tasks may not require explicit state estimation once the policy is sufficiently trained.
  • Testing the same controller against faster or irregularly spinning shuttles would reveal the limits of the learned robustness.
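According to the Figure 3 caption, the prediction-free policy infers the impact pose from the first five recorded shuttlecock positions after serving. A minimal sketch of how such an observation could be assembled is shown below. The class name, the zero-padding before five samples arrive, and the flat 15-element layout are assumptions made for illustration.

```python
class ShuttleObservationBuffer:
    """Keeps the first n recorded shuttle positions of a rally (n = 5 in the paper)."""

    def __init__(self, n_positions=5):
        self.n_positions = n_positions
        self._positions = []

    def reset(self):
        """Clear the buffer at the start of a new serve."""
        self._positions.clear()

    def push(self, position):
        """Record an (x, y, z) sample; samples after the first n are ignored."""
        if len(self._positions) < self.n_positions:
            self._positions.append(tuple(float(c) for c in position))

    def is_full(self):
        return len(self._positions) == self.n_positions

    def as_observation(self):
        """Flatten to a fixed-length list, zero-padded (an assumed convention)."""
        flat = [c for p in self._positions for c in p]
        flat.extend([0.0] * (3 * self.n_positions - len(flat)))
        return flat
```

A fixed-length vector like this can be concatenated with proprioceptive inputs, so the policy network's input size does not depend on how many samples have arrived.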

Load-bearing premise

The three-stage curriculum enables the legs and arms to jointly optimize the hitting objective, and the learned policy transfers from simulation to real hardware without additional motion priors or detailed domain randomization.

What would settle it

A real-world test in which the robot repeatedly fails to coordinate foot placement with arm swing timing, producing rallies shorter than a few exchanges or outgoing speeds below 10 m/s, would show the curriculum does not achieve the claimed joint optimization and transfer.

Figures

Figures reproduced from arXiv: 2511.11218 by Chenhao Liu, Jinchen Fu, Kairan Yao, Leyun Jiang, Xiaoyu Ren, Yibo Wang.

Figure 1: Real-world humanoid badminton. A fully autonomous humanoid returns machine-fed shuttles in a motion-capture arena; overlaid arcs show an incoming (blue) and returned (orange) trajectory. Project Page: humanoid-badminton.github.io.
Figure 2: System overview. (a) Training: PPO learns a single policy πWBC using Privileged Critic Obs together with Actor Obs (no history) under a three-stage curriculum. All observations and rewards in (a) come from the simulation environment. (b) Environment: The humanoid is 1.28 m tall, weighs 30 kg, and has 21 DoF. A 3D-printed mount attaches the racket orthogonally to the forearm. The robot is initialized above …
Figure 3: Simulation results. Figure (a) illustrates the Two-Robot Rally scenario, where two identical humanoid robots sustain a rally of 21 consecutive returns. Figure (b) demonstrates the Prediction-Free policy: the robot infers the optimal impact position and orientation solely from the first five recorded shuttlecock positions after serving. Figure (c) presents the Target-Known policy, where a predetermined hi…
Figure 4: Comparison between target-known and prediction-free policy. The top part of this figure shows the position error for both strategies. The middle section of the figure shows the orientation error comparison, the orientation corresponds to the normal direction of the racket face. The bottom part of the figure compares swing velocity.
… not model additional delays, and we do not fit actuator network model [31]…
Figure 6: Virtual-Target Swinging. The upper portion of the figure depicts the Euclidean distance error between the racket center and the designated hitting position at the moment of impact, while the lower portion illustrates the corresponding racket speed at impact.
… flat trajectories in the indoor mocap environment leave only a very short reaction window, we constrain the interception area to a reasonable range fo…
Figure 8: Trajectory generation. Shuttlecock trajectories are filtered to ensure interception points within the region x ∈ [−0.8, 0.8] m, y ∈ [−1, 0.2] m and z ∈ [1.5, 1.6] m for robot training.
Figure 9: Individual trajectory analysis. An example of sampled shuttle flight trajectory (gold) with the selected interception point (red). Corresponding target frame at the intercept is drawn, where z is the incoming-flight direction. The annotation reports the intercept position, orientation and time-to-intercept.
Figure 10: Training trajectory statistics. Distribution of the shuttlecock interception time. We generated 2 million trajectories, from which 196,940 met the criteria and were selected for robot training. The majority of these trajectories reached the hitting zone within a time interval of [0.8, 1.4] seconds.
Figure 11: EKF prediction accuracy under varying aerodynamic characteristic lengths. The parameter α scales the characteristic length to emulate variations in shuttle aerodynamics.
Figure 12: Trajectory of the racket center. For a designated hitting position at (50, -250, 1540) mm, the robot executed 20 swinging motions. The green spheres represent the positions of the racket center as it passed through the z = 1540 mm plane during each swing.
Figure 13: Swing Error Analysis. Over 20 repeated swings, the mean Euclidean distance error was measured at 23.21 mm, with a standard deviation of 10.55 mm. The maximum and minimum errors recorded were 51.07 mm and 6.34 mm, respectively.
Figure 14: Real-world striking motion. Three snapshots of a successful return in the real world. Left: in-place stepping before the shuttle is launched. Middle: approach and backswing phase as the shuttle launched (yellow box highlights the shuttle, the green arrow indicates the racket swing direction, and the white arrow indicates the stepping motion). Right: hit and follow-through: the robot simultaneously takes …
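The trajectory filter described in the Figure 8 caption, together with the counts in the Figure 10 caption, can be written down directly. The sketch below uses the region bounds and trajectory counts quoted in those captions. The function names and the assumption that each trajectory has already been reduced to a single interception point are illustrative.

```python
# Interception region in metres, as quoted in the Figure 8 caption.
HITTING_ZONE = {"x": (-0.8, 0.8), "y": (-1.0, 0.2), "z": (1.5, 1.6)}


def in_hitting_zone(point, zone=HITTING_ZONE):
    """True if an (x, y, z) interception point lies inside the training region."""
    x, y, z = point
    return (zone["x"][0] <= x <= zone["x"][1]
            and zone["y"][0] <= y <= zone["y"][1]
            and zone["z"][0] <= z <= zone["z"][1])


def filter_trajectories(intercepts, zone=HITTING_ZONE):
    """Keep only interception points that fall inside the region."""
    return [p for p in intercepts if in_hitting_zone(p, zone)]


def acceptance_rate(selected, generated):
    """Fraction of generated trajectories that survive the filter."""
    return selected / generated


# Counts from the Figure 10 caption: 196,940 kept out of 2 million generated.
rate = acceptance_rate(196_940, 2_000_000)
print(f"{rate:.2%}")  # 9.85%
```

The hitting position used in Figure 12, (50, -250, 1540) mm, lies inside this region, so `in_hitting_zone((0.05, -0.25, 1.54))` evaluates to `True`.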
Original abstract

Humanoid robots have demonstrated strong capabilities for interacting with static scenes across locomotion and manipulation, yet dynamic real-world interactions remain challenging. As a step toward fast-moving object interactions, we present a reinforcement-learning training pipeline that yields a unified whole-body controller for humanoid badminton, coordinating footwork and striking without motion priors or expert demonstrations. Training follows a three-stage curriculum (footwork acquisition, precision-guided swing generation, and task-focused refinement) so legs and arms jointly serve the hitting objective. For deployment, we use an Extended Kalman Filter (EKF) to estimate and predict shuttlecock trajectories for target striking, and also develop a prediction-free variant that removes the EKF and explicit prediction. We validate the framework with five sets of experiments in simulation and on hardware. In simulation, two robots sustain a rally of 21 consecutive hits. In real-world tests with both machine-fed shuttles and human-robot rallies, the robot achieves outgoing shuttle speeds up to 19.1 m/s with a mean return landing distance of 4 m. Moreover, the prediction-free variant attains comparable performance to the EKF-based target-known policy. Overall, our approach enables dynamic yet precise goal striking in humanoid badminton and suggests a path toward more dynamics-critical whole-body interaction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a reinforcement-learning training pipeline that yields a unified whole-body controller for humanoid badminton, coordinating footwork and striking without motion priors or expert demonstrations. Training follows a three-stage curriculum (footwork acquisition, precision-guided swing generation, and task-focused refinement) so legs and arms jointly serve the hitting objective. For deployment, an Extended Kalman Filter (EKF) estimates and predicts shuttlecock trajectories, with a prediction-free variant also developed. Validation includes simulation experiments with 21-hit rallies and real-world tests achieving outgoing shuttle speeds up to 19.1 m/s with a mean return landing distance of 4 m.

Significance. If the central claims hold, this work advances dynamic whole-body control for humanoids in fast-moving object interactions. The empirical results in simulation and hardware, including comparable performance of the prediction-free variant, provide supporting evidence for sim-to-real transfer in dynamic tasks and suggest broader applicability to other dynamics-critical interactions. The lack of reliance on motion priors is a notable strength.

major comments (2)
  1. §3 (Curriculum Design): The central claim that the three-stage curriculum produces a single unified policy in which legs and arms co-optimize for hitting is not supported by ablations. No comparison is reported against a single-stage baseline using the same total compute budget and reward shaping, so it remains unclear whether the 21-hit rallies and hardware metrics arise from joint optimization or from sequential specialization of independent skills.
  2. Reward Formulation (Methods): Stage-specific reward weights are referenced but the explicit reward functions, their mathematical forms, and how they enforce joint leg-arm optimization across stages are not detailed. This omission directly affects assessment of whether the final policy discovers coordinated whole-body strategies rather than composing separately trained behaviors.
minor comments (2)
  1. [Abstract] The abstract would benefit from reporting the number of independent training runs and variance for the 21-hit rally result to strengthen the robustness claim.
  2. [Figures] Figure captions and legends for hardware experiments could more clearly distinguish results from the EKF-based and prediction-free policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address the two major comments point by point below and outline the revisions we will make to strengthen the presentation of the curriculum and reward design.

Point-by-point responses
  1. Referee: §3 (Curriculum Design): The central claim that the three-stage curriculum produces a single unified policy in which legs and arms co-optimize for hitting is not supported by ablations. No comparison is reported against a single-stage baseline using the same total compute budget and reward shaping, so it remains unclear whether the 21-hit rallies and hardware metrics arise from joint optimization or from sequential specialization of independent skills.

    Authors: We acknowledge that the manuscript does not include a direct ablation against a single-stage baseline trained with an identical total compute budget and reward shaping. The three-stage curriculum was designed to progressively build the necessary skills for whole-body coordination in a high-dimensional task where direct end-to-end training often fails to converge to effective policies. While the current results demonstrate successful joint leg-arm behavior in both simulation rallies and hardware, we agree that a matched-compute single-stage comparison would provide clearer evidence. In the revised manuscript we will add this baseline experiment, using the same total training steps and reward components, to quantify the contribution of the staged approach.

    revision: yes

  2. Referee: Reward Formulation (Methods): Stage-specific reward weights are referenced but the explicit reward functions, their mathematical forms, and how they enforce joint leg-arm optimization across stages are not detailed. This omission directly affects assessment of whether the final policy discovers coordinated whole-body strategies rather than composing separately trained behaviors.

    Authors: We appreciate this observation. The manuscript describes the high-level structure and purpose of the stage-specific rewards but omits the full mathematical definitions. In the revised version we will provide the explicit reward equations for each stage, including all terms (e.g., foot placement, swing timing, shuttle velocity, and posture stability) and their respective weights. These formulations are constructed so that the hitting objective is shared across the body, encouraging the policy to discover coordinated strategies rather than independent sub-skills. The added detail will allow readers to evaluate the joint-optimization mechanism directly.

    revision: yes

Circularity Check

0 steps flagged

Empirical RL training pipeline exhibits no circular derivation

full rationale

The paper describes a multi-stage reinforcement learning pipeline for training a humanoid badminton controller, with results obtained via simulation training (21-hit rallies) and real-world hardware validation (19.1 m/s strikes). No mathematical derivation chain, first-principles equations, or uniqueness theorems are presented that reduce to the inputs by construction. Claims rest on empirical outcomes from curriculum-based policy optimization and EKF-based or prediction-free deployment, without fitted parameters renamed as predictions or self-citations serving as load-bearing justification for the central results. The approach is self-contained against external benchmarks of training success and transfer.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions and sim-to-real transfer; free parameters consist of stage-specific reward weights and curriculum hyperparameters that are tuned to achieve the reported performance.

free parameters (1)
  • stage-specific reward weights
    Weights balancing footwork acquisition, swing precision, and task success are adjusted during each curriculum stage to coordinate legs and arms.
axioms (1)
  • domain assumption: The simulated environment sufficiently matches real-world dynamics for policy transfer.
    Implicit in the reported hardware validation after simulation training.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rhythm: Learning Interactive Whole-Body Control for Dual Humanoids

    cs.RO 2026-03 unverdicted novelty 7.0

    Rhythm transfers interactive whole-body behaviors from simulation to real dual Unitree G1 humanoids via interaction-aware retargeting and graph-reward RL.

  2. SigLoMa: Learning Open-World Quadrupedal Loco-Manipulation from Ego-Centric Vision

    cs.RO 2026-05 unverdicted novelty 6.0

    SigLoMa enables dynamic loco-manipulation on quadrupeds from ego-centric 5 Hz vision alone by using Sigma Points for scalable exteroception, an ego-centric Kalman Filter for high-rate state estimation, and an active s...

  3. HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

    cs.RO 2026-02 unverdicted novelty 6.0

    HAIC enables robust humanoid interactions with underactuated objects by predicting their dynamics from proprioceptive history and using a world model for adaptive control.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 3 Pith papers · 3 internal anchors

  [1] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Real-world humanoid locomotion with reinforcement learning. Science Robotics, 9(89):eadi9579, 2024.
  [2] Takara E Truong, Qiayuan Liao, Xiaoyu Huang, Guy Tevet, C Karen Liu, and Koushil Sreenath. BeyondMimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2(3), 2025.
  [3] Ri-Zhao Qiu, Shiqi Yang, Xuxin Cheng, Chaitanya Chawla, Jialong Li, Tairan He, Ge Yan, David J Yoon, Ryan Hoque, Lars Paulsen, et al. Humanoid policy ~ human policy. arXiv preprint arXiv:2503.13441, 2025.
  [4] Tairan He, Zhengyi Luo, Wenli Xiao, Chong Zhang, Kris Kitani, Changliu Liu, and Guanya Shi. Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8944–8951. IEEE, 2024.
  [5] Xuxin Cheng, Yandong Ji, Junming Chen, Ruihan Yang, Ge Yang, and Xiaolong Wang. Expressive whole-body control for humanoid robots. arXiv preprint arXiv:2402.16796, 2024.
  [6] Ziwen Zhuang, Shenzhe Yao, and Hang Zhao. Humanoid parkour learning. arXiv preprint arXiv:2406.10759, 2024.
  [7] Tairan He, Wenli Xiao, Toru Lin, Zhengyi Luo, Zhenjia Xu, Zhenyu Jiang, Jan Kautz, Changliu Liu, Guanya Shi, Xiaolong Wang, et al. Hover: Versatile neural whole-body controller for humanoid robots. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 9989–9996. IEEE, 2025.
  [8] David B D’Ambrosio, Saminda Abeyruwan, Laura Graesser, Atil Iscen, Heni Ben Amor, Alex Bewley, Barney J Reed, Krista Reymann, Leila Takayama, Yuval Tassa, et al. Achieving human level competitive robot table tennis. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 74–82. IEEE, 2025.
  [9] David B D’Ambrosio, Navdeep Jaitly, Vikas Sindhwani, Ken Oslund, Peng Xu, Nevena Lazic, Anish Shankar, Tianli Ding, Jonathan Abelian, Erwin Coumans, et al. Robotic table tennis: A case study into a high speed learning system. In Robotics: Science and Systems, 2023.
  [10] Zhi Su, Bike Zhang, Nima Rahmanian, Yuman Gao, Qiayuan Liao, Caitlin Regan, Koushil Sreenath, and S Shankar Sastry. Hitter: A humanoid table tennis robot via hierarchical planning and learning. arXiv preprint arXiv:2508.21043, 2025.
  [11] Muqun Hu, Wenxi Chen, Wenjing Li, Falak Mandali, Zijian He, Renhong Zhang, Praveen Krisna, Katherine Christian, Leo Benaharon, Dizhi Ma, et al. Towards versatile humanoid table tennis: Unified reinforcement learning with prediction augmentation. arXiv preprint arXiv:2509.21690, 2025.
  [12] Yuntao Ma, Andrei Cramariuc, Farbod Farshidian, and Marco Hutter. Learning coordinated badminton skills for legged manipulators. Science Robotics, 10(102):eadu3922, 2025.
  [13] Yuntao Ma, Farbod Farshidian, Takahiro Miki, Joonho Lee, and Marco Hutter. Combining learning-based locomotion policy with model-based manipulation for legged mobile manipulators. IEEE Robotics and Automation Letters, 7(2):2377–2384, 2022.
  [14] Yuanhang Zhang, Tianhai Liang, Zhenyang Chen, Yanjie Ze, and Huazhe Xu. Catch it! learning to catch in flight with mobile dexterous hands. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 14385–14391, 2025.
  [15] Yitang Li, Yuanhang Zhang, Wenli Xiao, Chaoyi Pan, Haoyang Weng, Guanqi He, Tairan He, and Guanya Shi. Hold my beer: Learning gentle humanoid locomotion and end-effector stabilization control. arXiv preprint arXiv:2505.24198, 2025.
  [16] Haochen Wang, Zhiwei Shi, Chengxi Zhu, Yafei Qiao, Cheng Zhang, Fan Yang, Pengjie Ren, Lan Lu, and Dong Xuan. Integrating learning-based manipulation and physics-based locomotion for whole-body badminton robot control. arXiv preprint arXiv:2504.17771, 2025.
  [17] Zipeng Fu, Xuxin Cheng, and Deepak Pathak. Deep whole-body control: Learning a unified policy for manipulation and locomotion. In Karen Liu, Dana Kulic, and Jeff Ichnowski, editors, Proceedings of The 6th Conference on Robot Learning, volume 205 of Proceedings of Machine Learning Research, pages 138–149. PMLR, 14–18 Dec 2023.
  [18] Aravind Elanjimattathil Vijayan, Andrei Cramariuc, Mattia Risiglione, Christian Gehring, and Marco Hutter. Multi-critic learning for whole-body end-effector twist tracking. In Joseph Lim, Shuran Song, and Hae-Won Park, editors, Proceedings of The 9th Conference on Robot Learning, volume 305 of Proceedings of Machine Learning Research, pages 1470–1485. PML...
  [19] Zifan Wang, Yufei Jia, Lu Shi, Haoyu Wang, Haizhou Zhao, Xueyang Li, Jinni Zhou, Jun Ma, and Guyue Zhou. Arm-constrained curriculum learning for loco-manipulation of a wheel-legged robot. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10770–10776. IEEE, 2024.
  [20] Dianyong Hou, Chengrui Zhu, Zhen Zhang, Zhibin Li, Chuang Guo, and Yong Liu. Efficient learning of a unified policy for whole-body manipulation and locomotion skills. arXiv preprint arXiv:2507.04229, 2025.
  [21] Huy Ha, Yihuai Gao, Zipeng Fu, Jie Tan, and Shuran Song. Umi on legs: Making manipulation policies mobile with manipulation-centric whole-body controllers. arXiv preprint arXiv:2407.10353, 2024.
  [22] Yuntao Ma, Yang Liu, Kaixian Qu, and Marco Hutter. Learning accurate whole-body throwing with high-frequency residual policy and pullback tube acceleration. arXiv preprint arXiv:2506.16986, 2025.
  [23] Fan Yang, Zhiwei Shi, Sixian Ye, Jiazhong Qian, Wenjie Wang, and Dong Xuan. Varsm: Versatile autonomous racquet sports machine. In 2022 ACM/IEEE 13th International Conference on Cyber-Physical Systems (ICCPS), pages 203–214, 2022.
  [24] Zulfiqar Zaidi, Daniel Martin, Nathaniel Belles, Viacheslav Zakharov, Arjun Krishna, Kin Man Lee, Peter Wagstaff, Sumedh Naik, Matthew Sklar, Sugju Choi, et al. Athletic mobile manipulator system for robotic wheelchair tennis. IEEE Robotics and Automation Letters, 8(4):2245–2252, 2023.
  [25] Zhongyu Li, Xue Bin Peng, Pieter Abbeel, Sergey Levine, Glen Berseth, and Koushil Sreenath. Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control. The International Journal of Robotics Research, 44(5):840–888, 2025.
  [26] Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. In Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.
  [27] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  [28] Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
  [29] Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022.
  [30] Caroline Cohen, Baptiste Darbois Texier, David Quéré, and Christophe Clanet. The physics of badminton. New Journal of Physics, 17(6):063001, 2015.
  [31] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter. Learning agile and dynamic motor skills for legged robots. Science Robotics, 4(26):eaau5872, 2019.
  [32] Filip Bjelonic, Fabian Tischhauser, and Marco Hutter. Towards bridging the gap: Systematic sim-to-real transfer for diverse legged robots. arXiv preprint arXiv:2509.06342, 2025.

APPENDIX A. System Parameters. Per-joint PD gains (Table I) list the proportional and derivative gains used by the 500 Hz joint-space PD controller for each controllable DoF in bo...
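The appendix fragment above mentions a 500 Hz joint-space PD controller with per-joint gains. The textbook form of that control law is sketched below for a single joint. The gain values in the paper's Table I are not reproduced here, so the `kp` and `kd` arguments are placeholders, and treating the controller output as a joint torque is an assumption.

```python
CONTROL_RATE_HZ = 500                # rate stated in the appendix fragment
CONTROL_DT = 1.0 / CONTROL_RATE_HZ   # 2 ms control period


def pd_torque(q_des, q, qd, kp, kd, qd_des=0.0):
    """Standard joint-space PD law: tau = kp * (q_des - q) + kd * (qd_des - qd).

    q_des, q: desired and measured joint position (rad).
    qd, qd_des: measured and desired joint velocity (rad/s).
    kp, kd: proportional and derivative gains (placeholders, not Table I values).
    """
    return kp * (q_des - q) + kd * (qd_des - qd)
```

In a typical setup of this kind, the learned policy outputs the position targets `q_des` at a lower rate and the PD loop tracks them at the higher control rate.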