
arxiv: 2606.26215 · v1 · pith:ROG53WRSnew · submitted 2026-06-24 · 💻 cs.RO · cs.CV

TaskNPoint: How to Teach Your Humanoid to Hit a Backhand in Minutes

Pith reviewed 2026-06-26 01:55 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords TaskNPoint · humanoid robot learning · dynamic skills · imitation learning · interaction window · simulation training · zero-shot generalization · tennis skills

The pith

Dynamic humanoid skills reduce to mastering a few actions until a short human-identified interaction window is hit correctly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the outcome of dynamic skills is decided in a brief crucial portion of the trajectory, such as the 20 cm of racket travel at ball contact. This structural property means learning reduces to a human coach naming a handful of skills, giving one short video demo per skill, pointing out the interaction window, and stating the goal, while simulation training fills in the full trajectory and adds robustness. Randomized target sampling during training lets the single demo generalize to new goal locations without extra data. The result is successful learning of forehands, backhands, kicks, and pick-and-place on a Unitree G1 from short human videos after under an hour of training on one GPU and with no per-task reward tuning.

Core claim

The outcome of dynamic skills is decided by a short, crucial portion of the trajectory. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the interaction window comes out right. TaskNPoint makes the coach-learner division explicit: the human contributes a discrete set of skills, one demonstration per skill, identification of the interaction window, and the goal; learning in a physically realistic simulation environment fills in each action trajectory and provides robustness to unmodeled events, with randomized target sampling enabling zero-shot generalization to unseen goal locations.
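A minimal sketch of what randomized target sampling could look like, assuming targets are drawn uniformly from a ball centered on the nominal contact point taken from the demonstration. The helper name `sample_target`, the radius, the coordinates, and the uniform distribution are all illustrative assumptions; the source states only that targets are randomized during training.

```python
import math
import random


def sample_target(nominal, radius, rng):
    """Draw a point uniformly from the ball of `radius` around `nominal`.

    Hypothetical helper: the uniform-ball distribution is an assumption,
    not a detail confirmed by the paper.
    """
    # Direction: a normalized 3D Gaussian sample is uniform on the sphere.
    while True:
        d = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in d))
        if norm > 1e-12:
            break
    # Radius: the cube root of a uniform variate gives uniform volume density.
    r = radius * rng.random() ** (1.0 / 3.0)
    return tuple(c + r * x / norm for c, x in zip(nominal, d))


rng = random.Random(0)
nominal_contact = (0.6, -0.3, 1.0)  # metres, illustrative values only
targets = [sample_target(nominal_contact, 0.25, rng) for _ in range(1000)]
```

Each training episode would then condition the policy on one sampled target, so that a single demonstration is practiced against many nearby goals.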

What carries the argument

The TaskNPoint training protocol. The human coach identifies the short interaction window that decides the outcome, and simulation-based practice then concentrates on coordinating the full motion so that control, physics, and morphology align at that window.
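One way to picture window-focused practice is a task reward that is active only while the motion phase lies inside the coach-identified window. The sketch below assumes a normalized phase in [0, 1], exponential reward kernels, and hand-picked scales; the function name, parameters, and kernel shape are hypothetical and are not taken from the paper.

```python
import math


def window_reward(phase, window, pos_err, vel_err, sigma_pos=0.05, sigma_vel=1.0):
    """Task reward that is active only inside the interaction window.

    phase   : normalized motion phase in [0, 1]
    window  : (start, end) phase range named by the coach
    pos_err : distance from end effector to target, metres
    vel_err : norm of the end-effector velocity error, m/s
    The exponential kernels and sigma values are illustrative assumptions.
    """
    start, end = window
    if not (start <= phase <= end):
        return 0.0  # outside the window, no task reward is given
    r_pos = math.exp(-(pos_err / sigma_pos) ** 2)
    r_vel = math.exp(-(vel_err / sigma_vel) ** 2)
    return r_pos * r_vel


inside = window_reward(0.52, (0.48, 0.56), pos_err=0.0, vel_err=0.0)
outside = window_reward(0.20, (0.48, 0.56), pos_err=0.0, vel_err=0.0)
```

Gating the task term this way leaves the rest of the trajectory to be shaped by whatever motion-tracking objective is used, which is the division of labor the paragraph above describes.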

If this is right

  • A single human video demonstration per skill suffices when the interaction window is supplied.
  • Randomized target sampling during training produces zero-shot generalization to unseen goal locations.
  • No per-task reward tuning is required for successful policy learning across multiple skills.
  • Policies learned this way transfer from simulation to the physical Unitree G1 for real dynamic tasks.
  • The same protocol applies to other skills whose outcomes hinge on a brief interaction segment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If interaction window identification can be automated from video, the method could scale to larger skill sets with less human time.
  • The coach-learner split may serve as a template for efficient learning of other dynamic tasks on different robot platforms.
  • Combining the protocol with stronger domain randomization could further improve robustness when simulation-reality gaps are larger.
  • Success on contact-rich tasks suggests the window-focused approach could help with locomotion or other whole-body dynamic behaviors.

Load-bearing premise

The human coach can correctly identify the short interaction window that decides the outcome for each skill, and the simulation is realistic enough for the learned policies to transfer to the physical robot.

What would settle it

A test on the physical Unitree G1 where the robot fails to hit balls or place boxes at novel locations after the described short simulation training from one human demo per skill and the identified windows.

Figures

Figures reproduced from arXiv: 2606.26215 by Aaron D. Ames, Blake Werner, Ilona Demler, Pietro Perona.

Figure 1: TaskNPoint is a training protocol that teaches humanoid robots dynamic skills from a single human video demonstration per motion. We have a human “coach” specify the discrete skills and critical interaction window, from which we generate goals G. Reinforcement learning in simulation with randomized targets lets a single demonstration generalize zero-shot to new target locations, trainable in under an hour …

Figure 2: TaskNPoint Overview. From a small collection of videos of human demonstrations (one video per task), we provide a pipeline for learning a repertoire of goal-conditioned motions interacting with dynamic environments. We first reconstruct the human motions using SMPLX parameters (section 4), which we then kinematically retarget to the humanoid (section 4). A higher level planner conditions action selection o…

Figure 3: Video Demonstrations. We collect single-view (left) and multi-view (right) demonstrations and reconstruct human poses using state-of-the-art reconstruction methods (section 4). For the multi-view demonstrations, we fuse per-view estimates into a maximum-likelihood pose estimate.
… racket will meet the ball (both are traveling at high speed, thus such uncertainty is unavoidable), we generate goals Gtrain = {…

Figure 4: Policy Architecture. (Top): We reconstruct reference single-view or multi-view human demonstrations via SMPL-X parameters (section 4). For multi-view video, we lift reconstructions into a shared coordinate space and calculate a maximum likelihood pose. (Middle): In training we use Asymmetric Actor Critic Policy Optimization [49] to optimize motions Ai from our motion library A, from which we sample randomi…

Figure 5: Training Rewards. TaskNPoint learns a policy given a nominal target point and motion reference from a human demonstration (section 4). Motions and contact points are randomly sampled throughout training (section 5); position, velocity, and orientation rewards are assigned during the duration of contact. Segments in each motion tubule are color-coded by reward value.
… gravity vector, and previous action, wh…

Figure 6: TaskNPoint Space Coverage. Our motion abstraction formulation allows us to cover a wide task space. Each colored sphere corresponds to a distinct reference human motion demonstration, and is centered around the point of contact. During training (section 5) we randomly sample points around each point of contact to provide the learning algorithm with a diverse set of possible ball trajectories. The radius o…

Figure 7: Qualitative Results. We train TaskNPoint on a small set of reference demonstrations of tennis shots, soccer kicks, and box pick-and-place. By randomizing incoming target trajectories and positions, we are able to generalize to unseen target locations in deployment. We show successful execution of tennis shots (1st and 2nd row), soccer kicks (3rd row), and box pick-and-place (4th row). The columns show samp…

Figure 8: MLE poses. We test how well TaskNPoint scales with the number of demonstrations by training on MLE-estimated human poses from in-the-wild multi-view data of tennis practices and matches (section 4). Here we visualize the nominal point sampling volume during training. Each colored sphere is centered around the nominal point and corresponds to a distinct human demonstration, and the radius of each sphere cor…

Figure 9: TaskNPoint Phase Histogram. We get density of target hit time even with very sparse rewards. We define the goal window as the phase range that we empirically determine is sufficient to get a solid hit.
D Hardware Results. We evaluate on three hardware tasks. Tennis success is a racket–ball contact; soccer success is a foot–ball contact; box pick-and-place success requires lifting a 0.3 m x 0.3 m x 0.5 m box an…
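The captions for Figures 3 and 4 mention fusing per-view estimates into a maximum-likelihood pose. A minimal sketch of one such fusion rule follows, assuming each view gives a 3D joint position in a shared frame with independent isotropic Gaussian noise of known variance; under that assumption the maximum-likelihood estimate is the inverse-variance weighted mean. The noise model, the function name `fuse_views`, and the numbers are assumptions, since the captions do not specify them.

```python
def fuse_views(estimates, variances):
    """Inverse-variance weighted mean of per-view 3D joint estimates.

    Under independent isotropic Gaussian noise this is the maximum-likelihood
    estimate. The noise model is an assumption; the captions do not state it.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return tuple(
        sum(w * e[k] for w, e in zip(weights, estimates)) / total
        for k in range(3)
    )


# Two hypothetical views of one joint, already in a shared world frame.
fused = fuse_views([(1.0, 0.0, 1.5), (1.2, 0.0, 1.5)], [0.01, 0.03])
```

The view with the smaller variance pulls the fused estimate toward itself, which is the behavior one would want when one camera sees the joint more clearly than another.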
Original abstract

How do we learn to hit a tennis backhand? Not from a thousand hours of tennis tournaments on TV - we work with a coach and practice. We argue this is also the right recipe for teaching dynamic skills to humanoid robots. This follows from a structural property of dynamic skills: the outcome is decided by a short, crucial portion of the trajectory - for a backhand, the ~20cm of racket travel around ball contact. Getting this interaction window right requires coordinating the whole motion, so that control, physics, and morphology act in concert. Learning thus reduces to mastering a handful of distinct actions and, for each, practicing until the window comes out right. To this end, we introduce TaskNPoint, a training protocol which makes the coach-learner division of labor explicit. The human coach contributes four inputs: a discrete set of skills (e.g. different shots), one demonstration per skill, identification of the interaction window, and the goal. Learning in a physically realistic simulation environment fills in each action trajectory and provides robustness to unmodeled events. Crucially, randomized target sampling during training lets a single demonstration generalize zero-shot to unseen goal locations. We test this approach on a Unitree G1 humanoid that hits forehands and backhands against balls thrown by a human, kicks incoming soccer balls, and picks and places boxes from novel locations. We find that learning is successful from short human video demonstrations and under an hour of training on a single GPU, with no per-task reward tuning.
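The four coach inputs named in the abstract can be pictured as one small record per skill. The sketch below is only an illustration of that data structure: the class name `SkillSpec`, its field names, the file paths, and the choice to express the window as a normalized phase range are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SkillSpec:
    """One coached skill; every field name here is hypothetical."""
    name: str                          # discrete skill label, e.g. "backhand"
    demo_video: str                    # path to the single demonstration video
    window: Tuple[float, float]        # interaction window as (start, end) phase
    goal: Tuple[float, float, float]   # nominal goal point, metres

    def __post_init__(self):
        start, end = self.window
        if not (0.0 <= start < end <= 1.0):
            raise ValueError("window must satisfy 0 <= start < end <= 1")


# Illustrative values only; none of these numbers come from the paper.
skills = [
    SkillSpec("forehand", "demos/forehand.mp4", (0.45, 0.55), (0.6, 0.3, 1.0)),
    SkillSpec("backhand", "demos/backhand.mp4", (0.50, 0.60), (0.6, -0.3, 1.0)),
]
```

Everything else, including the full trajectory and robustness to disturbances, is left to simulation training in the protocol as the abstract describes it.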

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that dynamic skills for humanoid robots are determined by short interaction windows (e.g., ~20 cm of racket travel at ball contact), so learning reduces to a human coach supplying a discrete skill set, one video demo per skill, window identification, and goal; simulation training with randomized target sampling then fills in trajectories and enables zero-shot generalization to novel locations. The TaskNPoint protocol is tested on a Unitree G1 for forehand/backhand tennis shots, soccer kicks, and box pick-and-place, reporting success from short human demos and <1 hour of single-GPU training with no per-task reward tuning.

Significance. If the real-robot results hold with adequate quantification, the work would be significant for lowering the barrier to dynamic skill acquisition on humanoids by replacing reward engineering with explicit human-provided structure and randomized sampling. The zero-shot generalization via target randomization is a concrete, falsifiable strength that directly supports the central claim.

major comments (2)
  1. [Abstract] Abstract: the claim of 'successful' real-robot tests on the Unitree G1 is load-bearing for the central claim yet supplies no quantitative success rates, trial counts, baselines, error bars, or failure-mode statistics; without these the transfer from simulation to physical robot cannot be assessed.
  2. [Method] Method (TaskNPoint protocol): the human coach's identification of the interaction window is presented as a key input, but no sensitivity analysis, ablation, or robustness test is reported on how mis-specified windows affect policy learning or sim-to-real transfer; this assumption is load-bearing for the 'handful of actions + practice' reduction.
minor comments (1)
  1. [Abstract] The abstract states 'under an hour of training on a single GPU' but does not specify the exact GPU model, batch size, or whether this includes data collection time; add these details for reproducibility.
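As an illustration of the kind of quantification the first major comment asks for, a success rate over a small number of hardware trials can be reported with a Wilson score interval rather than a bare percentage. The counts below are placeholders for illustration only and are not results from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder counts for illustration only; not results from the paper.
low, high = wilson_interval(successes=17, trials=20)
```

With trial counts in the tens, the interval stays wide, which is why the trial count matters as much as the headline rate when judging sim-to-real transfer.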

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'successful' real-robot tests on the Unitree G1 is load-bearing for the central claim yet supplies no quantitative success rates, trial counts, baselines, error bars, or failure-mode statistics; without these the transfer from simulation to physical robot cannot be assessed.

    Authors: We agree that the abstract must include quantitative metrics to allow assessment of sim-to-real transfer. We will revise the abstract to report success rates, trial counts, and failure statistics for the Unitree G1 experiments on tennis shots, soccer kicks, and box pick-and-place. revision: yes

  2. Referee: [Method] Method (TaskNPoint protocol): the human coach's identification of the interaction window is presented as a key input, but no sensitivity analysis, ablation, or robustness test is reported on how mis-specified windows affect policy learning or sim-to-real transfer; this assumption is load-bearing for the 'handful of actions + practice' reduction.

    Authors: The interaction window is a load-bearing human input. We acknowledge the absence of sensitivity analysis in the submitted manuscript. We will add a dedicated discussion of robustness to window mis-specification together with an ablation study on window placement in the revised version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method defined by explicit external inputs

Full rationale

The paper introduces TaskNPoint as a protocol whose inputs are four explicit human contributions (discrete skills, one demo per skill, interaction window identification, goal) plus simulation practice; the central claim that outcomes are decided by short interaction windows is presented as an observed structural property of the skills rather than derived from any fitted quantity or self-referential equation. No equations appear, no parameters are fitted to subsets and then relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. The approach is therefore self-contained against external benchmarks (human video demos, GPU training, physical transfer) without any step reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only the abstract was available; the central structural assumption is stated directly but no free parameters, additional axioms, or invented entities are enumerated.

axioms (1)
  • domain assumption: The outcome of dynamic skills is decided by a short, crucial portion of the trajectory (the interaction window).
    Invoked in the first paragraph of the abstract as the structural property motivating the approach.



Reference graph

Works this paper leans on

60 extracted references · 2 canonical work pages

  1. V. Makoviychuk, L. Wawrzyniak, Y. Guo, M. Lu, K. Storey, M. Macklin, D. Hoeller, N. Rudin, A. Allshire, A. Handa, et al. Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021
  2. A. Allshire, H. Choi, J. Zhang, D. McAllister, A. Zhang, C. M. Kim, T. Darrell, P. Abbeel, J. Malik, and A. Kanazawa. Visual imitation enables contextual humanoid control. arXiv preprint arXiv:2505.03729, 2025
  3. H. Weng, Y. Li, N. Sobanbabu, Z. Wang, Z. Luo, T. He, D. Ramanan, and G. Shi. Hdmi: Learning interactive humanoid whole-body control from human videos. arXiv preprint arXiv:2509.16757, 2025
  4. L. Yang, X. Huang, Z. Wu, A. Kanazawa, P. Abbeel, C. Sferrazza, C. K. Liu, R. Duan, and G. Shi. Omniretarget: Interaction-preserving data generation for humanoid whole-body loco-manipulation and scene interaction. arXiv preprint arXiv:2509.26633, 2025
  5. Z. Zhang, H. Lu, Y. Lian, Z. Chen, Y. Liu, C. Lin, H. Xue, Z. Zeng, Z. Qi, S. Zheng, et al. Learning athletic humanoid tennis skills from imperfect human motion data. arXiv preprint arXiv:2603.12686, 2026
  6. Z. Su, B. Zhang, N. Rahmanian, Y. Gao, Q. Liao, C. Regan, K. Sreenath, and S. S. Sastry. Hitter: A humanoid table tennis robot via hierarchical planning and learning. arXiv preprint arXiv:2508.21043, 2025
  7. Y. Wang, Q. Zhao, Y. F. Lau, R. Yu, H. W. Tsui, Q. Chen, J. Wang, J. Pang, and P. Tan. Humanx: Toward agile and generalizable humanoid interaction skills from human videos. arXiv preprint arXiv:2602.02473, 2026
  8. X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. Amp: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (ToG), 40(4):1–20, 2021
  9. X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne. Deepmimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions On Graphics (TOG), 37(4):1–14, 2018
  10. Q. Liao, T. E. Truong, X. Huang, Y. Gao, G. Tevet, K. Sreenath, and C. K. Liu. Beyondmimic: From motion tracking to versatile humanoid control via guided diffusion. arXiv preprint arXiv:2508.08241, 2025
  11. J. Hwangbo, J. Lee, and M. Hutter. Per-contact iteration method for solving contact dynamics. IEEE Robotics and Automation Letters, 3(2):895–902, 2018
  12. N. Rudin, D. Hoeller, P. Reist, and M. Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on robot learning, pages 91–100. PMLR, 2022
  13. T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner, et al. Learning agile soccer skills for a bipedal robot with deep reinforcement learning. Science Robotics, 9(89):eadi8022, 2024
  14. Y. Ma, A. Cramariuc, F. Farshidian, and M. Hutter. Learning coordinated badminton skills for legged manipulators. Science robotics, 10(102):eadu3922, 2025
  15. H. Wang, Z. Wang, J. Ren, Q. Ben, T. Huang, W. Zhang, and J. Pang. Beamdojo: Learning agile humanoid locomotion on sparse footholds. arXiv preprint arXiv:2502.10363, 2025
  16. Z. Wu, X. Huang, L. Yang, Y. Zhang, K. Sreenath, X. Chen, P. Abbeel, R. Duan, A. Kanazawa, C. Sferrazza, et al. Perceptive humanoid parkour: Chaining dynamic human skills via motion matching. arXiv preprint arXiv:2602.15827, 2026
  17. Y. Ze, S. Zhao, W. Wang, A. Kanazawa, R. Duan, P. Abbeel, G. Shi, J. Wu, and C. K. Liu. Twist2: Scalable, portable, and holistic humanoid data collection system. arXiv preprint arXiv:2511.02832, 2025
  18. R. Yu, H. Park, and J. Lee. Human dynamics from monocular video with dynamic camera movements. ACM Transactions on Graphics (TOG), 40(6):1–14, 2021
  19. Z. Luo, J. Cao, J. Merel, A. Winkler, J. Huang, K. Kitani, and W. Xu. Universal humanoid motion representations for physics-based control. In International Conference on Learning Representations, volume 2024, pages 56766–56782, 2024
  20. S. Peng, Y. Zhang, Y. Xu, Q. Wang, Q. Shuai, H. Bao, and X. Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9054–9063, 2021
  21. G. Moon, T. Shiratori, and S. Saito. Expressive whole-body 3d gaussian avatar. In European Conference on Computer Vision, pages 19–35. Springer, 2024
  22. J. Rajasegaran, G. Pavlakos, A. Kanazawa, and J. Malik. Tracking people by predicting 3d appearance, location and pose. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2740–2749, 2022
  23. D. C. Luvizon, D. Picard, and H. Tabia. 2d/3d pose estimation and action recognition using multitask deep learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5137–5146, 2018
  24. J. Rajasegaran, G. Pavlakos, A. Kanazawa, C. Feichtenhofer, and J. Malik. On the benefits of 3d pose and tracking for human action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 640–649, 2023
  25. H. Choi, G. Moon, J. Y. Chang, and K. M. Lee. Beyond static features for temporally consistent 3d human pose and shape from a video, 2021. URL https://arxiv.org/abs/2011.08627
  26. A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik. Learning 3d human dynamics from video. URL https://arxiv.org/abs/1812.01601
  28. Y. Wang, Y. Sun, P. Patel, K. Daniilidis, M. J. Black, and M. Kocabas. Prompthmr: Promptable human mesh recovery, 2025. URL https://arxiv.org/abs/2504.06397
  29. Y. Wang, Z. Wang, L. Liu, and K. Daniilidis. Tram: Global trajectory and motion of 3d humans from in-the-wild videos, 2024. URL https://arxiv.org/abs/2403.17346
  30. Z. Shen, H. Pi, Y. Xia, Z. Cen, S. Peng, Z. Hu, H. Bao, R. Hu, and X. Zhou. World-grounded human motion recovery via gravity-view coordinates. In SIGGRAPH Asia 2024 Conference Papers, SA ’24, page 1–11. ACM, Dec. 2024. doi:10.1145/3680528.3687565. URL http://dx.doi.org/10.1145/3680528.3687565
  31. J. Li, J. Cao, H. Zhang, D. Rempe, J. Kautz, U. Iqbal, and Y. Yuan. Genmo: A generalist model for human motion, 2025. URL https://arxiv.org/abs/2505.01425
  32. Y. Yuan, S.-E. Wei, T. Simon, K. Kitani, and J. Saragih. Simpoe: Simulated character control for 3d human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7159–7169, 2021
  33. Y. Yuan, V. Makoviychuk, Y. Guo, S. Fidler, X. Peng, and K. Fatahalian. Learning physically simulated tennis skills from broadcast videos. ACM Trans. Graph, 42(4):66, 2023
  34. N. Ugrinovic, B. Pan, G. Pavlakos, D. Paschalidou, B. Shen, J. Sanchez-Riera, F. Moreno-Noguer, and L. Guibas. Multiphys: Multi-person physics-aware 3d motion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2331–2340, 2024
  35. S. Zhang, Y. Zhang, F. Bogo, M. Pollefeys, and S. Tang. Learning motion priors for 4d human body capture in 3d scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11343–11353, 2021
  36. J. Li, S. Bian, C. Xu, G. Liu, G. Yu, and C. Lu. D&d: Learning human dynamics from dynamic camera. In European Conference on Computer Vision, pages 479–496. Springer, 2022
  37. Q. Wang, M. Zhu, R. Hou, K. Gillespie, A. Zhu, S. Wang, Y. Wang, G. I. Fernandez, Y. Liu, C. Togashi, et al. A hierarchical, model-based system for high-performance humanoid soccer. arXiv preprint arXiv:2512.09431, 2025
  38. J. Ren, J. Long, T. Huang, H. Wang, Z. Wang, F. Jia, W. Zhang, J. Wang, P. Luo, and J. Pang. Humanoid goalkeeper: Learning from position conditioned task-motion constraints. arXiv preprint arXiv:2510.18002, 2025
  39. Z. Luo, J. Wang, K. Liu, H. Zhang, C. Tessler, J. Wang, Y. Yuan, J. Cao, Z. Lin, F. Wang, et al. Smplolympics: Sports environments for physically simulated humanoids. arXiv preprint arXiv:2407.00187, 2024
  40. L. Liu and J. Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. Acm transactions on graphics (tog), 37(4):1–14, 2018
  41. C. Liu, L. Jiang, Y. Wang, K. Yao, J. Fu, and X. Ren. Humanoid whole-body badminton via multi-stage reinforcement learning. arXiv preprint arXiv:2511.11218, 2025
  42. M. Kim, E. Jung, and Y. Lee. Physicsfc: Learning user-controlled skills for a physics-based football player controller. ACM Transactions on Graphics (TOG), 44(4):1–21, 2025
  43. Y. Chen, S. Dong, X. Ji, J. Sun, Z. Luo, L. Zhao, J. Zhang, W. Li, J. Ma, B. Xu, et al. Learning human-like badminton skills for humanoid robots. arXiv preprint arXiv:2602.08370, 2026
  44. S. Calinon and A. G. Billard. What is the teacher’s role in robot programming by demonstration? – toward benchmarks for improved learning. Interaction Studies, 8(3):441–464, 2007. doi:10.1075/is.8.3.08cal
  45. C. L. Nehaniv and K. Dautenhahn. The correspondence problem. In K. Dautenhahn and C. L. Nehaniv, editors, Imitation in Animals and Artifacts, pages 41–61. MIT Press, Cambridge, MA, 2002. ISBN 9780262042031
  46. G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image, 2019. URL https://arxiv.org/abs/1904.05866
  47. C. Yeung, T. Suzuki, R. Tanaka, Z. Yin, and K. Fujii. Athletepose3d: A benchmark dataset for 3d human pose estimation and kinematic validation in athletic movements, 2025. URL https://arxiv.org/abs/2503.07499
  48. I. Demler, X. Xie, B. Werner, A. Szczuka, and P. Perona. Caltennis: Large multi-view tennis video dataset and benchmark of monocular-to-3d pose estimation, 2026. URL https://arxiv.org/abs/2606.20542
  49. J. P. Araujo, Y. Ze, P. Xu, J. Wu, and C. K. Liu. Retargeting matters: General motion retargeting for humanoid motion tracking. arXiv preprint arXiv:2510.02252, 2025
  50. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
  51. A. Bronars, Y. Park, and P. Agrawal. Tune to learn: How controller gains shape robot policy learning. arXiv preprint arXiv:2604.02523, 2026
  52. G. Welch, G. Bishop, et al. An introduction to the kalman filter. 1995
  53. D. Nguyen, K. D. Cancio, and S. Kim. High speed robotic table tennis swinging using lightweight hardware with model predictive control. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 15278–15284. IEEE, 2025
  54. Y. Wang, Q. Zhao, R. Yu, H. W. Tsui, A. Zeng, J. Lin, Z. Luo, J. Yu, X. Li, Q. Chen, et al. Skillmimic: Learning basketball interaction skills from demonstrations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17540–17549, 2025
  55. Y. Xu, J. Zhang, Q. Zhang, and D. Tao. Vitpose: Simple vision transformer baselines for human pose estimation, 2022. URL https://arxiv.org/abs/2204.12484
  56. N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C.-Y. Wu, R. Girshick, P. Dollár, and C. Feichtenhofer. Sam 2: Segment anything in images and videos, 2024. URL https://arxiv.org/abs/2408.00714
  57. Z. Teed and J. Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. URL https://arxiv.org/abs/2108.10869
  59. S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth, 2023. URL https://arxiv.org/abs/2302.12288
  60. K. Zakka, Q. Liao, B. Yi, L. L. Lay, K. Sreenath, and P. Abbeel. mjlab: A lightweight framework for gpu-accelerated robot learning. arXiv preprint arXiv:2601.22074, 2026

A Notation Table

Table 4: Notation and Definitions

Variable    Definition
Video capture and camera calibration
N           Number of cameras capturing the scene
c_i         The i-th camera
V_i         Video collect...