pith. sign in

arxiv: 2606.25620 · v1 · pith:IES5IYSCnew · submitted 2026-06-24 · 💻 cs.RO · cs.CV· eess.IV

1000 Rallies: An Event-Camera Dataset and Real-Time Learned Ball-State Estimation for Robotic Table Tennis

Pith reviewed 2026-06-25 20:59 UTC · model grok-4.3

classification 💻 cs.RO cs.CVeess.IV
keywords event cameratable tennisrobotic perceptionball state estimationKalman filterdatasetconvolutional neural networktrajectory prediction
0
0 comments X

The pith

Event cameras with a velocity-aware Kalman filter reduce robotic table tennis bounce prediction error by 36 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first large-scale event-camera dataset for table tennis, built from more than 1000 rallies recorded simultaneously with 14 high-speed frame cameras. A convolutional network is trained on the event streams to produce joint estimates of ball position and velocity in the image plane. When the velocity output is supplied as an extra measurement to a Kalman filter, the error at the predicted bounce point drops 36 percent relative to a position-only baseline. The resulting perception pipeline is integrated with a robotic arm to close the loop and produce the first real-time human-robot table tennis rallies driven solely by event-based sensing.

Core claim

Over 1000 table tennis rallies captured with an event camera and 14 synchronized 200 FPS frame cameras yield 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. A convolutional network robust to background player motion is trained to regress image-plane ball position and velocity directly from events. Supplying the predicted velocity as an additional measurement to the Kalman filter lowers bounce-point prediction error by 36 percent compared with position-only tracking. The same pipeline is coupled to a Stäubli arm, enabling sustained real-time rallies between the robot and human players.

What carries the argument

Convolutional neural network that jointly regresses image-plane ball position and velocity from the event stream, with its velocity output treated as an extra measurement inside the Kalman filter for 3D trajectory and bounce prediction.

If this is right

  • The system achieves high-frequency ball-state estimates without the motion blur or data-volume penalty of fast frame cameras.
  • Velocity-augmented filtering improves trajectory forecasts enough to support timely robotic control actions.
  • The dataset supplies the first public benchmark for event-based perception under realistic, high-speed sports conditions.
  • Integration with the robotic arm demonstrates that event-driven perception can close the full perception-action loop at table-tennis speeds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar event-based pipelines could be applied to other fast-moving objects where frame cameras lose temporal resolution.
  • The lower data rate of event streams may allow the same accuracy on embedded hardware that currently requires frame cameras.
  • Training across amateur-to-elite players may produce models that generalize better to novel opponents or lighting.

Load-bearing premise

The 1 kHz labels produced by the 14 synchronized 200 FPS frame cameras are accurate and unbiased enough to serve as training targets for the event-based model.

What would settle it

A controlled test in which the reported 36 percent reduction in bounce-point error disappears or the robot fails to return balls in closed-loop rallies with human players.

Figures

Figures reproduced from arXiv: 2606.25620 by Asude Aydin, Claudio Fanconi, Naoya Takahashi, Peter D\"urr, Raphaela Kreiser, Yin Bi.

Figure 1
Figure 1. Figure 1: Three event cameras capture the playing area at high [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: 3D schematic of the multi-camera setup for table tennis data collection, featuring four event cameras [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of EVS update rate on EKF-based trajectory [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Robotic table tennis has emerged as a compelling benchmark for real-time robotic perception due to its fast ball dynamics and stringent timing requirements. Accurate, high-frequency, and low-latency ball state estimation is critical for reliable trajectory prediction and timely control. Traditional frame-based cameras face an inherent trade-off: low frame rates leave temporal blind spots that miss fast-moving objects and high frame rates raise data and computational cost. Event cameras instead offer microsecond temporal resolution and, under sufficient illumination, remain largely free of motion blur even at high ball speeds. However, the community lacks large-scale datasets to develop and benchmark event-based perception in realistic sports scenarios. We address this gap by introducing the first large-scale event-camera dataset for table tennis, comprising over 1000 rallies from a diverse group of players ranging from amateurs to elite-level athletes. Each recording captures the event stream alongside 14 synchronized high-speed frame-based cameras at 200 FPS, which we use to produce 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. Building on this dataset, we train a convolutional neural network robust to background player motion that jointly estimates the ball's position and velocity in the image-plane from events. Treating the predicted velocity as an additional measurement in the Kalman filter reduces bounce-point prediction error by 36% relative to a position-only baseline. Finally, we close the perception-action loop by integrating the event-based system with a St\"aubli robotic arm, enabling the first real-time human-robot table tennis rallies driven by event-based perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the first large-scale event-camera dataset for table tennis comprising over 1000 rallies, captured alongside 14 synchronized 200 FPS frame-based cameras that generate 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. It trains a CNN to jointly estimate ball position and velocity in the image plane from events (robust to background motion), shows that feeding the predicted velocity as an additional measurement into a Kalman filter yields a 36% reduction in bounce-point prediction error versus a position-only baseline, and integrates the system with a Stäubli robotic arm to achieve the first real-time event-driven human-robot table tennis rallies.

Significance. If the quantitative claims hold, the work is significant as the first large public event-camera benchmark for high-speed sports perception and as a demonstration of closing the perception-action loop with event-based sensing on a real robot. The dataset scale, the learned velocity estimation, and the robot integration are concrete advances that address a documented gap in the field.

major comments (2)
  1. [Abstract (dataset creation)] Abstract, paragraph on dataset creation: the claim that 14 synchronized 200 FPS cameras produce reliable 1 kHz pseudo ground-truth labels for position, velocity, and spin supplies no interpolation equations, triangulation details, error bounds, or validation against independent high-speed references. Because both CNN training and the reported 36% error reduction rest on these labels (especially near bounces and under spin), the absence of quantified accuracy undermines the central experimental claims.
  2. [Abstract (results)] Abstract, results paragraph: the 36% bounce-point error reduction is stated relative to a position-only baseline, yet the manuscript provides no description of the baseline implementation, evaluation data splits, error bars, or statistical significance. Without these, it is impossible to determine whether the improvement is robust or load-bearing for the Kalman-filter contribution.
minor comments (1)
  1. [Methods] Clarify in the methods section how the CNN architecture handles background player motion and whether any ablation isolates the contribution of velocity prediction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. The two major comments highlight areas where the manuscript can be strengthened by adding missing technical details. We will revise accordingly.

read point-by-point responses
  1. Referee: [Abstract (dataset creation)] Abstract, paragraph on dataset creation: the claim that 14 synchronized 200 FPS cameras produce reliable 1 kHz pseudo ground-truth labels for position, velocity, and spin supplies no interpolation equations, triangulation details, error bounds, or validation against independent high-speed references. Because both CNN training and the reported 36% error reduction rest on these labels (especially near bounces and under spin), the absence of quantified accuracy undermines the central experimental claims.

    Authors: We agree that the abstract omits these details and that the manuscript should provide them to support the central claims. The full paper contains a methods section describing the 14-camera setup, multi-view triangulation, and linear interpolation to 1 kHz, but we acknowledge the absence of explicit error bounds and independent validation. We will add a new subsection (or expanded paragraph) in the dataset section that reports triangulation residuals, cross-validation against a subset of high-speed camera frames, and any available error statistics, particularly around bounce events. This revision will directly address the concern about label reliability for CNN training and bounce prediction. revision: yes

  2. Referee: [Abstract (results)] Abstract, results paragraph: the 36% bounce-point error reduction is stated relative to a position-only baseline, yet the manuscript provides no description of the baseline implementation, evaluation data splits, error bars, or statistical significance. Without these, it is impossible to determine whether the improvement is robust or load-bearing for the Kalman-filter contribution.

    Authors: We agree that the abstract (and results section) lacks sufficient implementation and evaluation details for the 36% claim. The position-only baseline is a standard Kalman filter that uses only the CNN-derived position measurements; the improved version augments it with the CNN velocity estimate as an additional measurement. We will expand the experimental results section to include: (1) a precise description of the baseline filter equations and tuning, (2) the rally-level train/validation/test splits used, (3) error bars computed across multiple folds or independent test rallies, and (4) a statistical significance test (e.g., paired t-test) on the bounce-point error reduction. These additions will make the Kalman-filter contribution verifiable and will be reflected in an updated abstract. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's claims rest on dataset creation via independent 200 FPS frame cameras for pseudo ground-truth, CNN training on that data, and experimental measurement of 36% error reduction against a position-only Kalman baseline. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the velocity prediction is learned from external labels and validated empirically rather than reducing to its own inputs by construction. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on the accuracy of the pseudo-ground-truth generation process from 200 FPS cameras and the assumption that the CNN trained on this data generalizes to live robotic play without overfitting to the recorded players or conditions.

free parameters (1)
  • CNN model parameters
    Neural network weights fitted to the event data and pseudo-ground-truth labels from the new dataset.
axioms (1)
  • domain assumption The pseudo ground-truth from 14 synchronized 200 FPS cameras is sufficiently accurate for 1 kHz ball state labels.
    Invoked to train and evaluate the event-based estimator as described in the abstract.

pith-pipeline@v0.9.1-grok · 5839 in / 1372 out tokens · 30291 ms · 2026-06-25T20:59:55.580409+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 1 linked inside Pith

  1. [1]

    A learning approach to robotic table tennis,

    M. Matsushima, T. Hashimoto, M. Takeuchi, and F. Miyazaki, “A learning approach to robotic table tennis,”IEEE Transactions on Robotics, vol. 21, no. 4, pp. 767–771, 2005

  2. [2]

    Achieving human level competitive robot table tennis,

    D. B. D’Ambrosio, S. W. Abeyruwan, L. Graesser, A. Iscen, H. B. Amor, A. Bewley, B. Reed, K. Reymann, L. Takayama, Y . Tassa et al., “Achieving human level competitive robot table tennis,” in 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2024

  3. [3]

    A table tennis robot system using an industrial kuka robot arm,

    J. Tebbe, Y . Gao, M. Sastre-Rienietz, and A. Zell, “A table tennis robot system using an industrial kuka robot arm,” inPattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 40. Springer, 2019, pp. 33–45

  4. [4]

    Speed and spin characteristics of the 40mm table tennis ball,

    H. Tang, “Speed and spin characteristics of the 40mm table tennis ball,”Table Tennis Sciences, vol. 4, pp. 278–284, 2002

  5. [5]

    Achieving human level competitive robot table tennis,

    D. B. DAmbrosio, S. Abeyruwan, L. Graesser, A. Iscen, H. B. Amor, A. Bewley, B. J. Reed, K. Reymann, L. Takayama, Y . Tassaet al., “Achieving human level competitive robot table tennis,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 74–82

  6. [6]

    Hitter: A humanoid table ten- nis robot via hierarchical planning and learning,

    Z. Su, B. Zhang, N. Rahmanian, Y . Gao, Q. Liao, C. Regan, K. Sreenath, and S. S. Sastry, “Hitter: A humanoid table ten- nis robot via hierarchical planning and learning,”arXiv preprint arXiv:2508.21043, 2025

  7. [7]

    Learning coor- dinated badminton skills for legged manipulators,

    Y . Ma, A. Cramariuc, F. Farshidian, and M. Hutter, “Learning coor- dinated badminton skills for legged manipulators,”Science robotics, vol. 10, no. 102, p. eadu3922, 2025

  8. [8]

    Integrating learning-based manipulation and physics- based locomotion for whole-body badminton robot control,

    H. Wang, Z. Shi, C. Zhu, Y . Qiao, C. Zhang, F. Yang, P. Ren, L. Lu, and D. Xuan, “Integrating learning-based manipulation and physics- based locomotion for whole-body badminton robot control,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 15 114–15 120

  9. [9]

    Outplaying elite table tennis players with an autonomous robot,

    P. D ¨urr, M. E. Gheche, G. J. Maeda, N. Mukai, N. Takahashi, and et al., “Outplaying elite table tennis players with an autonomous robot,” Nature, vol. 652, pp. 886–891, 2026

  10. [10]

    Event- based vision: A survey,

    G. Gallego, T. Delbr ¨uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidiset al., “Event- based vision: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 154–180, 2020

  11. [11]

    An event- based perception pipeline for a table tennis robot,

    A. Ziegler, T. Gossard, A. Glover, and A. Zell, “An event- based perception pipeline for a table tennis robot,”arXiv preprint arXiv:2502.00749, 2025

  12. [12]

    Egocentric event-based vision for ping pong ball trajectory prediction,

    I. Alberico, M. Cannici, G. Cioffi, and D. Scaramuzza, “Egocentric event-based vision for ping pong ball trajectory prediction,” inPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5025–5034

  13. [13]

    Spindoe: A ball spin estimation method for table tennis robot,

    T. Gossard, J. Tebbe, A. Ziegler, and A. Zell, “Spindoe: A ball spin estimation method for table tennis robot,”arXiv preprint arXiv:2303.03879, 2023

  14. [14]

    Ttnet: Real-time temporal and spatial video analysis of table tennis,

    R. V oeikov, N. Falaleev, and R. Baikulov, “Ttnet: Real-time temporal and spatial video analysis of table tennis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 884–885

  15. [15]

    Real time trajectory prediction using deep conditional generative models,

    S. Gomez-Gonzalez, S. Prokudin, B. Sch ¨olkopf, and J. Peters, “Real time trajectory prediction using deep conditional generative models,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 970–976, 2020

  16. [16]

    Spikemot: Event- based multi-object tracking with sparse motion features,

    S. Wang, Z. Wang, C. Li, X. Qi, and H. K.-H. So, “Spikemot: Event- based multi-object tracking with sparse motion features,”IEEE Access, vol. 13, pp. 214–230, 2024

  17. [17]

    Evtracker: An event-driven spatiotemporal method for dynamic object tracking,

    S. Zhang, W. Wang, H. Li, and S. Zhang, “Evtracker: An event-driven spatiotemporal method for dynamic object tracking,”Sensors, vol. 22, no. 16, p. 6090, 2022

  18. [18]

    Event-based visual tracking in dynamic environments,

    I. Perez-Salesa, R. Aldana-L ´opez, and C. Sag ¨u´es, “Event-based visual tracking in dynamic environments,” inIberian Robotics conference. Springer, 2022, pp. 175–186

  19. [19]

    A density-based algorithm for discovering clusters in large spatial databases with noise,

    M. Ester, H.-P. Kriegel, J. Sander, X. Xuet al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” inkdd, vol. 96, no. 34, 1996, pp. 226–231

  20. [20]

    Event-based high-speed ball detection in sports video,

    T. Nakabayashi, A. Kondo, K. Higa, A. Girbau, S. Satoh, and H. Saito, “Event-based high-speed ball detection in sports video,” in Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023, pp. 55–62

  21. [21]

    Ev-catcher: High-speed object catching using low-latency event-based neural networks,

    Z. Wang, F. Cladera, A. Bisulco, D. Lee, C. J. Taylor, K. Daniilidis, M. A. Hsieh, D. D. Lee, and V . Isler, “Ev-catcher: High-speed object catching using low-latency event-based neural networks,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 8737–8744, 2022

  22. [22]

    Event- based agile object catching with a quadrupedal robot,

    B. Forrai, T. Miki, D. Gehrig, M. Hutter, and D. Scaramuzza, “Event- based agile object catching with a quadrupedal robot,”arXiv preprint arXiv:2303.17479, 2023

  23. [23]

    Detection of fast-moving objects with neuromorphic hardware,

    A. Ziegler, K. Vetter, T. Gossard, J. Tebbe, S. Otte, and A. Zell, “Detection of fast-moving objects with neuromorphic hardware,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 8709–8717

  24. [24]

    Spikepingpong: High-frequency spike vision-based robot learning for precise striking in table tennis game,

    H. Wang, C. Hou, X. Li, Y . Fu, C. Li, N. Chen, G. Dai, J. Liu, T. Huang, and S. Zhang, “Spikepingpong: High-frequency spike vision-based robot learning for precise striking in table tennis game,” arXiv preprint arXiv:2506.06690, 2025

  25. [25]

    Jerk-limited real-time trajectory genera- tion with arbitrary target states,

    L. Berscheid and T. Kr ¨oger, “Jerk-limited real-time trajectory genera- tion with arbitrary target states,”Robotics: Science and Systems XVII, 2021

  26. [26]

    Learning to detect objects with a 1 megapixel event camera,

    E. Perot, P. De Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,”Advances in Neural Information Processing Systems, vol. 33, pp. 16 639–16 652, 2020

  27. [27]

    Dvs benchmark datasets for object tracking, action recognition, and object recognition,

    Y . Hu, H. Liu, M. Pfeiffer, and T. Delbruck, “Dvs benchmark datasets for object tracking, action recognition, and object recognition,”Fron- tiers in neuroscience, vol. 10, p. 405, 2016

  28. [28]

    Pedro: an event-based dataset for person detection in robotics,

    C. Boretti, P. Bich, F. Pareschi, L. Prono, R. Rovatti, and G. Setti, “Pedro: an event-based dataset for person detection in robotics,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 4065–4070

  29. [29]

    Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,

    X. Wang, S. Wang, C. Tang, L. Zhu, B. Jiang, Y . Tian, and J. Tang, “Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 248–19 257

  30. [30]

    Unified temporal and spatial calibration for multi-sensor systems,

    P. Furgale, J. Rehder, and R. Siegwart, “Unified temporal and spatial calibration for multi-sensor systems,” in2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 1280– 1286

  31. [31]

    How to calibrate your event camera,

    M. Muglikar, M. Gehrig, D. Gehrig, and D. Scaramuzza, “How to calibrate your event camera,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1403–1409

  32. [32]

    Spin detection in robotic table tennis,

    J. Tebbe, L. Klamt, Y . Gao, and A. Zell, “Spin detection in robotic table tennis,” in2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 9694–9700

  33. [33]

    Yolov4: Op- timal speed and accuracy of object detection,

    A. Bochkovskiy, C.-Y . Wang, and H.-Y . M. Liao, “Yolov4: Op- timal speed and accuracy of object detection,”arXiv preprint arXiv:2004.10934, 2020

  34. [34]

    Robotic table tennis based on physical models of aerodynamics and rebounds,

    A. Nakashima, Y . Ogawa, C. Liu, and Y . Hayakawa, “Robotic table tennis based on physical models of aerodynamics and rebounds,” in 2011 IEEE International Conference on Robotics and Biomimetics. IEEE, 2011, pp. 2348–2354

  35. [35]

    Hots: a hierarchy of event-based time-surfaces for pattern recogni- tion,

    X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, “Hots: a hierarchy of event-based time-surfaces for pattern recogni- tion,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1346–1359, 2016

  36. [36]

    TX2-60L Industrial Robot,

    St ¨aubli Robotics, “TX2-60L Industrial Robot,” https://www.staubli. com, accessed: 2026-02-12

  37. [37]

    Robust visual tracking with a freely- moving event camera. in 2017 ieee,

    A. Glover and C. Bartolozzi, “Robust visual tracking with a freely- moving event camera. in 2017 ieee,” inRSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 3769–3776