1000 Rallies: An Event-Camera Dataset and Real-Time Learned Ball-State Estimation for Robotic Table Tennis
Pith reviewed 2026-06-25 20:59 UTC · model grok-4.3
The pith
Event cameras with a velocity-aware Kalman filter reduce robotic table tennis bounce prediction error by 36 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Over 1000 table tennis rallies captured with an event camera and 14 synchronized 200 FPS frame cameras yield 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. A convolutional network robust to background player motion is trained to regress image-plane ball position and velocity directly from events. Supplying the predicted velocity as an additional measurement to the Kalman filter lowers bounce-point prediction error by 36 percent compared with position-only tracking. The same pipeline is coupled to a Stäubli arm, enabling sustained real-time rallies between the robot and human players.
What carries the argument
Convolutional neural network that jointly regresses image-plane ball position and velocity from the event stream, with its velocity output treated as an extra measurement inside the Kalman filter for 3D trajectory and bounce prediction.
If this is right
- The system achieves high-frequency ball-state estimates without the motion blur or data-volume penalty of fast frame cameras.
- Velocity-augmented filtering improves trajectory forecasts enough to support timely robotic control actions.
- The dataset supplies the first public benchmark for event-based perception under realistic, high-speed sports conditions.
- Integration with the robotic arm demonstrates that event-driven perception can close the full perception-action loop at table-tennis speeds.
Where Pith is reading between the lines
- Similar event-based pipelines could be applied to other fast-moving objects where frame cameras lose temporal resolution.
- The lower data rate of event streams may allow the same accuracy on embedded hardware that currently requires frame cameras.
- Training across amateur-to-elite players may produce models that generalize better to novel opponents or lighting.
Load-bearing premise
The 1 kHz labels produced by the 14 synchronized 200 FPS frame cameras are accurate and unbiased enough to serve as training targets for the event-based model.
What would settle it
A controlled test in which the reported 36 percent reduction in bounce-point error disappears or the robot fails to return balls in closed-loop rallies with human players.
Figures
read the original abstract
Robotic table tennis has emerged as a compelling benchmark for real-time robotic perception due to its fast ball dynamics and stringent timing requirements. Accurate, high-frequency, and low-latency ball state estimation is critical for reliable trajectory prediction and timely control. Traditional frame-based cameras face an inherent trade-off: low frame rates leave temporal blind spots that miss fast-moving objects and high frame rates raise data and computational cost. Event cameras instead offer microsecond temporal resolution and, under sufficient illumination, remain largely free of motion blur even at high ball speeds. However, the community lacks large-scale datasets to develop and benchmark event-based perception in realistic sports scenarios. We address this gap by introducing the first large-scale event-camera dataset for table tennis, comprising over 1000 rallies from a diverse group of players ranging from amateurs to elite-level athletes. Each recording captures the event stream alongside 14 synchronized high-speed frame-based cameras at 200 FPS, which we use to produce 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. Building on this dataset, we train a convolutional neural network robust to background player motion that jointly estimates the ball's position and velocity in the image-plane from events. Treating the predicted velocity as an additional measurement in the Kalman filter reduces bounce-point prediction error by 36% relative to a position-only baseline. Finally, we close the perception-action loop by integrating the event-based system with a St\"aubli robotic arm, enabling the first real-time human-robot table tennis rallies driven by event-based perception.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first large-scale event-camera dataset for table tennis comprising over 1000 rallies, captured alongside 14 synchronized 200 FPS frame-based cameras that generate 1 kHz pseudo ground-truth labels for ball position, velocity, and spin. It trains a CNN to jointly estimate ball position and velocity in the image plane from events (robust to background motion), shows that feeding the predicted velocity as an additional measurement into a Kalman filter yields a 36% reduction in bounce-point prediction error versus a position-only baseline, and integrates the system with a Stäubli robotic arm to achieve the first real-time event-driven human-robot table tennis rallies.
Significance. If the quantitative claims hold, the work is significant as the first large public event-camera benchmark for high-speed sports perception and as a demonstration of closing the perception-action loop with event-based sensing on a real robot. The dataset scale, the learned velocity estimation, and the robot integration are concrete advances that address a documented gap in the field.
major comments (2)
- [Abstract (dataset creation)] Abstract, paragraph on dataset creation: the claim that 14 synchronized 200 FPS cameras produce reliable 1 kHz pseudo ground-truth labels for position, velocity, and spin supplies no interpolation equations, triangulation details, error bounds, or validation against independent high-speed references. Because both CNN training and the reported 36% error reduction rest on these labels (especially near bounces and under spin), the absence of quantified accuracy undermines the central experimental claims.
- [Abstract (results)] Abstract, results paragraph: the 36% bounce-point error reduction is stated relative to a position-only baseline, yet the manuscript provides no description of the baseline implementation, evaluation data splits, error bars, or statistical significance. Without these, it is impossible to determine whether the improvement is robust or load-bearing for the Kalman-filter contribution.
minor comments (1)
- [Methods] Clarify in the methods section how the CNN architecture handles background player motion and whether any ablation isolates the contribution of velocity prediction.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. The two major comments highlight areas where the manuscript can be strengthened by adding missing technical details. We will revise accordingly.
read point-by-point responses
-
Referee: [Abstract (dataset creation)] Abstract, paragraph on dataset creation: the claim that 14 synchronized 200 FPS cameras produce reliable 1 kHz pseudo ground-truth labels for position, velocity, and spin supplies no interpolation equations, triangulation details, error bounds, or validation against independent high-speed references. Because both CNN training and the reported 36% error reduction rest on these labels (especially near bounces and under spin), the absence of quantified accuracy undermines the central experimental claims.
Authors: We agree that the abstract omits these details and that the manuscript should provide them to support the central claims. The full paper contains a methods section describing the 14-camera setup, multi-view triangulation, and linear interpolation to 1 kHz, but we acknowledge the absence of explicit error bounds and independent validation. We will add a new subsection (or expanded paragraph) in the dataset section that reports triangulation residuals, cross-validation against a subset of high-speed camera frames, and any available error statistics, particularly around bounce events. This revision will directly address the concern about label reliability for CNN training and bounce prediction. revision: yes
-
Referee: [Abstract (results)] Abstract, results paragraph: the 36% bounce-point error reduction is stated relative to a position-only baseline, yet the manuscript provides no description of the baseline implementation, evaluation data splits, error bars, or statistical significance. Without these, it is impossible to determine whether the improvement is robust or load-bearing for the Kalman-filter contribution.
Authors: We agree that the abstract (and results section) lacks sufficient implementation and evaluation details for the 36% claim. The position-only baseline is a standard Kalman filter that uses only the CNN-derived position measurements; the improved version augments it with the CNN velocity estimate as an additional measurement. We will expand the experimental results section to include: (1) a precise description of the baseline filter equations and tuning, (2) the rally-level train/validation/test splits used, (3) error bars computed across multiple folds or independent test rallies, and (4) a statistical significance test (e.g., paired t-test) on the bounce-point error reduction. These additions will make the Kalman-filter contribution verifiable and will be reflected in an updated abstract. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's claims rest on dataset creation via independent 200 FPS frame cameras for pseudo ground-truth, CNN training on that data, and experimental measurement of 36% error reduction against a position-only Kalman baseline. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear; the velocity prediction is learned from external labels and validated empirically rather than reducing to its own inputs by construction. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- CNN model parameters
axioms (1)
- domain assumption The pseudo ground-truth from 14 synchronized 200 FPS cameras is sufficiently accurate for 1 kHz ball state labels.
Reference graph
Works this paper leans on
-
[1]
A learning approach to robotic table tennis,
M. Matsushima, T. Hashimoto, M. Takeuchi, and F. Miyazaki, “A learning approach to robotic table tennis,”IEEE Transactions on Robotics, vol. 21, no. 4, pp. 767–771, 2005
2005
-
[2]
Achieving human level competitive robot table tennis,
D. B. D’Ambrosio, S. W. Abeyruwan, L. Graesser, A. Iscen, H. B. Amor, A. Bewley, B. Reed, K. Reymann, L. Takayama, Y . Tassa et al., “Achieving human level competitive robot table tennis,” in 7th Robot Learning Workshop: Towards Robots with Human-Level Abilities, 2024
2024
-
[3]
A table tennis robot system using an industrial kuka robot arm,
J. Tebbe, Y . Gao, M. Sastre-Rienietz, and A. Zell, “A table tennis robot system using an industrial kuka robot arm,” inPattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 40. Springer, 2019, pp. 33–45
2018
-
[4]
Speed and spin characteristics of the 40mm table tennis ball,
H. Tang, “Speed and spin characteristics of the 40mm table tennis ball,”Table Tennis Sciences, vol. 4, pp. 278–284, 2002
2002
-
[5]
Achieving human level competitive robot table tennis,
D. B. DAmbrosio, S. Abeyruwan, L. Graesser, A. Iscen, H. B. Amor, A. Bewley, B. J. Reed, K. Reymann, L. Takayama, Y . Tassaet al., “Achieving human level competitive robot table tennis,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 74–82
2025
-
[6]
Hitter: A humanoid table ten- nis robot via hierarchical planning and learning,
Z. Su, B. Zhang, N. Rahmanian, Y . Gao, Q. Liao, C. Regan, K. Sreenath, and S. S. Sastry, “Hitter: A humanoid table ten- nis robot via hierarchical planning and learning,”arXiv preprint arXiv:2508.21043, 2025
arXiv 2025
-
[7]
Learning coor- dinated badminton skills for legged manipulators,
Y . Ma, A. Cramariuc, F. Farshidian, and M. Hutter, “Learning coor- dinated badminton skills for legged manipulators,”Science robotics, vol. 10, no. 102, p. eadu3922, 2025
2025
-
[8]
Integrating learning-based manipulation and physics- based locomotion for whole-body badminton robot control,
H. Wang, Z. Shi, C. Zhu, Y . Qiao, C. Zhang, F. Yang, P. Ren, L. Lu, and D. Xuan, “Integrating learning-based manipulation and physics- based locomotion for whole-body badminton robot control,” in2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 15 114–15 120
2025
-
[9]
Outplaying elite table tennis players with an autonomous robot,
P. D ¨urr, M. E. Gheche, G. J. Maeda, N. Mukai, N. Takahashi, and et al., “Outplaying elite table tennis players with an autonomous robot,” Nature, vol. 652, pp. 886–891, 2026
2026
-
[10]
Event- based vision: A survey,
G. Gallego, T. Delbr ¨uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidiset al., “Event- based vision: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 154–180, 2020
2020
-
[11]
An event- based perception pipeline for a table tennis robot,
A. Ziegler, T. Gossard, A. Glover, and A. Zell, “An event- based perception pipeline for a table tennis robot,”arXiv preprint arXiv:2502.00749, 2025
arXiv 2025
-
[12]
Egocentric event-based vision for ping pong ball trajectory prediction,
I. Alberico, M. Cannici, G. Cioffi, and D. Scaramuzza, “Egocentric event-based vision for ping pong ball trajectory prediction,” inPro- ceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5025–5034
2025
-
[13]
Spindoe: A ball spin estimation method for table tennis robot,
T. Gossard, J. Tebbe, A. Ziegler, and A. Zell, “Spindoe: A ball spin estimation method for table tennis robot,”arXiv preprint arXiv:2303.03879, 2023
arXiv 2023
-
[14]
Ttnet: Real-time temporal and spatial video analysis of table tennis,
R. V oeikov, N. Falaleev, and R. Baikulov, “Ttnet: Real-time temporal and spatial video analysis of table tennis,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 884–885
2020
-
[15]
Real time trajectory prediction using deep conditional generative models,
S. Gomez-Gonzalez, S. Prokudin, B. Sch ¨olkopf, and J. Peters, “Real time trajectory prediction using deep conditional generative models,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 970–976, 2020
2020
-
[16]
Spikemot: Event- based multi-object tracking with sparse motion features,
S. Wang, Z. Wang, C. Li, X. Qi, and H. K.-H. So, “Spikemot: Event- based multi-object tracking with sparse motion features,”IEEE Access, vol. 13, pp. 214–230, 2024
2024
-
[17]
Evtracker: An event-driven spatiotemporal method for dynamic object tracking,
S. Zhang, W. Wang, H. Li, and S. Zhang, “Evtracker: An event-driven spatiotemporal method for dynamic object tracking,”Sensors, vol. 22, no. 16, p. 6090, 2022
2022
-
[18]
Event-based visual tracking in dynamic environments,
I. Perez-Salesa, R. Aldana-L ´opez, and C. Sag ¨u´es, “Event-based visual tracking in dynamic environments,” inIberian Robotics conference. Springer, 2022, pp. 175–186
2022
-
[19]
A density-based algorithm for discovering clusters in large spatial databases with noise,
M. Ester, H.-P. Kriegel, J. Sander, X. Xuet al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” inkdd, vol. 96, no. 34, 1996, pp. 226–231
1996
-
[20]
Event-based high-speed ball detection in sports video,
T. Nakabayashi, A. Kondo, K. Higa, A. Girbau, S. Satoh, and H. Saito, “Event-based high-speed ball detection in sports video,” in Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports, 2023, pp. 55–62
2023
-
[21]
Ev-catcher: High-speed object catching using low-latency event-based neural networks,
Z. Wang, F. Cladera, A. Bisulco, D. Lee, C. J. Taylor, K. Daniilidis, M. A. Hsieh, D. D. Lee, and V . Isler, “Ev-catcher: High-speed object catching using low-latency event-based neural networks,”IEEE Robotics and Automation Letters, vol. 7, no. 4, pp. 8737–8744, 2022
2022
-
[22]
Event- based agile object catching with a quadrupedal robot,
B. Forrai, T. Miki, D. Gehrig, M. Hutter, and D. Scaramuzza, “Event- based agile object catching with a quadrupedal robot,”arXiv preprint arXiv:2303.17479, 2023
arXiv 2023
-
[23]
Detection of fast-moving objects with neuromorphic hardware,
A. Ziegler, K. Vetter, T. Gossard, J. Tebbe, S. Otte, and A. Zell, “Detection of fast-moving objects with neuromorphic hardware,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 8709–8717
2025
-
[24]
H. Wang, C. Hou, X. Li, Y . Fu, C. Li, N. Chen, G. Dai, J. Liu, T. Huang, and S. Zhang, “Spikepingpong: High-frequency spike vision-based robot learning for precise striking in table tennis game,” arXiv preprint arXiv:2506.06690, 2025
arXiv 2025
-
[25]
Jerk-limited real-time trajectory genera- tion with arbitrary target states,
L. Berscheid and T. Kr ¨oger, “Jerk-limited real-time trajectory genera- tion with arbitrary target states,”Robotics: Science and Systems XVII, 2021
2021
-
[26]
Learning to detect objects with a 1 megapixel event camera,
E. Perot, P. De Tournemire, D. Nitti, J. Masci, and A. Sironi, “Learning to detect objects with a 1 megapixel event camera,”Advances in Neural Information Processing Systems, vol. 33, pp. 16 639–16 652, 2020
2020
-
[27]
Dvs benchmark datasets for object tracking, action recognition, and object recognition,
Y . Hu, H. Liu, M. Pfeiffer, and T. Delbruck, “Dvs benchmark datasets for object tracking, action recognition, and object recognition,”Fron- tiers in neuroscience, vol. 10, p. 405, 2016
2016
-
[28]
Pedro: an event-based dataset for person detection in robotics,
C. Boretti, P. Bich, F. Pareschi, L. Prono, R. Rovatti, and G. Setti, “Pedro: an event-based dataset for person detection in robotics,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 4065–4070
2023
-
[29]
Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,
X. Wang, S. Wang, C. Tang, L. Zhu, B. Jiang, Y . Tian, and J. Tang, “Event stream-based visual object tracking: A high-resolution bench- mark dataset and a novel baseline,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 248–19 257
2024
-
[30]
Unified temporal and spatial calibration for multi-sensor systems,
P. Furgale, J. Rehder, and R. Siegwart, “Unified temporal and spatial calibration for multi-sensor systems,” in2013 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 1280– 1286
2013
-
[31]
How to calibrate your event camera,
M. Muglikar, M. Gehrig, D. Gehrig, and D. Scaramuzza, “How to calibrate your event camera,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1403–1409
2021
-
[32]
Spin detection in robotic table tennis,
J. Tebbe, L. Klamt, Y . Gao, and A. Zell, “Spin detection in robotic table tennis,” in2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 9694–9700
2020
-
[33]
Yolov4: Op- timal speed and accuracy of object detection,
A. Bochkovskiy, C.-Y . Wang, and H.-Y . M. Liao, “Yolov4: Op- timal speed and accuracy of object detection,”arXiv preprint arXiv:2004.10934, 2020
Pith/arXiv arXiv 2004
-
[34]
Robotic table tennis based on physical models of aerodynamics and rebounds,
A. Nakashima, Y . Ogawa, C. Liu, and Y . Hayakawa, “Robotic table tennis based on physical models of aerodynamics and rebounds,” in 2011 IEEE International Conference on Robotics and Biomimetics. IEEE, 2011, pp. 2348–2354
2011
-
[35]
Hots: a hierarchy of event-based time-surfaces for pattern recogni- tion,
X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, “Hots: a hierarchy of event-based time-surfaces for pattern recogni- tion,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1346–1359, 2016
2016
-
[36]
TX2-60L Industrial Robot,
St ¨aubli Robotics, “TX2-60L Industrial Robot,” https://www.staubli. com, accessed: 2026-02-12
2026
-
[37]
Robust visual tracking with a freely- moving event camera. in 2017 ieee,
A. Glover and C. Bartolozzi, “Robust visual tracking with a freely- moving event camera. in 2017 ieee,” inRSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 3769–3776
2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.