pith. machine review for the scientific record.

arxiv: 2605.12297 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.RO · eess.IV

Recognition: 2 theorem links · Lean Theorem

EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 05:48 UTC · model grok-4.3

classification 💻 cs.CV · cs.RO · eess.IV
keywords egocentric hand pose · stereo event cameras · 3D hand estimation · gesture recognition · bimanual tracking · event-based vision · bird's-eye-view fusion

The pith

Stereo event cameras with a bird's-eye-view fusion module deliver accurate 3D bimanual hand tracking and gesture recognition even in low light and occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EgoEV-HandPose, an end-to-end system that takes stereo event-camera streams as input and jointly estimates 3D poses for both hands while recognizing one of 38 gestures. Its core module, KeypointBEV, lifts per-camera features into a shared top-down coordinate frame and runs an iterative loop that reprojects points to enforce depth consistency and keep hand kinematics plausible. To support this, the authors release EgoEVHands, a real-world dataset of more than five thousand annotated egocentric sequences captured under varied illumination. Experiments show the method reaches 30.54 mm MPJPE and 86.87 percent top-1 gesture accuracy, beating prior RGB-stereo and event-camera baselines, especially when lighting is poor or the hands overlap.

Core claim

Lifting stereo event features into a canonical bird's-eye-view space and refining them through iterative reprojection-guided loops resolves monocular depth ambiguity and maintains kinematic consistency across bimanual motions, yielding state-of-the-art 3D pose and gesture results on real-world egocentric data where frame-based cameras degrade.

What carries the argument

KeypointBEV, a stereo fusion module that projects features into a canonical bird's-eye-view space and runs an iterative reprojection-guided refinement loop to resolve depth uncertainty.
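
The page carries no implementation details, so the following is a minimal sketch of the reprojection-guided part of that loop, assuming a plain finite-difference gradient step as a stand-in for the paper's learned, BEV-conditioned update; all names, shapes, and constants are illustrative.

```python
# Minimal NumPy sketch of a reprojection-guided refinement loop in the spirit
# of KeypointBEV. Shapes, names, and the gradient-descent update are our own
# illustrative assumptions, not the authors' implementation.
import numpy as np

def project(K, T, X):
    """Project 3D joints X (J, 3) into a camera with intrinsics K (3, 3)
    and world-to-camera extrinsics T (4, 4); returns pixel coords (J, 2)."""
    Xh = np.hstack([X, np.ones((len(X), 1))])     # homogeneous coordinates
    Xc = (T @ Xh.T).T[:, :3]                      # camera-frame coordinates
    uv = (K @ Xc.T).T
    return uv[:, :2] / uv[:, 2:3]

def refine(X_init, views, iters=3, step=1e-6, eps=1e-3):
    """Nudge 3D joint estimates so their reprojections agree with the 2D
    keypoints observed in both event-camera views.

    views: list of (K, T, keypoints_2d) tuples, one per camera.
    """
    X = X_init.astype(float).copy()

    def energy(P):
        # Summed squared reprojection error over both stereo views.
        return sum(np.sum((project(K, T, P) - kp) ** 2) for K, T, kp in views)

    for _ in range(iters):
        grad = np.zeros_like(X)
        for j in range(X.shape[0]):               # finite-difference gradient
            for d in range(3):
                Xp, Xm = X.copy(), X.copy()
                Xp[j, d] += eps
                Xm[j, d] -= eps
                grad[j, d] = (energy(Xp) - energy(Xm)) / (2 * eps)
        X -= step * grad                          # one refinement iteration
    return X
```

The `iters=3` default mirrors the rebuttal's observation that the loop typically converges by the third iteration; in the paper the update itself is learned rather than a hand-written gradient step.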

If this is right

  • Event-based egocentric systems can now operate in lighting conditions where conventional RGB cameras produce motion blur.
  • Bimanual hand tracking becomes feasible without explicit occlusion handling stages.
  • A single end-to-end network can output both 3D keypoints and discrete gesture labels from the same event stream.
  • Real-world stereo event datasets enable training and benchmarking of future event-vision models beyond simulated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same bird's-eye-view lifting could be applied to full upper-body or object-interaction tracking if additional keypoints are annotated.
  • Because event data is sparse, the approach may scale to higher-resolution sensors without proportional compute growth.
  • Integration with inertial measurements could further stabilize the refinement loop against ego-motion.

Load-bearing premise

The iterative reprojection loop will reliably correct depth errors and keep hand kinematics consistent even when the camera itself is moving quickly or when calibration is imperfect.

What would settle it

Record a new sequence with rapid head motion or known calibration drift and measure whether MPJPE rises substantially above 30 mm while gesture accuracy drops below 80 percent.
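
To make that test concrete, a minimal evaluation harness could look like the sketch below: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and the 35 mm / 80 percent cutoffs are one way to operationalize "substantially above 30 mm" and "below 80 percent". Array shapes and helper names are assumptions.

```python
# Minimal evaluation sketch for the settling test described above.
import numpy as np

def mpjpe_mm(pred, gt):
    """Mean per-joint position error in millimetres.
    pred, gt: (frames, joints, 3) arrays in millimetres."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))

def top1_accuracy(logits, labels):
    """Top-1 gesture accuracy. logits: (clips, 38); labels: (clips,)."""
    return float(np.mean(np.argmax(logits, axis=-1) == labels))

def stress_test(pred_joints, gt_joints, gesture_logits, gesture_labels):
    """Returns (MPJPE, accuracy, claim_undermined) for a newly recorded
    sequence with rapid head motion or known calibration drift."""
    err = mpjpe_mm(pred_joints, gt_joints)
    acc = top1_accuracy(gesture_logits, gesture_labels)
    # 35 mm and 0.80 are our own cutoffs for "substantially above 30 mm"
    # and "below 80 percent"; the paper does not specify a margin.
    return err, acc, (err > 35.0 and acc < 0.80)
```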

Figures

Figures reproduced from arXiv: 2605.12297 by Hao Shi, Jiajun Zhai, Kailun Yang, Kaiwei Wang, Luming Wang.

Figure 1. Overview of the proposed EgoEV-HandPose framework and the established EgoEVHands dataset. The framework addresses (Left) the current challenges of RGB-based or monocular systems, such as lighting sensitivity and depth ambiguity, by introducing (Middle) EgoEVHands, the first large-scale, real-world stereo event-based dataset for egocentric hand perception, characterized by its HDR properties, 3D geometric c… (view at source ↗)
Figure 3. The head-mounted capture system employed for Ego… (view at source ↗)
Figure 4. Manifold visualization of the EgoEVHands dataset… (view at source ↗)
Figure 5. Overall pipeline of the proposed EgoEV-HandPose. Synchronized left and right event streams are converted to LNES… (view at source ↗)
Figure 6. Visualization of EgoEV-HandPose results in four repre… (view at source ↗)
Figure 7. Qualitative comparison between our method and state… (view at source ↗)
Figure 9. Evolution of 3D and 2D MPJPE across iterative… (view at source ↗)
Figure 8. Confusion matrix of gesture recognition on the Ego… (view at source ↗)
Figure 10. PCK curve on the full EgoEVHands test set. The… (view at source ↗)
Figure 11. Qualitative comparison on the DHP19 dataset… (view at source ↗)
read the original abstract

Egocentric 3D hand pose estimation and gesture recognition are essential for immersive augmented/virtual reality, human-computer interaction, and robotics. However, conventional frame-based cameras suffer from motion blur and limited dynamic range, while existing event-based methods are hindered by ego-motion interference, monocular depth ambiguity, and the lack of large-scale real-world stereo datasets. To overcome these limitations, we propose EgoEV-HandPose, an end-to-end framework for joint 3D bimanual pose estimation and gesture recognition from stereo event streams. Central to our approach is KeypointBEV, a flexible stereo fusion module that lifts features into a canonical bird's-eye-view space and employs an iterative reprojection-guided refinement loop to progressively resolve depth uncertainty and enforce kinematic consistency. In addition, we introduce EgoEVHands, the first large-scale real-world stereo event-camera dataset for egocentric hand perception, containing 5,419 annotated sequences with dense 3D/2D keypoints across 38 gesture classes under varying illumination. Extensive experiments demonstrate that EgoEV-HandPose achieves state-of-the-art performance with an MPJPE of 30.54mm and 86.87% Top-1 gesture recognition accuracy, significantly outperforming RGB-based stereo and prior event-camera methods, particularly in low-light and bimanual occlusion scenarios, thereby setting a new benchmark for event-based egocentric perception. The established dataset and source code will be publicly released at https://github.com/ZJUWang01/EgoEV-HandPose.
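
Figure 5 notes that the synchronized left and right event streams are first converted to LNES representations before fusion. The paper's exact formulation is not reproduced on this page, so the sketch below shows a generic LNES-style encoding in the spirit of earlier event-based hand work: a two-channel image in which each pixel stores the time-normalized timestamp of the most recent positive or negative event in the window. Function name, argument layout, and normalization are assumptions.

```python
# Generic LNES-style event encoding (assumed formulation, not necessarily the
# paper's): per polarity, each pixel keeps the normalized timestamp of its
# most recent event within the time window.
import numpy as np

def events_to_lnes(xs, ys, ts, ps, t_start, t_end, height, width):
    """xs, ys: pixel coordinates; ts: timestamps; ps: polarities in {0, 1}.
    Returns a (2, height, width) float32 image."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    norm_t = (ts - t_start) / max(t_end - t_start, 1e-9)
    for i in np.argsort(ts):              # later events overwrite earlier ones
        frame[int(ps[i]), int(ys[i]), int(xs[i])] = norm_t[i]
    return frame
```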

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes EgoEV-HandPose, an end-to-end framework for joint 3D bimanual hand pose estimation and gesture recognition from stereo event streams. Its core is KeypointBEV, a stereo fusion module that lifts features into bird's-eye-view space and applies an iterative reprojection-guided refinement loop to resolve depth ambiguity and enforce kinematic consistency. The authors also release EgoEVHands, a new large-scale real-world stereo event dataset with 5,419 sequences, dense 3D/2D keypoints, and 38 gesture classes under varying illumination. Experiments report SOTA results of 30.54 mm MPJPE and 86.87% Top-1 gesture accuracy, outperforming RGB stereo and prior event methods especially under low light and bimanual occlusion.

Significance. If the performance claims hold after addressing the noted concerns, the work would meaningfully advance event-based egocentric perception by mitigating motion blur and dynamic-range limitations of frame cameras while providing the first large real-world stereo event dataset for this task. Public release of the dataset and code would further support reproducibility and benchmarking in immersive AR/VR and robotics applications.

major comments (1)
  1. [KeypointBEV / §3] KeypointBEV description (abstract and §3): the iterative reprojection-guided refinement loop is presented as the mechanism that reliably resolves monocular depth ambiguity and enforces kinematic consistency, yet no quantitative sensitivity analysis is provided for ego-motion magnitude, residual calibration drift, event noise levels, or iteration count. Because the headline MPJPE of 30.54 mm and gesture accuracy rest on this loop converging to correct minima, the absence of such analysis leaves the central empirical claims vulnerable to the exact perturbations that arise in real egocentric stereo streams.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'significantly outperforming' should be accompanied by the specific baseline MPJPE and accuracy numbers (or a reference to Table X) so readers can immediately gauge the margin.
  2. [Dataset] Dataset section: clarify the train/validation/test split ratios and whether any sequences were held out by subject or illumination condition; this directly affects claims of generalization to low-light and occlusion scenarios.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. We address the single major comment below and have revised the manuscript accordingly to strengthen the validation of KeypointBEV.

read point-by-point responses
  1. Referee: [KeypointBEV / §3] KeypointBEV description (abstract and §3): the iterative reprojection-guided refinement loop is presented as the mechanism that reliably resolves monocular depth ambiguity and enforces kinematic consistency, yet no quantitative sensitivity analysis is provided for ego-motion magnitude, residual calibration drift, event noise levels, or iteration count. Because the headline MPJPE of 30.54 mm and gesture accuracy rest on this loop converging to correct minima, the absence of such analysis leaves the central empirical claims vulnerable to the exact perturbations that arise in real egocentric stereo streams.

    Authors: We agree that the original submission lacked a dedicated quantitative sensitivity analysis of the iterative reprojection-guided refinement loop. This is a fair observation, as the loop is central to resolving depth ambiguity. In the revised manuscript we have added a new subsection (now §3.4) and corresponding supplementary figures that report controlled experiments varying ego-motion magnitude (0–1.0 m/s), residual calibration drift (0–3 pixels), event noise density (0–20% polarity flips), and iteration count (1–6). Within the operating ranges observed in EgoEVHands, MPJPE remains below 35 mm and Top-1 gesture accuracy above 85%, with convergence typically reached by iteration 3. These results are now explicitly tied to the headline metrics and support the loop’s reliability under realistic egocentric perturbations. revision: yes
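
For concreteness, the sweep described in that response could be organized roughly as follows. The grid values are taken from the rebuttal text; `perturb`, `run_model`, and `evaluate` are placeholder callables standing in for whatever interface the released code will expose.

```python
# Sketch of the sensitivity sweep described in the rebuttal. The ranges come
# from the response; the perturb / run_model / evaluate interface is hypothetical.
import itertools

EGO_MOTION_MPS  = [0.0, 0.25, 0.5, 0.75, 1.0]   # ego-motion magnitude (m/s)
CALIB_DRIFT_PX  = [0, 1, 2, 3]                  # residual calibration drift (px)
NOISE_FLIP_FRAC = [0.0, 0.05, 0.10, 0.20]       # event polarity flips (fraction)
REFINE_ITERS    = [1, 2, 3, 4, 5, 6]            # refinement iteration count

def sensitivity_sweep(test_set, perturb, run_model, evaluate):
    """Grid-sweep the four factors and collect (factors, MPJPE, top-1) rows."""
    rows = []
    for v, d, n, k in itertools.product(EGO_MOTION_MPS, CALIB_DRIFT_PX,
                                        NOISE_FLIP_FRAC, REFINE_ITERS):
        perturbed = perturb(test_set, ego_motion=v, drift_px=d, noise_frac=n)
        preds = run_model(perturbed, refine_iters=k)
        mpjpe_mm, top1 = evaluate(preds, perturbed)
        rows.append((v, d, n, k, mpjpe_mm, top1))
    return rows
```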

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims

full rationale

The paper describes an end-to-end neural framework (KeypointBEV) for stereo event-based hand pose estimation and introduces a new dataset (EgoEVHands) for empirical evaluation. Reported metrics (MPJPE 30.54 mm, 86.87% accuracy) are standard held-out test-set results on real-world sequences, not quantities defined in terms of fitted parameters or self-referential equations. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or method summary; the derivation chain consists of architectural choices and supervised training whose outputs are independently falsifiable on external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard computer-vision assumptions about camera calibration, event generation models, and kinematic hand priors; no new physical entities are postulated. Free parameters consist of the usual neural-network weights and hyperparameters learned from the new dataset.

free parameters (1)
  • network weights and hyperparameters
    Learned during end-to-end training on EgoEVHands; typical for deep-learning pose estimators and not counted as ad-hoc inventions.
axioms (2)
  • domain assumption Stereo event cameras are calibrated and synchronized
    Implicit in the stereo fusion module and reprojection refinement; standard for any stereo vision work.
  • domain assumption Hand kinematics provide useful consistency constraints
    Used in the iterative refinement loop to resolve depth uncertainty.

pith-pipeline@v0.9.0 · 5594 in / 1535 out tokens · 36619 ms · 2026-05-13T05:48:54.661209+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors
