NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition
Pith reviewed 2026-05-10 09:23 UTC · model grok-4.3
The pith
Event-based lip motion processing enables speaker recognition that generalizes to unseen viewpoints and lighting without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NeuroLip is an event-driven spatiotemporal framework for lip-motion-based visual speaker recognition that trains under a single controlled condition and generalizes to unseen viewpoints and illuminations. It uses a Temporal-aware Voxel Encoding module with adaptive event weighting, a Structure-aware Spatial Enhancer to amplify discriminative patterns while suppressing noise, and a Polarity Consistency Regularization mechanism to retain motion-direction cues. On the DVSpeaker dataset of 50 subjects across four scenarios, the system reaches near-perfect matched-scene accuracy, over 71% on unseen viewpoints, and nearly 76% under low light, exceeding prior methods by at least 8.54%.
What carries the argument
The NeuroLip pipeline, which processes event-camera data through adaptive temporal voxel encoding, structure-preserving spatial enhancement, and polarity regularization to extract stable lip-motion dynamics for cross-scene recognition.
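To make the "temporal voxel encoding" step concrete, here is a minimal sketch of a generic event voxel grid, assuming events arrive as (timestamp, x, y, polarity) tuples. It uses fixed linear interpolation between neighbouring time bins, which is a common baseline and not NeuroLip's adaptive event weighting; the function name and event layout are hypothetical and do not come from the paper or its repository.

```python
from typing import List, Tuple

# Hypothetical event layout: (timestamp, x, y, polarity), polarity in {-1, +1}.
Event = Tuple[float, int, int, int]


def events_to_voxel_grid(
    events: List[Event], num_bins: int, height: int, width: int
) -> List[List[List[float]]]:
    """Accumulate signed events into a grid indexed as [bin][y][x].

    Each event's polarity is shared between its two nearest temporal bins
    with linear weights. This is a generic encoding, NOT the paper's
    adaptive weighting scheme.
    """
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid

    t_start = min(e[0] for e in events)
    t_end = max(e[0] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single timestamp

    for t, x, y, polarity in events:
        # Map the timestamp to a fractional bin position in [0, num_bins - 1].
        pos = (t - t_start) / span * (num_bins - 1)
        lower = int(pos)
        upper = min(lower + 1, num_bins - 1)
        weight_upper = pos - lower
        grid[lower][y][x] += polarity * (1.0 - weight_upper)
        grid[upper][y][x] += polarity * weight_upper
    return grid
```

Keeping the polarity sign in the accumulated value preserves the motion-direction information that the pipeline description says polarity regularization is meant to retain; a learned or adaptive weighting would replace the fixed linear weights used here.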
If this is right
- Speaker identification can operate silently using only lip dynamics without audio input or controlled lighting.
- Event cameras overcome motion blur and low dynamic range that limit conventional frame-based lip-reading methods.
- Cross-scene generalization reduces the need to collect training data under every possible viewpoint and illumination.
- Behavioral patterns in lip motion can serve as a biometric that supplements appearance-based approaches.
Where Pith is reading between the lines
- The same event-processing approach could extend to other motion-based biometrics such as gait or hand gestures under variable conditions.
- Deployment in surveillance or access-control settings with fluctuating lights becomes more feasible if the generalization holds.
- Larger-scale tests with additional subjects and more extreme scene shifts would reveal the practical boundaries of this stability.
Load-bearing premise
Lip-motion dynamics are sufficiently unique to each person and stable enough to support recognition when viewpoint and illumination change completely from the single training condition.
What would settle it
Apply the trained NeuroLip model to lip-motion event data from the same subjects recorded at a new viewpoint and lighting level never seen during training, then check whether accuracy stays above 70 percent or falls sharply relative to matched-scene results.
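The check described above amounts to computing accuracy per recording scene and comparing each unseen scene with the training scene. A minimal sketch, assuming predictions are available as (scene, true_id, predicted_id) records; the scene labels and helper names are hypothetical and are not taken from the DVSpeaker release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (scene label, true subject id, predicted subject id).
Record = Tuple[str, str, str]


def accuracy_by_scene(records: Iterable[Record]) -> Dict[str, float]:
    """Return the fraction of correct identifications for each scene."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for scene, true_id, predicted_id in records:
        totals[scene] += 1
        hits[scene] += int(true_id == predicted_id)
    return {scene: hits[scene] / totals[scene] for scene in totals}


def generalization_gap(acc: Dict[str, float], train_scene: str) -> Dict[str, float]:
    """Accuracy drop of every unseen scene relative to the training scene."""
    return {s: acc[train_scene] - a for s, a in acc.items() if s != train_scene}


# Toy records for illustration only; these are not results from the paper.
records = [
    ("matched", "s01", "s01"),
    ("matched", "s02", "s02"),
    ("new_view", "s01", "s01"),
    ("new_view", "s02", "s01"),
]
acc = accuracy_by_scene(records)
gap = generalization_gap(acc, train_scene="matched")
```

A small gap on a scene never seen in training would support the stability premise; a sharp drop relative to the matched scene would count against it.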
Original abstract
Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes NeuroLip, an event-based spatiotemporal framework for lip-motion visual speaker recognition under a strict cross-scene protocol (train on one controlled condition, test on unseen viewpoints and lighting). It introduces three components (Temporal-aware Voxel Encoding with adaptive event weighting, a Structure-aware Spatial Enhancer for preserving vertical motion structure, and Polarity Consistency Regularization) along with the new DVSpeaker dataset (50 subjects, four scenarios). Experiments report near-perfect matched-scene accuracy, >71% on unseen viewpoints, ~76% in low light, and a margin of at least 8.54% over baselines, with public release of dataset and code.
Significance. If the cross-scene results hold after verification, the work advances event-camera biometrics by exploiting stable behavioral lip dynamics rather than appearance, offering robustness to motion blur, viewpoint shifts, and illumination changes. Public dataset and code release is a clear strength supporting reproducibility.
major comments (2)
- [Abstract] The central cross-scene generalization claim (training under a single controlled condition yielding >71% unseen-viewpoint and ~76% low-light accuracy) is load-bearing, yet the abstract provides no architecture details, training protocol, baseline implementations, or statistical tests; this absence directly limits assessment of whether the 8.54% margin reflects true invariance or dataset correlations.
- [Methods] Methods (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, Polarity Consistency Regularization): The assertion that these modules extract subject-specific patterns stable across drastic viewpoint/illumination shifts requires explicit verification; without ablations showing that adaptive weighting and polarity cues do not encode residual scene statistics (e.g., event density changes or foreshortened vertical motion), the reported generalization numbers risk being inflated by unaccounted dataset leakage.
minor comments (2)
- Define all acronyms at first use (e.g., DVSpeaker) and ensure consistent notation for event polarity and voxel grids throughout.
- [Conclusion] Add a limitations paragraph discussing applicability to continuous speech, multi-speaker settings, or real-time deployment constraints.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions made to strengthen the presentation of our cross-scene results.
- Referee: [Abstract] The central cross-scene generalization claim (training under a single controlled condition yielding >71% unseen-viewpoint and ~76% low-light accuracy) is load-bearing, yet the abstract provides no architecture details, training protocol, baseline implementations, or statistical tests; this absence directly limits assessment of whether the 8.54% margin reflects true invariance or dataset correlations.
  Authors: We acknowledge the referee's point that the abstract is concise. The full architecture (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, and Polarity Consistency Regularization) is detailed in Section 3, the training protocol and cross-scene split in Section 4.1, baseline implementations and hyperparameters in Section 4.2, and statistical tests (including means, standard deviations, and significance levels) in Tables 2–4 and Section 4.4. To improve accessibility of the generalization claim, we have revised the abstract to briefly reference the three core modules and the strict single-condition training protocol while remaining within length limits.
  revision: yes
- Referee: [Methods] Methods (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, Polarity Consistency Regularization): The assertion that these modules extract subject-specific patterns stable across drastic viewpoint/illumination shifts requires explicit verification; without ablations showing that adaptive weighting and polarity cues do not encode residual scene statistics (e.g., event density changes or foreshortened vertical motion), the reported generalization numbers risk being inflated by unaccounted dataset leakage.
  Authors: We agree that explicit verification against scene leakage is necessary. The original manuscript already includes ablation studies in Section 4.3 that quantify the performance drop when each module is removed, showing consistent gains in cross-scene accuracy. In the revised version we have expanded these with targeted analyses: (i) event-density normalization experiments demonstrating that adaptive weighting reduces scene-dependent density variations while preserving subject identity; (ii) vertical-motion structure comparisons on viewpoint-controlled subsets confirming that the Spatial Enhancer captures articulation patterns rather than foreshortening artifacts; and (iii) polarity-consistency ablations with feature visualizations illustrating retention of motion-direction cues independent of illumination. These additions directly address the concern that reported numbers may reflect dataset correlations.
  revision: yes
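The event-density concern raised in this exchange can be illustrated with a small sketch: if two recordings of the same motion differ only in how many events the sensor fires (as can happen when illumination changes), rescaling each recording to unit total magnitude removes that difference. This is a generic L1 normalization under the assumption of a flattened per-cell signed event count; it is not the normalization experiment the rebuttal refers to, and the function name is hypothetical.

```python
from typing import List, Sequence


def normalize_density(counts: Sequence[float]) -> List[float]:
    """Scale signed per-cell event counts so their absolute values sum to 1.

    An all-zero input is returned unchanged to avoid dividing by zero.
    """
    total = sum(abs(c) for c in counts)
    if total == 0:
        return list(counts)
    return [c / total for c in counts]


# Toy example: the same spatial pattern at two event densities.
bright = [4, 0, -2, 2]   # more events fired
dim = [2, 0, -1, 1]      # half as many events, same pattern
same_pattern = normalize_density(bright) == normalize_density(dim)
```

A representation that still separates scenes after this kind of rescaling would be carrying scene information through something other than raw event density, which is the distinction the referee asks the ablations to establish.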
Circularity Check
No circularity: empirical evaluation on newly introduced dataset with no algebraic reduction or self-referential fitting
The paper presents an event-based neural framework (NeuroLip) with three modules and evaluates it on a new dataset (DVSpeaker) under a cross-scene protocol. All reported accuracies (matched-scene near-perfect, cross-scene >71% and ~76%) are obtained via standard train/test splits and comparison to baselines, not by deriving quantities from fitted parameters that are then re-predicted. No equations, uniqueness theorems, or ansatzes are invoked that reduce to self-definition or prior self-citations. The central claim of generalization is an empirical assertion about the learned features, not a mathematical identity. This is the expected non-circular outcome for an applied ML paper introducing a dataset and model.
Axiom & Free-Parameter Ledger
free parameters (2)
- adaptive event weighting coefficients
- regularization weight for polarity consistency
axioms (2)
- domain assumption: Lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination that remain stable across environmental changes.
- domain assumption: Event cameras capture fine-grained lip dynamics without motion blur and with high dynamic range, overcoming limitations of frame-based cameras.