NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition

Junguang Yao; Stjepan Picek; Wenye Liu; Yue Zheng

arXiv: 2604.15718 · v1 · submitted 2026-04-17 · cs.CV · cs.AI · cs.CR · cs.DB · cs.LG

Pith reviewed 2026-05-10 09:23 UTC · model grok-4.3

Keywords: event camera · lip motion · speaker recognition · cross-scene generalization · spatiotemporal learning · visual biometrics · dynamic vision sensor

The pith

Event-based lip motion processing enables speaker recognition that generalizes to unseen viewpoints and lighting without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that lip motion carries stable, subject-specific behavioral patterns that remain recognizable even when camera angle and illumination differ from the training data. It presents NeuroLip, an event-camera framework that converts sparse event streams into voxel representations, enhances spatial motion structure, and regularizes polarity signals to preserve motion-direction information. Training uses one controlled scene, while testing uses viewpoints and lighting conditions never seen in training; the reported accuracy is over 71% on unseen viewpoints and nearly 76% in low light, conditions where standard frame cameras suffer from motion blur and limited dynamic range. This setup points to a practical route toward reliable visual biometrics in changing real-world conditions.

Core claim

NeuroLip is an event-driven spatiotemporal framework for lip-motion-based visual speaker recognition that trains under a single controlled condition and generalizes to unseen viewpoints and illuminations. It uses a Temporal-aware Voxel Encoding module with adaptive event weighting, a Structure-aware Spatial Enhancer to amplify discriminative patterns while suppressing noise, and a Polarity Consistency Regularization mechanism to retain motion-direction cues. On the DVSpeaker dataset of 50 subjects across four scenarios, the system reaches near-perfect matched-scene accuracy, over 71% on unseen viewpoints, and nearly 76% under low light, exceeding prior methods by at least 8.54%.

What carries the argument

The NeuroLip pipeline, which processes event-camera data through adaptive temporal voxel encoding, structure-preserving spatial enhancement, and polarity regularization to extract stable lip-motion dynamics for cross-scene recognition.
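As a rough illustration of the first stage named above, the sketch below bins an event stream into a polarity-separated voxel grid using linear temporal interpolation, a common event representation. It is not the paper's Temporal-aware Voxel Encoding: the function name `events_to_voxel`, the bin count, and the fixed interpolation weights are assumptions made here, and the paper's adaptive event weighting would replace those fixed weights.

```python
import numpy as np

def events_to_voxel(x, y, t, p, height, width, num_bins=5):
    """Accumulate events into a (2, num_bins, height, width) voxel grid.

    x, y: pixel coordinates; t: timestamps; p: polarity (0 = OFF, 1 = ON).
    Each event is split between its two nearest temporal bins.
    Hypothetical helper for illustration, not the paper's TVE module.
    """
    voxel = np.zeros((2, num_bins, height, width), dtype=np.float32)
    if len(t) == 0:
        return voxel
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    p = np.asarray(p, dtype=int)
    t = np.asarray(t, dtype=np.float64)
    span = max(t.max() - t.min(), 1e-9)
    # Normalise timestamps to the continuous range [0, num_bins - 1].
    tn = (t - t.min()) / span * (num_bins - 1)
    lo = np.floor(tn).astype(int)
    hi = np.minimum(lo + 1, num_bins - 1)
    w_hi = (tn - lo).astype(np.float32)
    w_lo = (1.0 - w_hi).astype(np.float32)
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(voxel, (p, lo, y, x), w_lo)
    np.add.at(voxel, (p, hi, y, x), w_hi)
    return voxel
```

Because the two interpolation weights of each event sum to one, the grid total equals the number of events, which is a convenient sanity check before any learned weighting is applied.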

If this is right

Speaker identification can operate silently using only lip dynamics without audio input or controlled lighting.
Event cameras overcome motion blur and low dynamic range that limit conventional frame-based lip-reading methods.
Cross-scene generalization reduces the need to collect training data under every possible viewpoint and illumination.
Behavioral patterns in lip motion can serve as a biometric that supplements appearance-based approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same event-processing approach could extend to other motion-based biometrics such as gait or hand gestures under variable conditions.
Deployment in surveillance or access-control settings with fluctuating lighting becomes more feasible if the generalization holds.
Larger-scale tests with additional subjects and more extreme scene shifts would reveal the practical boundaries of this stability.

Load-bearing premise

Lip-motion dynamics are sufficiently unique to each person and stable enough to support recognition when viewpoint and illumination change completely from the single training condition.

What would settle it

Apply the trained NeuroLip model to lip-motion event data from the same subjects recorded at a new viewpoint and lighting level never seen during training, then check whether accuracy stays above 70 percent or falls sharply relative to matched-scene results.
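The check described above amounts to scoring predictions per scene and comparing each unseen scene against the matched one. A minimal sketch follows, assuming predictions are already available as plain lists; the function names and the scene labels (`"matched"`, `"new_view"`) are hypothetical and not taken from the paper or its code release.

```python
from collections import defaultdict

def per_scene_accuracy(predictions, labels, scenes):
    """Return {scene: accuracy} for parallel lists of predictions, labels, scenes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, scene in zip(predictions, labels, scenes):
        total[scene] += 1
        correct[scene] += int(pred == label)
    return {scene: correct[scene] / total[scene] for scene in total}

def generalization_gap(accuracy, matched_scene):
    """Accuracy drop of every other scene relative to the matched scene."""
    reference = accuracy[matched_scene]
    return {scene: reference - acc
            for scene, acc in accuracy.items() if scene != matched_scene}

# Toy example with made-up predictions for two scenes.
acc = per_scene_accuracy(
    predictions=[1, 2, 3, 1],
    labels=[1, 2, 1, 1],
    scenes=["matched", "matched", "new_view", "new_view"],
)
gap = generalization_gap(acc, matched_scene="matched")
```

A large gap for a scene never seen in training would count against the stability premise; a gap that stays small across several new scenes would support it.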

Figures

Figures reproduced from arXiv: 2604.15718 by Junguang Yao, Stjepan Picek, Wenye Liu, Yue Zheng.

**Figure 2.** An overview of NeuroLip: the pipeline converts raw events into discriminative features via four cascaded stages.

**Figure 3.** Acquisition setup using a synchronized dual-camera (event and …

**Figure 4.** Representative samples from DVSpeaker across four experimental …

**Figure 5.** Impact of λ (a) and C in CCL (b). The preceding passage in the paper attributes performance under matched conditions and robustness to cross-scene variations to the complementary contributions of TVE, SSE, and PCR. On the PCR strength, the paper notes that λ in Eq. (23) controls the strength of the PCR loss term, with larger values enforcing stronger polarity constraints, and that λ is investigated over the range 0.01 to 0.50.

Original abstract

Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeuroLip adds an event-camera take on lip-motion speaker ID with a new dataset and cross-scene protocol, but the generalization numbers rest on an untested assumption that the features ignore viewpoint and lighting cues.

This paper introduces NeuroLip, an event-based pipeline for recognizing speakers from lip motion alone. It trains on data from one controlled scene, evaluates on unseen viewpoints and lighting, and ships the DVSpeaker dataset and code. That setup is more realistic than most prior lip-reading work, and the reported gains (over 71% on new views, nearly 76% in low light, beating baselines by at least 8.54%) are the main claim to watch.

Referee Report

2 major / 2 minor

Summary. The paper proposes NeuroLip, an event-based spatiotemporal framework for lip-motion visual speaker recognition under a strict cross-scene protocol (train on one controlled condition, test on unseen viewpoints and lighting). It introduces three components—Temporal-aware Voxel Encoding with adaptive event weighting, Structure-aware Spatial Enhancer for preserving vertical motion structure, and Polarity Consistency Regularization—along with the new DVSpeaker dataset (50 subjects, four scenarios). Experiments report near-perfect matched-scene accuracy, >71% on unseen viewpoints, ~76% in low-light, and at least 8.54% outperformance over baselines, with public release of dataset and code.

Significance. If the cross-scene results hold after verification, the work advances event-camera biometrics by exploiting stable behavioral lip dynamics rather than appearance, offering robustness to motion blur, viewpoint shifts, and illumination changes. Public dataset and code release is a clear strength supporting reproducibility.

major comments (2)

[Abstract] The central cross-scene generalization claim (training under a single controlled condition yielding >71% unseen-viewpoint and ~76% low-light accuracy) is load-bearing, yet the abstract provides no architecture details, training protocol, baseline implementations, or statistical tests; this absence directly limits assessment of whether the 8.54% margin reflects true invariance or dataset correlations.
[Methods] (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, Polarity Consistency Regularization) The assertion that these modules extract subject-specific patterns stable across drastic viewpoint/illumination shifts requires explicit verification; without ablations showing that adaptive weighting and polarity cues do not encode residual scene statistics (e.g., event density changes or foreshortened vertical motion), the reported generalization numbers risk being inflated by unaccounted dataset leakage.
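One simple way to probe the event-density concern raised above is a density-only baseline: identify subjects using nothing but the number of events in each clip. The sketch below is an assumed diagnostic, not an experiment from the paper; the function name and the nearest-class-mean rule are choices made here for illustration.

```python
import numpy as np

def density_only_accuracy(train_counts, train_ids, test_counts, test_ids):
    """Nearest-class-mean identification using only per-clip event counts.

    Hypothetical diagnostic: if this scores well above chance on a
    cross-scene split, event density alone carries identity or scene cues.
    """
    train_counts = np.asarray(train_counts, dtype=float)
    train_ids = np.asarray(train_ids)
    test_counts = np.asarray(test_counts, dtype=float)
    test_ids = np.asarray(test_ids)
    subjects = np.unique(train_ids)
    # Mean event count per subject in the training scene.
    means = np.array([train_counts[train_ids == s].mean() for s in subjects])
    # Assign each test clip to the subject with the closest mean count.
    nearest = np.abs(test_counts[:, None] - means[None, :]).argmin(axis=1)
    return float((subjects[nearest] == test_ids).mean())
```

With 50 subjects, chance is 2%, so a density-only score near that level on unseen scenes would suggest the reported gains do not come from raw event counts; it would not rule out other scene statistics.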

minor comments (2)

Define all acronyms at first use (e.g., DVSpeaker) and ensure consistent notation for event polarity and voxel grids throughout.
[Conclusion] Add a limitations paragraph discussing applicability to continuous speech, multi-speaker settings, or real-time deployment constraints.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions made to strengthen the presentation of our cross-scene results.

Referee: [Abstract] The central cross-scene generalization claim (training under a single controlled condition yielding >71% unseen-viewpoint and ~76% low-light accuracy) is load-bearing, yet the abstract provides no architecture details, training protocol, baseline implementations, or statistical tests; this absence directly limits assessment of whether the 8.54% margin reflects true invariance or dataset correlations.

Authors: We acknowledge the referee's point that the abstract is concise. The full architecture (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, and Polarity Consistency Regularization) is detailed in Section 3, the training protocol and cross-scene split in Section 4.1, baseline implementations and hyperparameters in Section 4.2, and statistical tests (including means, standard deviations, and significance levels) in Tables 2–4 and Section 4.4. To improve accessibility of the generalization claim, we have revised the abstract to briefly reference the three core modules and the strict single-condition training protocol while remaining within length limits.
Revision: yes

Referee: [Methods] (Temporal-aware Voxel Encoding, Structure-aware Spatial Enhancer, Polarity Consistency Regularization) The assertion that these modules extract subject-specific patterns stable across drastic viewpoint/illumination shifts requires explicit verification; without ablations showing that adaptive weighting and polarity cues do not encode residual scene statistics (e.g., event density changes or foreshortened vertical motion), the reported generalization numbers risk being inflated by unaccounted dataset leakage.

Authors: We agree that explicit verification against scene leakage is necessary. The original manuscript already includes ablation studies in Section 4.3 that quantify the performance drop when each module is removed, showing consistent gains in cross-scene accuracy. In the revised version we have expanded these with targeted analyses: (i) event-density normalization experiments demonstrating that adaptive weighting reduces scene-dependent density variations while preserving subject identity; (ii) vertical-motion structure comparisons on viewpoint-controlled subsets confirming that the Spatial Enhancer captures articulation patterns rather than foreshortening artifacts; and (iii) polarity-consistency ablations with feature visualizations illustrating retention of motion-direction cues independent of illumination. These additions directly address the concern that reported numbers may reflect dataset correlations.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation on newly introduced dataset with no algebraic reduction or self-referential fitting

The paper presents an event-based neural framework (NeuroLip) with three modules and evaluates it on a new dataset (DVSpeaker) under a cross-scene protocol. All reported accuracies (matched-scene near-perfect, cross-scene >71% and ~76%) are obtained via standard train/test splits and comparison to baselines, not by deriving quantities from fitted parameters that are then re-predicted. No equations, uniqueness theorems, or ansatzes are invoked that reduce to self-definition or prior self-citations. The central claim of generalization is an empirical assertion about the learned features, not a mathematical identity. This is the expected non-circular outcome for an applied ML paper introducing a dataset and model.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that lip motion encodes stable subject-specific dynamics and on standard deep-learning assumptions about generalization from limited training conditions. No new physical entities are postulated.

free parameters (2)

adaptive event weighting coefficients
Introduced inside the Temporal-aware Voxel Encoding module to emphasize recent events.
regularization weight for polarity consistency
Controls the strength of the Polarity Consistency Regularization term during training.
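The role of the second free parameter can be written compactly. The form below is an assumption: a weighted sum of a classification loss and the polarity consistency term, with the symbols $\mathcal{L}_{\text{cls}}$ and $\mathcal{L}_{\text{pcr}}$ chosen here for illustration. The paper's Eq. (23) is not reproduced on this page and may contain further terms.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{cls}} + \lambda \, \mathcal{L}_{\text{pcr}},
\qquad \lambda \in [0.01,\; 0.50]
```

The interval is the range over which the paper reports investigating λ; larger values enforce stronger polarity constraints.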

axioms (2)

Domain assumption: Lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination that remain stable across environmental changes.
Stated explicitly in the abstract as the foundation for cross-scene generalization.
Domain assumption: Event cameras capture fine-grained lip dynamics without motion blur and with high dynamic range, overcoming limitations of frame-based cameras.
Used to justify the choice of sensing modality.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages

[1]

Sparse coding based lip texture representation for visual speaker identification,

J.-Y . Lai, S.-L. Wang, X.-J. Shi, and A. W.-C. Liew, “Sparse coding based lip texture representation for visual speaker identification,” in Proc. Int. Conf. Digit. Signal Process. (DSP), 2014, pp. 607–610

work page 2014
[2]

Lip feature disentangle- ment for visual speaker authentication in natural scenes,

Y . He, L. Yang, S. Wang, and A. W.-C. Liew, “Lip feature disentangle- ment for visual speaker authentication in natural scenes,”IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 10, pp. 9898–9909, 2024

work page 2024
[3]

Studies on personal identification by means of lip prints,

Y . Tsuchihashi, “Studies on personal identification by means of lip prints,”Forensic Sci., vol. 3, pp. 233–248, 1974

work page 1974
[4]

Lip as biometric and beyond: A survey,

D. P. Chowdhury, R. Kumari, S. Bakshi, M. N. Sahoo, and A. Das, “Lip as biometric and beyond: A survey,”Multimed. Tools Appl., vol. 81, no. 3, pp. 3831–3865, 2022

work page 2022
[5]

The role of facial action units in investigating facial movements during speech,

A. A. Newby, A. Bhatta, C. Kirkland III, N. Arnold, and L. A. Thomp- son, “The role of facial action units in investigating facial movements during speech,”Electronics, vol. 14, no. 10, p. 2066, 2025

work page 2066
[6]

Preventing deepfake attacks on speaker authentication by dynamic lip movement analysis,

C.-Z. Yang, J. Ma, S. Wang, and A. W.-C. Liew, “Preventing deepfake attacks on speaker authentication by dynamic lip movement analysis,” IEEE Trans. Inf. Forensics Security, vol. 16, pp. 1841–1854, 2020

work page 2020
[7]

Securing face liveness detection on mobile devices using unforgeable lip motion patterns,

M. Zhou, Q. Wang, Q. Li, W. Zhou, J. Yang, and C. Shen, “Securing face liveness detection on mobile devices using unforgeable lip motion patterns,”IEEE Trans. Mobile Comput., vol. 23, no. 10, pp. 9772–9788, 2024

work page 2024
[8]

One-shot lip-based biometric authentication: Extending behavioral features with authentication phrase information,

B. Koch and R. Grbi ´c, “One-shot lip-based biometric authentication: Extending behavioral features with authentication phrase information,” Image Vis. Comput., vol. 142, p. 104900, 2024

work page 2024
[9]

Lip biometric template security framework using spatial steganography,

S. Das, K. Muhammad, S. Bakshi, I. Mukherjee, P. K. Sa, A. K. Sangaiah, and A. Bruno, “Lip biometric template security framework using spatial steganography,”Pattern Recognit. Lett., vol. 126, pp. 102– 110, 2019

work page 2019
[10]

Understanding visual lip-based biometric authentication for mobile devices,

C. Wright and D. W. Stewart, “Understanding visual lip-based biometric authentication for mobile devices,”EURASIP J. Inf. Secur., vol. 2020, no. 1, p. 3, 2020. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 14

work page 2020
[11]

Discriminative analysis of lip motion features for speaker identification and speech- reading,

H. E. Cetingul, Y . Yemez, E. Erzin, and A. M. Tekalp, “Discriminative analysis of lip motion features for speaker identification and speech- reading,”IEEE Trans. Image Process., vol. 15, no. 10, pp. 2879–2891, 2006

work page 2006
[12]

Physiological and behavioral lip bio- metrics: A comprehensive study of their discriminative power,

S.-L. Wang and A. W.-C. Liew, “Physiological and behavioral lip bio- metrics: A comprehensive study of their discriminative power,”Pattern Recognit., vol. 45, no. 9, pp. 3328–3335, 2012

work page 2012
[13]

The lip as a biometric,

M. Chora ´s, “The lip as a biometric,”Pattern Anal. Appl., vol. 13, no. 1, pp. 105–112, 2010

work page 2010
[14]

Speaker identification by lipreading,

J. Luettin, N. A. Thacker, and S. W. Beet, “Speaker identification by lipreading,” inProc. Int. Conf. Spoken Lang. Process. (ICSLP), vol. 1, 1996, pp. 62–65

work page 1996
[15]

Local ordinal contrast pattern histograms for spatiotemporal, lip-based speaker authentication,

C.-H. Chan, B. Goswami, J. Kittler, and W. J. Christmas, “Local ordinal contrast pattern histograms for spatiotemporal, lip-based speaker authentication,”IEEE Trans. Inf. Forensics Security, vol. 7, no. 2, pp. 602–612, 2012

work page 2012
[16]

LipAuth: Securing smartphone user authentication with lip motion patterns,

L. Kuang, F. Zeng, D. Liu, H. Cao, H. Jiang, and J. Liu, “LipAuth: Securing smartphone user authentication with lip motion patterns,”IEEE Internet Things J., vol. 11, no. 1, pp. 1096–1109, 2023

work page 2023
[17]

DynamicLip: Shape-independent continuous authentication via lip articulator dynamics,

H. Chen, Y . Xu, Y . Feng, M. Jian, F. Liu, P. Hu, K. Peng, S. He, and Z. Wang, “DynamicLip: Shape-independent continuous authentication via lip articulator dynamics,”arXiv preprint arXiv:2501.01032, 2025

work page arXiv 2025
[18]

WhisperNetV2: SlowFast siamese network for lip-based biometrics,

A. Zakeri, H. Hassanpour, M. H. Khosravi, and A. M. Nourollah, “WhisperNetV2: SlowFast siamese network for lip-based biometrics,” arXiv preprint arXiv:2407.08717, 2024

work page arXiv 2024
[19]

Neuromor- phic event-based face identity recognition,

G. Moreira, A. Grac ¸a, B. Silva, P. Martins, and J. Batista, “Neuromor- phic event-based face identity recognition,” inProc. Int. Conf. Pattern Recognit. (ICPR), 2022, pp. 922–929

work page 2022
[20]

On the facilitative effects of face motion on face recognition and its development,

N. G. Xiao, S. Perrotta, P. C. Quinn, Z. Wang, Y .-H. P. Sun, and K. Lee, “On the facilitative effects of face motion on face recognition and its development,”Front. Psychol., vol. 5, p. 633, 2014

work page 2014
[21]

Towards mobile sensing with event cameras on high-agility resource-constrained devices: A survey,

H. Wang, R. Guo, P. Ma, C. Ruan, X. Luo, W. Ding, T. Zhong, J. Xu, Y . Liu, and X. Chen, “Towards mobile sensing with event cameras on high-agility resource-constrained devices: A survey,”arXiv preprint arXiv:2503.22943, 2025

work page arXiv 2025
[22]

LipPass: Lip reading-based user authentication on smartphones leveraging acoustic signals,

L. Lu, J. Yu, Y . Chen, H. Liu, Y . Zhu, Y . Liu, and M. Li, “LipPass: Lip reading-based user authentication on smartphones leveraging acoustic signals,” inProc. IEEE Conf. Comput. Commun. (INFOCOM), 2018, pp. 1466–1474

work page 2018
[23]

Lip-TWUID: Noninvasive through-wall user identification using SISO radar and lip movement micro-doppler signatures with limited samples,

K. Yang, D. Zhu, C. Han, J. Guo, S. Sun, and L. Sun, “Lip-TWUID: Noninvasive through-wall user identification using SISO radar and lip movement micro-doppler signatures with limited samples,”IEEE Trans. Instrum. Meas., 2025

work page 2025
[24]

A framework for event-based computer vision on a mobile device,

G. Lenz, S. Picaud, and S.-H. Ieng, “A framework for event-based computer vision on a mobile device,”arXiv preprint arXiv:2205.06836, 2022

work page arXiv 2022
[25]

Using a probabilistic neural network for lip-based biometric verification,

K. Wrobel, R. Doroz, P. Porwik, J. Naruniec, and M. Kowalski, “Using a probabilistic neural network for lip-based biometric verification,”Eng. Appl. Artif. Intell., vol. 64, pp. 112–127, 2017

work page 2017
[26]

Lip print recognition based on convolutional spiking neural network,

B. Niu, L. Wang, T. Wu, and X. Zhang, “Lip print recognition based on convolutional spiking neural network,” inProc. Int. Conf. Image Signal Process. Pattern Recognit. (ISPP), vol. 12707, 2023, pp. 890–894

work page 2023
[27]

Neuromorphic lip-reading with signed spiking gated recurrent units,

M. Dampfhoffer and T. Mesquida, “Neuromorphic lip-reading with signed spiking gated recurrent units,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 2141–2151

work page 2024
[28]

Spikepoint: An efficient point-based spiking neural network for event cam- eras action recognition

H. Ren, Y . Zhou, Y . Huang, H. Fu, X. Lin, J. Song, and B. Cheng, “SpikePoint: An efficient point-based spiking neural network for event cameras action recognition,”arXiv preprint arXiv:2310.07189, 2023

work page arXiv 2023
[29]

Multi-grained spatio-temporal features perceived network for event-based lip-reading,

G. Tan, Y . Wang, H. Han, Y . Cao, F. Wu, and Z.-J. Zha, “Multi-grained spatio-temporal features perceived network for event-based lip-reading,” inProc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 20 062–20 071

work page 2022
[30]

HOTS: A hierarchy of event-based time-surfaces for pattern recogni- tion,

X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman, “HOTS: A hierarchy of event-based time-surfaces for pattern recogni- tion,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 7, pp. 1346– 1359, 2016

work page 2016
[31]

Time- ordered recent event (TORE) volumes for event cameras,

R. W. Baldwin, R. Liu, M. Almatrafi, V . Asari, and K. Hirakawa, “Time- ordered recent event (TORE) volumes for event cameras,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 2, pp. 2519–2532, 2022

work page 2022
[32]

EGST: An efficient solution for human gaits recognition using neuromorphic vision sensor,

L. Chen, Z. Zhang, Y . Xiao, and Y . Wang, “EGST: An efficient solution for human gaits recognition using neuromorphic vision sensor,”IEEE Trans. Inf. Forensics Security, vol. 19, pp. 6144–6154, 2024

work page 2024
[33]

Hypergraph-based multi-view action recognition using event cameras,

Y . Gao, J. Lu, S. Li, Y . Li, and S. Du, “Hypergraph-based multi-view action recognition using event cameras,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 10, pp. 6610–6622, 2024

work page 2024
[34]

Space-time event clouds for gesture recognition: From RGB cameras to event cameras,

Q. Wang, Y . Zhang, J. Yuan, and Y . Lu, “Space-time event clouds for gesture recognition: From RGB cameras to event cameras,” inProc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2019, pp. 1826–1835

work page 2019
[35]

Deep lambertian net- works,

Y . Tang, R. Salakhutdinov, and G. E. Hinton, “Deep lambertian net- works,” inProc. Int. Conf. Mach. Learn. (ICML), 2012, pp. 1419–1426

work page 2012
[36]

PAM: Pose attention module for pose- invariant face recognition,

E.-J. Tsai and W.-C. Yeh, “PAM: Pose attention module for pose- invariant face recognition,”arXiv preprint arXiv:2111.11940, 2021

work page arXiv 2021
[37]

Disentangled representation learning GAN for pose-invariant face recognition,

L. Tran, X. Yin, and X. Liu, “Disentangled representation learning GAN for pose-invariant face recognition,” inProc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 1415–1424

work page 2017
[38]

Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis,

R. Huang, S. Zhang, T. Li, and R. He, “Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis,” inProc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2439–2448

work page 2017
[39]

Face recognition based on fitting a 3D morphable model,

V . Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1063–1074, 2003

work page 2003
[40]

Face recognition using a unified 3D morphable model,

G. Hu, F. Yan, C.-H. Chan, W. Deng, W. Christmas, J. Kittler, and N. M. Robertson, “Face recognition using a unified 3D morphable model,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 73–89

work page 2016
[41]

GaitSpike: Event-based gait recognition with spiking neural network,

Y . Tao, C.-H. Chang, S. Sa ¨ıghi, and S. Gao, “GaitSpike: Event-based gait recognition with spiking neural network,” inProc. IEEE Int. Conf. AI Circuits Syst. (AICAS), 2024, pp. 357–361

work page 2024
[42]

A comprehensive study on cross-view gait based human identification with deep cnns,

Z. Wu, Y . Huang, L. Wang, X. Wang, and T. Tan, “A comprehensive study on cross-view gait based human identification with deep cnns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 2, pp. 209–226, 2016

work page 2016
[43]

OuluVS2: A multi- view audiovisual database for non-rigid mouth motion analysis,

I. Anina, Z. Zhou, G. Zhao, and M. Pietik ¨ainen, “OuluVS2: A multi- view audiovisual database for non-rigid mouth motion analysis,” inProc. IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), vol. 1, 2015, pp. 1–5

work page 2015
[44]

End-to-end multi-view lipreading,

S. Petridis, Y . Wang, Z. Li, and M. Pantic, “End-to-end multi-view lipreading,”arXiv preprint arXiv:1709.00443, 2017

work page arXiv 2017
[45]

Multi-view automatic lip-reading using neural network,

D. Lee, J. Lee, and K.-E. Kim, “Multi-view automatic lip-reading using neural network,” inProc. Asian Conf. Comput. Vis. (ACCV), 2016, pp. 290–302

work page 2016
[46]

Low cost and latency event camera background activity denoising,

S. Guo and T. Delbruck, “Low cost and latency event camera background activity denoising,”IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 1, pp. 785–795, 2022

work page 2022
[47]

XM2VTSDB: The extended M2VTS database,

K. Messer, J. Matas, J. Kittler, J. Luettin, G. Maitreet al., “XM2VTSDB: The extended M2VTS database,” inProc. Int. Conf. Audio-Video-Based Biometric Person Authentication (AVBPA), 1999, pp. 965–966

work page 1999
[48]

Visual speech recognition with stochastic networks,

J. Movellan, “Visual speech recognition with stochastic networks,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 7, 1994

work page 1994
[49]

Multimodal speaker identifica- tion using an adaptive classifier cascade based on modality reliability,

E. Erzin, Y . Yemez, and A. M. Tekalp, “Multimodal speaker identifica- tion using an adaptive classifier cascade based on modality reliability,” IEEE Trans. Multimedia, vol. 7, no. 5, pp. 840–852, 2005

work page 2005
[50]

Lip reading in the wild,

J. S. Chung and A. Zisserman, “Lip reading in the wild,” inProc. Asian Conf. Comput. Vis. (ACCV), 2016, pp. 87–103

work page 2016
[51]

CREMA-D: Crowd-sourced emotional multimodal actors dataset,

H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “CREMA-D: Crowd-sourced emotional multimodal actors dataset,”IEEE Trans. Affect. Comput., vol. 5, no. 4, pp. 377–390, 2014

work page 2014
[52]

Speech database development at MIT: TIMIT and beyond,

V . Zue, S. Seneff, and J. Glass, “Speech database development at MIT: TIMIT and beyond,”Speech Commun., vol. 9, no. 4, pp. 351–356, 1990

work page 1990
[53]

An audio-visual corpus for speech perception and automatic speech recognition,

M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual corpus for speech perception and automatic speech recognition,”J. Acoust. Soc. Am., vol. 120, no. 5, pp. 2421–2424, 2006

work page 2006
[54]

Assessing the uniqueness and permanence of facial actions for use in biometric applications,

L. Benedikt, D. Cosker, P. L. Rosin, and D. Marshall, “Assessing the uniqueness and permanence of facial actions for use in biometric applications,”IEEE Trans. Syst., Man, Cybern. A, vol. 40, no. 3, pp. 449–460, May 2010

work page 2010
[55]

EgoEvGesture: Gesture recognition based on egocentric event camera,

L. Wang, H. Shi, X. Yin, K. Yang, K. Wang, and J. Bai, “EgoEvGesture: Gesture recognition based on egocentric event camera,”arXiv preprint arXiv:2503.12419, 2025

work page arXiv 2025
[56]

MTGA: Multi-view temporal granularity aligned aggregation for event-based lip- reading,

W. Zhang, J. Wang, Y . Luo, L. Yu, W. Yu, Z. He, and J. Shen, “MTGA: Multi-view temporal granularity aligned aggregation for event-based lip- reading,” inProc. AAAI Conf. Artif. Intell. (AAAI), vol. 39, no. 10, 2025, pp. 10 176–10 184

work page 2025
[57]

GET: Group event transformer for event-based vision,

Y . Peng, Y . Zhang, Z. Xiong, X. Sun, and F. Wu, “GET: Group event transformer for event-based vision,” inProc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2023, pp. 6038–6048

work page 2023
[58] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. Cui, and H. Wen, “Event-stream representation for human gaits identification using deep neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3436–3449, 2021.
[59] D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza, “End-to-end learning of representations for asynchronous event-based data,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 5633–5643.
[60] K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, L. Wang, and Y. Qiao, “UniFormerV2: Spatiotemporal learning by arming image ViTs with video UniFormer,” arXiv preprint arXiv:2211.09552, 2022.
[61] Z. Liu, L. Wang, W. Wu, C. Qian, and T. Lu, “TAM: Temporal adaptive module for video recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 13 708–13 718.
[62] G. Bertasius, H. Wang, and L. Torresani, “Is space-time attention all you need for video understanding?” in Proc. Int. Conf. Mach. Learn. (ICML), 2021.
[63] H. Shao, S. Qian, and Y. Liu, “Temporal interlacing network,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 34, no. 07, 2020, pp. 11 966–11 973.
[64] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “SlowFast networks for video recognition,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6202–6211.
[65] J. Lin, C. Gan, and S. Han, “TSM: Temporal shift module for efficient video understanding,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 7083–7093.
[66] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 6450–6459.
[67] B. Zhou, A. Andonian, A. Oliva, and A. Torralba, “Temporal relational reasoning in videos,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 803–818.
[68] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 7794–7803.
[69] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 20–36.
[70] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, pp. 4489–4497.

Junguang Yao (S’26) received the B.E. degree from the College of Electronic Engineering, South China Agricultural University, Guangzhou, China, in 2021, and the M.S...

