Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition
Pith reviewed 2026-05-20 18:56 UTC · model grok-4.3
The pith
An attention-aware transformer learns to combine periocular video frames into one identity vector that beats simple averaging.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that an attention-aware transformer aggregation module can adaptively learn to combine frame-level periocular features, extracted by a deep CNN, into a single video representation. They report that this learned aggregation consistently outperforms naive schemes such as averaging or max-pooling on the COX Face dataset, reaching in the best scenario a 99.8 percent true positive rate (TPR) at a false positive rate of 0.1 and 96.6 percent Rank-5 accuracy.
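For readers unfamiliar with the identification metric, the following is a minimal sketch of how a Rank-k accuracy such as Rank-5 is typically computed from a probe-by-gallery similarity matrix. It is an illustration only, not the authors' evaluation code: the function name `rank_k_accuracy`, the labels, and the similarity scores are all hypothetical.

```python
def rank_k_accuracy(similarity, probe_labels, gallery_labels, k):
    """Fraction of probes whose true identity appears among the k most
    similar gallery entries (higher score = more similar)."""
    hits = 0
    for row, true_label in zip(similarity, probe_labels):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top_k = [gallery_labels[j] for j in order[:k]]
        hits += true_label in top_k
    return hits / len(probe_labels)


# Hypothetical scores: 3 probe videos (rows) against 4 gallery stills (columns).
similarity = [
    [0.9, 0.2, 0.1, 0.3],
    [0.4, 0.5, 0.8, 0.3],
    [0.1, 0.6, 0.2, 0.7],
]
gallery_labels = ["a", "b", "c", "d"]
probe_labels = ["a", "b", "d"]

rank1 = rank_k_accuracy(similarity, probe_labels, gallery_labels, k=1)
rank3 = rank_k_accuracy(similarity, probe_labels, gallery_labels, k=3)
```

In this toy data the second probe is missed at rank 1 but recovered at rank 3, which is why Rank-k figures for larger k are never lower than Rank-1.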
What carries the argument
The encoder-only transformer aggregation module. It receives a collection of frame-level feature vectors and outputs a single aggregated video feature vector by computing learned attention weights across frames.
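As a rough illustration of what "learned attention weights across frames" means, the sketch below pools frame vectors with a softmax over per-frame scores and compares the result with plain averaging. It is not the authors' module: a single scoring vector (`query`, assumed here to stand in for learned parameters) replaces the full encoder-only transformer, and every name and number is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight each frame by softmax(<frame, query> / sqrt(d)) and sum."""
    dim = len(query)
    scores = [
        sum(f * q for f, q in zip(frame, query)) / math.sqrt(dim)
        for frame in frames
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[i] for w, frame in zip(weights, frames))
        for i in range(dim)
    ]
    return pooled, weights


def average_pool(frames):
    """Naive baseline: every frame contributes equally."""
    n = len(frames)
    return [sum(frame[i] for frame in frames) / n for i in range(len(frames[0]))]


# Hypothetical 3-D frame features; the third frame is a noisy outlier.
frames = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.1], [-0.5, 1.0, 0.0]]
query = [4.0, -4.0, 0.0]  # hypothetical stand-in for a learned scoring vector

pooled, weights = attention_pool(frames, query)
mean = average_pool(frames)
```

With these toy values the outlier frame receives a near-zero weight, so the attention-pooled vector stays close to the two consistent frames, whereas the average is pulled toward the outlier.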
Load-bearing premise
The video sequences and imaging conditions captured in the COX Face dataset are representative of the variability found in real surveillance deployments.
What would settle it
Evaluating the identical network architecture on a separate video periocular dataset collected under markedly different distances, lighting, or subject motion and observing whether the performance margin over naive aggregation vanishes.
Original abstract
Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes an attention-aware transformer-based aggregation network for video periocular recognition. It consists of a CNN feature embedding module that extracts frame-level periocular features and an encoder-only transformer aggregation module that adaptively learns to combine these into a single video-level representation (and a still-image reference vector). On the COX Face dataset the method is reported to outperform naive aggregation baselines, reaching 99.8% TPR@1e-1 and 96.6% Rank-5.
Significance. If the empirical gains prove robust, the adaptive transformer aggregation could offer a practical improvement for video-based periocular biometrics in surveillance settings where single-frame or simple pooling methods are insufficient. The use of a public dataset and direct comparison against naive schemes is a positive step; however, the absence of cross-dataset or out-of-distribution testing limits the strength of any broader robustness claims.
major comments (2)
- Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.
- Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.
minor comments (1)
- Abstract: the notation 'TPR@1e-1' should be expanded on first use (e.g., true positive rate at a false positive rate of 0.1) for clarity to readers outside the immediate sub-field.
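To make the expanded notation concrete, here is a minimal sketch of how a true positive rate at a fixed false positive rate can be read off from verification scores. It is an illustration under stated assumptions, not the paper's protocol: the function name `tpr_at_fpr`, the accept rule (score at or above threshold), and the score lists are hypothetical.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """TPR at the most lenient threshold whose FPR stays at or below
    target_fpr. A pair is accepted when its score is >= threshold."""
    best_tpr = 0.0  # a threshold above every score accepts nothing
    # Candidate thresholds: every observed score, strictest first.
    for threshold in sorted(set(genuine + impostor), reverse=True):
        fpr = sum(s >= threshold for s in impostor) / len(impostor)
        if fpr > target_fpr:
            break  # FPR only grows as the threshold drops further
        best_tpr = sum(s >= threshold for s in genuine) / len(genuine)
    return best_tpr


# Hypothetical similarity scores for same-identity and different-identity pairs.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.7, 0.5, 0.45, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05]

tpr_at_01 = tpr_at_fpr(genuine, impostor, target_fpr=0.1)
tpr_at_0 = tpr_at_fpr(genuine, impostor, target_fpr=0.0)
```

Allowing one false accept in ten impostor pairs lets the threshold drop far enough to accept four of the five genuine pairs; demanding zero false accepts keeps only three.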
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest responses and indicating revisions made to the manuscript.
- Referee: Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.
Authors: We acknowledge that the evaluation is confined to the COX Face dataset, a public video-based face benchmark, used here for periocular recognition in unconstrained surveillance settings. While the transformer-based aggregation is intended to learn general frame-weighting rules rather than dataset-specific artifacts, we agree that the absence of cross-dataset testing limits stronger claims of broad robustness. In the revised manuscript we have moderated the language around 'robustness,' explicitly noted the single-dataset scope, and added a dedicated limitations paragraph discussing the need for future out-of-distribution evaluation. New cross-dataset experiments are not feasible within the current revision timeline.
Revision: partial
- Referee: Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.
Authors: We appreciate this observation. The revised manuscript now includes: (i) ablation tables isolating the contribution of the attention mechanism and the encoder-only transformer, (ii) complete training details (optimizer, learning-rate schedule, batch size, data augmentation, and convergence criteria), and (iii) statistical significance tests (McNemar’s test) comparing the proposed method against the naive pooling baselines. These additions are placed in an expanded Experiments section and its supplementary material.
Revision: yes
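For context on the significance test named in the response, the following is a minimal sketch of an exact two-sided McNemar test on paired per-probe outcomes. It shows what such a test computes, not the authors' analysis: the helper names and the outcome lists are hypothetical.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count probes where exactly one of the two methods is correct."""
    b = sum(a and not o for a, o in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(o and not a for a, o in zip(correct_a, correct_b))  # A wrong, B right
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value: under the null hypothesis the
    discordant pairs split as Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-probe correctness for a learned and a naive aggregator.
learned = [True] * 9 + [False] + [True] * 5
naive = [False] * 9 + [True] + [True] * 5

b, c = discordant_counts(learned, naive)
p_value = mcnemar_exact(b, c)
```

Only the discordant probes enter the test; the five probes both methods get right do not affect the p-value.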
Circularity Check
No circularity: empirical architecture validated on public dataset without self-referential reductions
The paper describes a feature embedding CNN followed by an encoder-only transformer aggregation module that learns to combine frame-level periocular features into a video representation. All performance claims (99.8% TPR@1e-1, 96.6% Rank-5 on COX Face) rest on direct experimental comparison against naive baselines rather than any closed-form derivation, fitted parameter renamed as prediction, or uniqueness theorem. No equations are presented that define a quantity in terms of itself, and no load-bearing self-citations are invoked to justify architectural choices. The method is therefore self-contained as a standard deep-learning proposal whose validity is assessed externally on a fixed public benchmark.
Axiom & Free-Parameter Ledger
free parameters (1)
- Transformer hyperparameters and training schedule
axioms (1)
- Domain assumption: standard supervised classification loss on identity labels is sufficient to train both the embedding and aggregation modules.
Reference graph
Works this paper leans on
- [1] Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition (arXiv, 2026). INTRODUCTION. Video periocular recognition has received much attention from the scientific community in recent years. This research area has become a relevant study field, mainly because it plays an important role in many real-world applications such as visual surveillance [1, 2] and video search. Compared to single still image-based periocular recogniti...
- [2] RELATED WORKS. Since the periocular region is extracted from facial images, both video-based and image set-based face and periocular recognition share several common characteristics. In this work, we consider recognition methods based on video or image sets that exploit periocular or facial data. Regarding image set-based recognition studies, previous wor...
- [3] PROPOSED APPROACH. Our aggregation framework is composed of two main modules: the feature embedding and the aggregation (Figure 1). The first component is a deep convolutional network that acts as a feature extractor. The second part is the aggregation module, which combines the feature vectors of all video frames (query) to form a video-level feature ...
- [4] EXPERIMENTAL RESULTS. Datasets and protocols: The VGGFace2 [20] dataset which includes around 3.31M images of about 92k classes is used to train our feature embedding module for the still image-based periocular recognition task. To train and evaluate our aggregation module, we use the COX Face [12] database, which contains 1k subjects and 3k videos. We ad...
- [5] CONCLUSIONS. This paper presented a novel feature aggregation scheme for video periocular recognition that consistently outperforms all baseline methods in the COX Face dataset. The experimental results show statistically significant gains under both verification and identification protocols, demonstrating the effectiveness and robustness of the propos...
- [6] Eduardo Luz, Gladston Moreira, Luiz Antonio Zanlorensi Junior, and David Menotti, “Deep periocular representation aiming video surveillance,” Pattern Recognition Letters, vol. 114, pp. 2–12, 2018.
- [7] Min Cheol Kim, Ja Hyung Koo, Se Woon Cho, Na Rae Baek, and Kang Ryoung Park, “Convolutional neural network-based periocular recognition in surveillance environments,” IEEE Access, vol. 6, pp. 57291–57310, 2018.
- [8] Philip E Miller, Jamie R Lyle, Shrinivas J Pundlik, and Damon L Woodard, “Performance evaluation of local appearance based periocular recognition,” in 2010 Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS). IEEE, 2010, pp. 1–6.
- [9] Unsang Park, Arun Ross, and Anil K Jain, “Periocular biometrics in the visible spectrum: A feasibility study,” in 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems. IEEE, 2009, pp. 1–6.
- [10] Zijing Zhao and Ajay Kumar, “Accurate periocular recognition under less constrained environment using semantics-assisted convolutional neural network,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 5, pp. 1017–1030, 2016.
- [11] Veeru Talreja, Nasser M Nasrabadi, and Matthew C Valenti, “Attribute-based deep periocular recognition: Leveraging soft biometrics to improve periocular recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 4041–4050.
- [12] Luiz A Zanlorensi, Rayson Laroca, Diego R Lucio, Lucas R Santos, Alceu S Britto Jr, and David Menotti, “A new periocular dataset collected by mobile devices in unconstrained scenarios,” Scientific Reports, vol. 12, no. 1, pp. 17989, 2022.
- [13] Zhaoxiang Liu, Huan Hu, Jinqiang Bai, Shaohua Li, and Shiguo Lian, “Feature aggregation network for video face recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [14] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
- [15] Xuan Qi, Chen Liu, and Stephanie Schuckers, “CNN based key frame extraction for face in video recognition,” in 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA). IEEE, 2018, pp. 1–8.
- [16] Xuan Qi, Chen Liu, and Stephanie Schuckers, “Boosting face in video recognition via CNN based key frame extraction,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 132–139.
- [17] Zhiwu Huang, Shiguang Shan, Ruiping Wang, Haihong Zhang, Shihong Lao, Alifu Kuerban, and Xilin Chen, “A benchmark and comparative study of video-based face recognition on COX Face database,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5967–5981, 2015.
- [18] Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman, “Video-based face recognition using probabilistic appearance manifolds,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. IEEE, 2003, vol. 1, pp. I–I.
- [19] Ognjen Arandjelovic, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell, “Face recognition with image sets using manifold density divergence,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). IEEE, 2005, vol. 1, pp. 581–588.
- [20] Tae-Kyun Kim, Ognjen Arandjelović, and Roberto Cipolla, “Boosted manifold principal angles for image set-based recognition,” Pattern Recognition, vol. 40, no. 9, pp. 2475–2484, 2007.
- [21] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao, “Manifold-manifold distance with application to face recognition based on image set,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
- [22] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
- [23] Luiz Guilherme Fonseca Carreira, David Menotti, and William Robson Schwartz, “Dpr-v2s: A deep framework for periocular recognition in surveillance environments,” in 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2024, pp. 1–6.
- [24] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua, “Neural aggregation network for video face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4362–4371.
- [25] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.