
arxiv: 2605.16550 · v1 · pith:BZTPSMO3new · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

Pith reviewed 2026-05-20 18:56 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords periocular recognition aggregation feature face video approach attention-aware

The pith

An attention-aware transformer learns to combine periocular video frames into one identity vector that beats simple averaging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that an encoder-only transformer equipped with attention can adaptively weigh and merge feature vectors from multiple frames of the eye region into a single robust representation for person recognition. This approach is motivated by surveillance settings where full-face or iris biometrics often fail due to poor angles, distance, or motion, yet the periocular area remains relatively stable and informative. A convolutional network first extracts per-frame features; the transformer then processes the sequence to learn which frames matter most rather than treating every frame equally. Results on the COX Face dataset show consistent gains over naive pooling methods in both verification and identification metrics.

Core claim

The authors claim that an attention-aware transformer aggregation module can adaptively learn to combine frame-level periocular features extracted by a deep CNN into a single video representation, and that this learned aggregation consistently outperforms naive schemes such as averaging or max-pooling, reaching 99.8 percent TPR at a false-positive rate of 0.1 and 96.6 percent Rank-5 accuracy on the COX Face dataset.

What carries the argument

The encoder-only transformer aggregation module, which receives a collection of frame-level feature vectors and outputs a single aggregated video feature vector by computing learned attention weights across frames.
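As a rough illustration of the difference between learned weighting and naive averaging, the sketch below pools toy frame features with a single softmax-attention query. It is a deliberately reduced stand-in, not the paper's encoder-only transformer: the `query` vector, the toy features, and all function names are assumptions made for this example only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(frames):
    """Naive baseline: every frame receives the same weight."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def attention_pool(frames, query):
    """Weight each frame by softmax(q . f / sqrt(d)), then sum the frames.

    `query` stands in for a learned parameter (hypothetical here).
    """
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(d) for f in frames]
    weights = softmax(scores)
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(d)]
    return pooled, weights

# Three toy 2-D frame features; the third plays the role of a degraded frame.
frames = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
query = [4.0, 0.0]  # illustrative value, not a trained parameter

pooled, weights = attention_pool(frames, query)
print("mean pooling:     ", mean_pool(frames))
print("attention weights:", [round(w, 3) for w in weights])
print("attention pooling:", [round(v, 3) for v in pooled])
```

In this toy setting the outlier frame receives a much smaller weight than the other two, so the pooled vector stays close to the consistent frames, whereas the mean is pulled toward the outlier. A full transformer encoder would add multiple heads, layers, and frame-to-frame interactions.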

Load-bearing premise

The video sequences and imaging conditions captured in the COX Face dataset are representative of the variability found in real surveillance deployments.

What would settle it

Evaluating the identical network architecture on a separate video periocular dataset collected under markedly different distances, lighting, or subject motion and observing whether the performance margin over naive aggregation vanishes.

Original abstract

Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.
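The Rank-5 figure is an identification metric: a probe counts as correct when its true identity appears among the five most similar gallery entries. The following is a minimal sketch, assuming a closed-set gallery and a precomputed probe-by-gallery similarity matrix; the function name, labels, and scores are hypothetical and not taken from the paper.

```python
def rank_k_accuracy(similarity, probe_labels, gallery_labels, k=5):
    """Fraction of probes whose true label is among the k most similar gallery entries."""
    hits = 0
    for scores, true_label in zip(similarity, probe_labels):
        # Gallery indices sorted from most to least similar.
        order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        top_k = {gallery_labels[j] for j in order[:k]}
        hits += true_label in top_k
    return hits / len(probe_labels)

# Toy data: 3 probe videos scored against a gallery of 4 still images.
gallery_labels = ["a", "b", "c", "d"]
probe_labels = ["a", "b", "c"]
similarity = [
    [0.9, 0.2, 0.1, 0.3],  # probe "a": correct match ranked first
    [0.6, 0.5, 0.1, 0.2],  # probe "b": correct match ranked second
    [0.4, 0.3, 0.1, 0.2],  # probe "c": correct match ranked last
]

for k in (1, 2, 4):
    print(f"Rank-{k}:", rank_k_accuracy(similarity, probe_labels, gallery_labels, k))
```

Rank-k accuracy can only grow with k, which is why a Rank-5 number is always at least as high as the corresponding Rank-1 number.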

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes an attention-aware transformer-based aggregation network for video periocular recognition. It consists of a CNN feature embedding module that extracts frame-level periocular features and an encoder-only transformer aggregation module that adaptively learns to combine these into a single video-level representation (and a still-image reference vector). On the COX Face dataset the method is reported to outperform naive aggregation baselines, reaching 99.8% TPR@1e-1 and 96.6% Rank-5.

Significance. If the empirical gains prove robust, the adaptive transformer aggregation could offer a practical improvement for video-based periocular biometrics in surveillance settings where single-frame or simple pooling methods are insufficient. The use of a public dataset and direct comparison against naive schemes is a positive step; however, the absence of cross-dataset or out-of-distribution testing limits the strength of any broader robustness claims.

major comments (2)
  1. Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.
  2. Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.
minor comments (1)
  1. Abstract: the notation 'TPR@1e-1' should be expanded on first use (e.g., true positive rate at a false positive rate of 0.1) for clarity to readers outside the immediate sub-field.
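To make the expanded notation concrete, the sketch below computes a true positive rate at a fixed false positive rate from two lists of comparison scores. It is an illustrative reading of the metric, assuming that higher scores mean greater similarity and that a pair is accepted when its score is strictly above the threshold; the function name and the toy scores are hypothetical.

```python
import math

def tpr_at_fpr(genuine, impostor, target_fpr):
    """Return (TPR, threshold) at the lowest threshold whose FPR is <= target_fpr."""
    imp = sorted(impostor, reverse=True)
    # Number of impostor pairs that may be wrongly accepted.
    allowed = math.floor(target_fpr * len(imp) + 1e-9)
    if allowed >= len(imp):
        threshold = float("-inf")  # every pair is accepted
    else:
        # Scores equal to the threshold are rejected, so at most `allowed`
        # impostor scores lie strictly above it.
        threshold = imp[allowed]
    tpr = sum(1 for s in genuine if s > threshold) / len(genuine)
    return tpr, threshold

# Toy similarity scores for same-identity and different-identity pairs.
genuine = [0.9, 0.8, 0.75, 0.6, 0.3]
impostor = [0.7, 0.5, 0.4, 0.35, 0.2, 0.15, 0.1, 0.1, 0.05, 0.0]

tpr, thr = tpr_at_fpr(genuine, impostor, target_fpr=0.1)
print("threshold:", thr, "TPR at FPR 0.1:", tpr)
```

A false positive rate of 0.1 is a permissive operating point: one in ten impostor comparisons is accepted, so TPR values reported there tend to sit close to 100 percent.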

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest responses and indicating revisions made to the manuscript.

  1. Referee: Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.

    Authors: We acknowledge that the evaluation is confined to the COX Face dataset, a standard public benchmark for video periocular recognition in unconstrained surveillance settings. While the transformer-based aggregation is intended to learn general frame-weighting rules rather than dataset-specific artifacts, we agree that the absence of cross-dataset testing limits stronger claims of broad robustness. In the revised manuscript we have moderated the language around 'robustness', explicitly noted the single-dataset scope, and added a dedicated limitations paragraph discussing the need for future out-of-distribution evaluation. New cross-dataset experiments are not feasible within the current revision timeline.
    revision: partial

  2. Referee: Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.

    Authors: We appreciate this observation. The revised manuscript now includes: (i) ablation tables isolating the contribution of the attention mechanism and the encoder-only transformer, (ii) complete training details (optimizer, learning-rate schedule, batch size, data augmentation, and convergence criteria), and (iii) statistical significance tests (McNemar’s test) comparing the proposed method against the naive pooling baselines. These additions appear in an expanded Experiments section and in the supplementary material.
    revision: yes
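The rebuttal names McNemar's test but gives no details of how it is applied. As one possible reading, the sketch below runs the exact (binomial) two-sided variant on paired per-trial outcomes of two methods; the choice of the exact variant, the function names, and the toy outcomes are assumptions for illustration.

```python
from math import comb

def discordant_counts(correct_a, correct_b):
    """Count trials where exactly one of the two methods is correct."""
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    return b, c

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    Under the null hypothesis the b discordant trials follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant trials, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Toy per-trial outcomes (1 = correct decision) on the same ten probes.
proposed = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
baseline = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

b, c = discordant_counts(proposed, baseline)
print("discordant counts:", (b, c), "p-value:", mcnemar_exact(b, c))
```

Only the trials on which the two methods disagree enter the test, so a large accuracy gap on a small probe set can still fail to reach significance.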

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on public dataset without self-referential reductions


The paper describes a feature embedding CNN followed by an encoder-only transformer aggregation module that learns to combine frame-level periocular features into a video representation. All performance claims (99.8% TPR@1e-1, 96.6% Rank-5 on COX Face) rest on direct experimental comparison against naive baselines rather than any closed-form derivation, fitted parameter renamed as prediction, or uniqueness theorem. No equations are presented that define a quantity in terms of itself, and no load-bearing self-citations are invoked to justify architectural choices. The method is therefore self-contained as a standard deep-learning proposal whose validity is assessed externally on a fixed public benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the transformer can learn useful attention weights from the COX dataset without additional regularization or domain-specific priors beyond standard training.

free parameters (1)
  • Transformer hyperparameters and training schedule
    Learning rate, number of layers, attention heads, and optimization choices are fitted or chosen to achieve the reported numbers.
axioms (1)
  • domain assumption: Standard supervised classification loss on identity labels is sufficient to train both embedding and aggregation modules.
    Invoked implicitly when the network is trained end-to-end on labeled periocular videos.

pith-pipeline@v0.9.0 · 5736 in / 1160 out tokens · 35633 ms · 2026-05-20T18:56:55.588066+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

    INTRODUCTION Video periocular recognition has received much attention from the scientific community in recent years. This research area has become a relevant study field, mainly because it plays an important role in many real-world applications such as visual surveillance [1, 2] and video search. Compared to single still image-based periocular recogniti...

  2. [2]

    In this work, we consider recognition methods based on video or image sets that exploit periocular or facial data

    RELATED WORKS Since the periocular region is extracted from facial images, both video-based and image set-based face and periocular recognition share several common characteristics. In this work, we consider recognition methods based on video or image sets that exploit periocular or facial data. Regarding image set-based recognition studies, previous wor...

  3. [3]

    The first component is a deep convolutional network that acts as a feature extractor

    PROPOSED APPROACH Our aggregation framework is composed of two main modules: the feature embedding and the aggregation (Figure 1). The first component is a deep convolutional network that acts as a feature extractor. The second part is the aggregation module, which combines the feature vectors of all video frames (query) to form a video-level feature ...

  4. [4]

    To train and evaluate our aggregation module, we use the COX Face [12] database, which contains 1k subjects and 3k videos

    EXPERIMENTAL RESULTS Datasets and protocols: The VGGFace2 [20] dataset which includes around 3.31M images of about 92k classes is used to train our feature embedding module for the still image-based periocular recognition task. To train and evaluate our aggregation module, we use the COX Face [12] database, which contains 1k subjects and 3k videos. We ad...

  5. [5]

    CONCLUSIONS This paper presented a novel feature aggregation scheme for video periocular recognition that consistently outperforms all baseline methods in the COX Face dataset. The experimental results show statistically significant gains under both verification and identification protocols, demonstrating the effectiveness and robustness of the propos...

  6. [6]

    Deep periocular representation aiming video surveillance,

    Eduardo Luz, Gladston Moreira, Luiz Antonio Zanlorensi Junior, and David Menotti, “Deep periocular representation aiming video surveillance,” Pattern Recognition Letters, vol. 114, pp. 2–12, 2018

  7. [7]

    Convolutional neural network-based periocular recognition in surveillance environments,

    Min Cheol Kim, Ja Hyung Koo, Se Woon Cho, Na Rae Baek, and Kang Ryoung Park, “Convolutional neural network-based periocular recognition in surveillance environments,” IEEE Access, vol. 6, pp. 57291–57310, 2018

  8. [8]

    Performance evaluation of local appearance based periocular recognition,

    Philip E Miller, Jamie R Lyle, Shrinivas J Pundlik, and Damon L Woodard, “Performance evaluation of local appearance based periocular recognition,” in 2010 Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS). IEEE, 2010, pp. 1–6

  9. [9]

    Periocular biometrics in the visible spectrum: A feasibility study,

    Unsang Park, Arun Ross, and Anil K Jain, “Periocular biometrics in the visible spectrum: A feasibility study,” in 2009 IEEE 3rd international conference on biometrics: theory, applications, and systems. IEEE, 2009, pp. 1–6

  10. [10]

    Accurate periocular recognition under less constrained environment using semantics-assisted convolutional neural network,

    Zijing Zhao and Ajay Kumar, “Accurate periocular recognition under less constrained environment using semantics-assisted convolutional neural network,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 5, pp. 1017–1030, 2016

  11. [11]

    Attribute-based deep periocular recognition: Leveraging soft biometrics to improve periocular recognition,

    Veeru Talreja, Nasser M Nasrabadi, and Matthew C Valenti, “Attribute-based deep periocular recognition: Leveraging soft biometrics to improve periocular recognition,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 4041–4050

  12. [12]

    A new periocular dataset collected by mobile devices in unconstrained scenarios,

    Luiz A Zanlorensi, Rayson Laroca, Diego R Lucio, Lucas R Santos, Alceu S Britto Jr, and David Menotti, “A new periocular dataset collected by mobile devices in unconstrained scenarios,” Scientific Reports, vol. 12, no. 1, pp. 17989, 2022

  13. [13]

    Feature aggregation network for video face recognition,

    Zhaoxiang Liu, Huan Hu, Jinqiang Bai, Shaohua Li, and Shiguo Lian, “Feature aggregation network for video face recognition,” in Proceedings of the IEEE/CVF international conference on computer vision workshops, 2019, pp. 0–0

  14. [14]

    Deepface: Closing the gap to human-level performance in face verification,

    Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708

  15. [15]

    Cnn based key frame extraction for face in video recognition,

    Xuan Qi, Chen Liu, and Stephanie Schuckers, “Cnn based key frame extraction for face in video recognition,” in 2018 IEEE 4th international conference on identity, security, and behavior analysis (ISBA). IEEE, 2018, pp. 1–8

  16. [16]

    Boosting face in video recognition via cnn based key frame extraction,

    Xuan Qi, Chen Liu, and Stephanie Schuckers, “Boosting face in video recognition via cnn based key frame extraction,” in 2018 International conference on biometrics (ICB). IEEE, 2018, pp. 132–139

  17. [17]

    A benchmark and comparative study of video-based face recognition on cox face database,

    Zhiwu Huang, Shiguang Shan, Ruiping Wang, Haihong Zhang, Shihong Lao, Alifu Kuerban, and Xilin Chen, “A benchmark and comparative study of video-based face recognition on cox face database,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5967–5981, 2015

  18. [18]

    Video-based face recognition using probabilistic appearance manifolds,

    Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman, “Video-based face recognition using probabilistic appearance manifolds,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. IEEE, 2003, vol. 1, pp. I–I

  19. [19]

    Face recognition with image sets using manifold density divergence,

    Ognjen Arandjelovic, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell, “Face recognition with image sets using manifold density divergence,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). IEEE, 2005, vol. 1, pp. 581–588

  20. [20]

    Boosted manifold principal angles for image set-based recognition,

    Tae-Kyun Kim, Ognjen Arandjelović, and Roberto Cipolla, “Boosted manifold principal angles for image set-based recognition,” Pattern Recognition, vol. 40, no. 9, pp. 2475–2484, 2007

  21. [21]

    Manifold-manifold distance with application to face recognition based on image set,

    Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao, “Manifold-manifold distance with application to face recognition based on image set,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8

  22. [22]

    Facenet: A unified embedding for face recognition and clustering,

    Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823

  23. [23]

    Dpr-v2s: A deep framework for periocular recognition in surveillance environments,

    Luiz Guilherme Fonseca Carreira, David Menotti, and William Robson Schwartz, “Dpr-v2s: A deep framework for periocular recognition in surveillance environments,” in 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2024, pp. 1–6

  24. [24]

    Neural aggregation network for video face recognition,

    Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua, “Neural aggregation network for video face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4362–4371

  25. [25]

    Vggface2: A dataset for recognising faces across pose and age,

    Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018). IEEE, 2018, pp. 67–74