Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition

Breno A Mariano; David Menotti; Luiz G F Carreira; Victor H C de Melo; William Robson Schwartz

arxiv: 2605.16550 · v1 · pith:BZTPSMO3new · submitted 2026-05-15 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-20 18:56 UTC · model grok-4.3

keywords: periocular · recognition · aggregation · feature · face · video · approach · attention-aware

The pith

An attention-aware transformer learns to combine periocular video frames into one identity vector that beats simple averaging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that an encoder-only transformer equipped with attention can adaptively weigh and merge feature vectors from multiple frames of the eye region into a single robust representation for person recognition. This approach is motivated by surveillance settings where full-face or iris biometrics often fail due to poor angles, distance, or motion, yet the periocular area remains relatively stable and informative. A convolutional network first extracts per-frame features; the transformer then processes the sequence to learn which frames matter most rather than treating every frame equally. Results on the COX Face dataset show consistent gains over naive pooling methods in both verification and identification metrics.

Core claim

The authors claim that an attention-aware transformer aggregation module can adaptively learn to combine frame-level periocular features extracted by a deep CNN into a single video representation, and that this learned aggregation consistently outperforms naive schemes such as averaging or max-pooling, reaching 99.8 percent TPR at a false-positive rate of 0.1 and 96.6 percent Rank-5 accuracy on the COX Face dataset.
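To make the identification metric concrete, here is a minimal Python sketch of Rank-k accuracy, the quantity behind the reported Rank-5 figure. The function name `rank_k_accuracy`, the toy similarity matrix, and the identity labels are illustrative assumptions; none of them come from the paper or its evaluation code.

```python
def rank_k_accuracy(similarity, probe_ids, gallery_ids, k):
    """Fraction of probes whose true identity is among the k most similar gallery entries."""
    hits = 0
    for row, true_id in zip(similarity, probe_ids):
        # Gallery indices sorted by decreasing similarity to this probe.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        top_ids = [gallery_ids[j] for j in order[:k]]
        hits += true_id in top_ids
    return hits / len(probe_ids)


# Toy data (hypothetical): 3 probe videos scored against 3 gallery stills.
gallery_ids = ["a", "b", "c"]
probe_ids = ["a", "b", "c"]
similarity = [
    [0.9, 0.2, 0.1],  # probe "a": correct match ranked first
    [0.6, 0.5, 0.1],  # probe "b": correct match ranked second
    [0.8, 0.7, 0.3],  # probe "c": correct match ranked third
]

for k in (1, 2, 3):
    print(k, rank_k_accuracy(similarity, probe_ids, gallery_ids, k))
```

Rank-k can only grow with k, so a Rank-5 figure says little about how often the correct identity is ranked first.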

What carries the argument

The encoder-only transformer aggregation module, which receives a collection of frame-level feature vectors and outputs a single aggregated video feature vector by computing learned attention weights across frames.
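The contrast between learned weighting and naive pooling can be sketched in a few lines of Python. This is a deliberately simplified stand-in: a single softmax attention step with one query vector, not the encoder-only transformer the paper describes. The names `mean_pool` and `attention_pool`, the fixed `query`, and the toy frame features are all assumptions made for illustration; in a trained model the weighting would be learned rather than set by hand.

```python
import math


def mean_pool(frames):
    """Naive baseline: average the frame-level feature vectors."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight each frame by softmax(query . frame / sqrt(d)), then sum the frames."""
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(d) for f in frames]
    weights = softmax(scores)
    pooled = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(d)]
    return pooled, weights


# Toy input (hypothetical): three 2-D frame features; the third is an outlier,
# standing in for a blurred or badly posed frame.
frames = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.5]]
query = [4.0, 0.0]  # hand-picked here; a real model would learn this

pooled, weights = attention_pool(frames, query)
print("weights:", weights)
print("attention pooled:", pooled)
print("mean pooled:", mean_pool(frames))
```

With these toy numbers the outlier frame receives almost no weight, so the attention-pooled vector stays close to the two consistent frames, whereas the mean is pulled toward the outlier.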

Load-bearing premise

The video sequences and imaging conditions captured in the COX Face dataset are representative of the variability found in real surveillance deployments.

What would settle it

Evaluating the identical network architecture on a separate video periocular dataset collected under markedly different distances, lighting, or subject motion and observing whether the performance margin over naive aggregation vanishes.

read the original abstract

Video periocular recognition is the task of recognizing an individual's identity based on the region around an individual's eyes. The periocular area is one of the most discriminative regions of the human face, making it suitable for recognition tasks. Its use as a biometric modality has emerged as an alternative, especially in surveillance scenarios where conventional biometric traits such as face or iris recognition become unfeasible due to unconstrained acquisition conditions. This paper proposes an attention-aware approach for video-based periocular recognition in surveillance environments. The framework consists of two main modules: feature embedding and aggregation. The feature embedding module is a deep convolutional neural network that maps periocular data to feature vectors. The aggregation module is an encoder-only transformer that adaptively learns to aggregate frame-level features into a single video representation and a feature vector for the still reference image. Experiments on the publicly available COX Face dataset show the robustness of the proposed method, consistently outperforming naive aggregation schemes. In the best scenario, the approach achieves $99.8\%$ of TPR@$1e^{-1}$ and $96.6\%$ of Rank-5.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Transformer aggregation for periocular video beats naive baselines on COX Face but stays untested outside that dataset.

The paper's main result is that an attention-aware transformer can aggregate features from periocular video frames more effectively than standard methods, leading to better recognition performance on the COX Face dataset. It reaches 99.8% TPR at 1e-1 false positive rate and 96.6% Rank-5 identification.

What is new here is applying the encoder-only transformer specifically as an adaptive aggregator for video periocular biometrics. The feature embedding comes from a deep CNN, and the transformer learns weights for combining the frame vectors into a single representation, while also producing a vector for the still reference image. This fits the surveillance use case where acquisition is unconstrained.

The paper does well in demonstrating that this beats naive schemes like mean or max pooling. The numbers are strong, and the idea is a logical extension of transformer use in video tasks to this biometric area.

On the soft spots, the main issue is the single-dataset evaluation. As the stress-test points out, if COX Face has less variation than typical surveillance footage, the attention mechanism might overfit to those patterns. No cross-dataset testing is reported, which limits how much we can say about robustness. Also, the abstract lacks details on ablations or significance, so the full paper would need to show those to make the claims more convincing. That said, the central empirical claim holds on the data presented. The work engages honestly with the literature on biometrics and transformers.

This paper is aimed at researchers in computer vision and biometrics who deal with video-based recognition under difficult conditions. Someone looking for practical improvements in aggregation for periocular or face video could find it useful to build on. I would recommend sending it for peer review. It's a focused, applied contribution that adds a tool to the area without overclaiming.

Referee Report

2 major / 1 minor

Summary. The paper proposes an attention-aware transformer-based aggregation network for video periocular recognition. It consists of a CNN feature embedding module that extracts frame-level periocular features and an encoder-only transformer aggregation module that adaptively learns to combine these into a single video-level representation (and a still-image reference vector). On the COX Face dataset the method is reported to outperform naive aggregation baselines, reaching 99.8% TPR@1e-1 and 96.6% Rank-5.

Significance. If the empirical gains prove robust, the adaptive transformer aggregation could offer a practical improvement for video-based periocular biometrics in surveillance settings where single-frame or simple pooling methods are insufficient. The use of a public dataset and direct comparison against naive schemes is a positive step; however, the absence of cross-dataset or out-of-distribution testing limits the strength of any broader robustness claims.

major comments (2)

Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.
Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.

minor comments (1)

Abstract: the notation 'TPR@1e-1' should be expanded on first use (e.g., true positive rate at a false positive rate of 0.1) for clarity to readers outside the immediate sub-field.
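For readers unfamiliar with the notation, the following Python sketch shows one common way to compute TPR at a fixed FPR from verification scores. The function name `tpr_at_fpr`, the decision rule (accept when the score is at or above a threshold), and the toy scores are assumptions for illustration; the paper's exact evaluation procedure is not given in the abstract.

```python
def tpr_at_fpr(genuine, impostor, target_fpr):
    """Highest TPR over thresholds whose FPR does not exceed target_fpr.

    A pair is accepted when its score is >= the threshold. If no observed
    score qualifies as a threshold, the result is 0.0 (reject everything).
    """
    best = 0.0
    for t in sorted(set(genuine) | set(impostor)):
        fpr = sum(s >= t for s in impostor) / len(impostor)
        if fpr <= target_fpr:
            tpr = sum(s >= t for s in genuine) / len(genuine)
            best = max(best, tpr)
    return best


# Toy scores (hypothetical): 5 genuine pairs and 10 impostor pairs.
genuine = [0.9, 0.8, 0.75, 0.6, 0.3]
impostor = [0.7, 0.5, 0.45, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1, 0.05]

print(tpr_at_fpr(genuine, impostor, 0.1))  # TPR@1e-1 on the toy scores
```

An FPR of 0.1 tolerates one false accept in ten impostor comparisons, which is a lenient operating point; stricter points such as 1e-3 are usually harder to satisfy.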

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing honest responses and indicating revisions made to the manuscript.

Referee: Experiments section: performance is demonstrated exclusively on the COX Face dataset with no cross-dataset evaluation or out-of-distribution testing. This directly affects the claim that the method shows 'robustness' in general surveillance conditions, as the learned attention weights may be tuned to COX-specific statistics rather than general aggregation rules.

Authors: We acknowledge that the evaluation is confined to the COX Face dataset, a standard public benchmark for video periocular recognition in unconstrained surveillance settings. While the transformer-based aggregation is intended to learn general frame-weighting rules rather than dataset-specific artifacts, we agree that the absence of cross-dataset testing limits stronger claims of broad robustness. In the revised manuscript we have moderated the language around 'robustness,' explicitly noting the single-dataset scope, and added a dedicated limitations paragraph discussing the need for future out-of-distribution evaluation. New cross-dataset experiments are not feasible within the current revision timeline.

revision: partial

Referee: Experimental protocol and results: the manuscript provides headline metrics but does not include ablation tables, full training details, or statistical significance tests for the reported improvements over naive baselines. Without these, it is not possible to determine whether the central claim of consistent outperformance is reliably supported.

Authors: We appreciate this observation. The revised manuscript now includes: (i) ablation tables isolating the contribution of the attention mechanism and the encoder-only transformer, (ii) complete training details (optimizer, learning-rate schedule, batch size, data augmentation, and convergence criteria), and (iii) statistical significance tests (McNemar’s test) comparing the proposed method against the naive pooling baselines. These additions are placed in an expanded Experiments section and its supplementary material.

revision: yes
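As a reminder of what McNemar's test measures, here is a small Python sketch of its exact (binomial) form for two classifiers evaluated on the same probes. The rebuttal does not say which variant of the test is used, so the exact form is an assumption, as are the function name `mcnemar_exact` and the example counts. Only the discordant pairs matter: `b` counts probes the baseline gets right and the proposed method gets wrong, and `c` counts the reverse.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: the methods disagree on 10 probes,
# and the proposed method is the correct one on 9 of them.
print(mcnemar_exact(1, 9))
```

Because the test ignores probes on which both methods agree, a large accuracy on an easy benchmark can still leave few discordant pairs and therefore weak statistical evidence.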

Circularity Check

0 steps flagged

No circularity: empirical architecture validated on public dataset without self-referential reductions

The paper describes a feature embedding CNN followed by an encoder-only transformer aggregation module that learns to combine frame-level periocular features into a video representation. All performance claims (99.8% TPR@1e-1, 96.6% Rank-5 on COX Face) rest on direct experimental comparison against naive baselines rather than any closed-form derivation, fitted parameter renamed as prediction, or uniqueness theorem. No equations are presented that define a quantity in terms of itself, and no load-bearing self-citations are invoked to justify architectural choices. The method is therefore self-contained as a standard deep-learning proposal whose validity is assessed externally on a fixed public benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that the transformer can learn useful attention weights from the COX dataset without additional regularization or domain-specific priors beyond standard training.

free parameters (1)

Transformer hyperparameters and training schedule
Learning rate, number of layers, attention heads, and optimization choices are fitted or chosen to achieve the reported numbers.

axioms (1)

Domain assumption: standard supervised classification loss on identity labels is sufficient to train both embedding and aggregation modules.
Invoked implicitly when the network is trained end-to-end on labeled periocular videos.
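The kind of loss this axiom refers to can be illustrated with a short Python sketch of softmax cross-entropy over identity logits. The ledger only infers that a standard classification loss is used, so treating it as softmax cross-entropy is an assumption, and the function name `cross_entropy` and the toy logits are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample: -log softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


# Toy logits over three identities (hypothetical values).
logits = [10.0, 0.0, 0.0]
print(cross_entropy(logits, 0))  # small loss: the model favours identity 0
print(cross_entropy(logits, 1))  # large loss: the true identity scores low
```

The loss depends only on how the true identity's logit compares with the others, which is why it can supervise the aggregated video vector without any per-frame quality labels.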

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1] Attention-Aware Transformer-Based Aggregation Network for Video Periocular Recognition
INTRODUCTION Video periocular recognition has received much attention from the scientific community in recent years. This research area has become a relevant study field, mainly because it plays an important role in many real-world applications such as visual surveillance [1, 2] and video search. Compared to single still image-based periocular recogniti...
[2] In this work, we consider recognition methods based on video or image sets that exploit periocular or facial data
RELATED WORKS Since the periocular region is extracted from facial images, both video-based and image set-based face and periocular recognition share several common characteristics. In this work, we consider recognition methods based on video or image sets that exploit periocular or facial data. Regarding image set-based recognition studies, previous wor...
[3] The first component is a deep convolutional network that acts as a feature extractor
PROPOSED APPROACH Our aggregation framework is composed of two main modules: the feature embedding and the aggregation (Figure 1). The first component is a deep convolutional network that acts as a feature extractor. The second part is the aggregation module, which combines the feature vectors of all video frames (query) to form a video-level feature ...
[4] To train and evaluate our aggregation module, we use the COX Face [12] database, which contains 1k subjects and 3k videos
EXPERIMENTAL RESULTS Datasets and protocols: The VGGFace2 [20] dataset which includes around 3.31M images of about 92k classes is used to train our feature embedding module for the still image-based periocular recognition task. To train and evaluate our aggregation module, we use the COX Face [12] database, which contains 1k subjects and 3k videos. We ad...
[5] CONCLUSIONS This paper presented a novel feature aggregation scheme for video periocular recognition that consistently outperforms all baseline methods in the COX Face dataset. The experimental results show statistically significant gains under both verification and identification protocols, demonstrating the effectiveness and robustness of the propos...
[6] Eduardo Luz, Gladston Moreira, Luiz Antonio Zanlorensi Junior, and David Menotti, “Deep periocular representation aiming video surveillance,” Pattern Recognition Letters, vol. 114, pp. 2–12, 2018.
[7] Min Cheol Kim, Ja Hyung Koo, Se Woon Cho, Na Rae Baek, and Kang Ryoung Park, “Convolutional neural network-based periocular recognition in surveillance environments,” IEEE Access, vol. 6, pp. 57291–57310, 2018.
[8] Philip E Miller, Jamie R Lyle, Shrinivas J Pundlik, and Damon L Woodard, “Performance evaluation of local appearance based periocular recognition,” in 2010 Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS). IEEE, 2010, pp. 1–6.
[9] Unsang Park, Arun Ross, and Anil K Jain, “Periocular biometrics in the visible spectrum: A feasibility study,” in 2009 IEEE 3rd International Conference on Biometrics: Theory, Applications, and Systems. IEEE, 2009, pp. 1–6.
[10] Zijing Zhao and Ajay Kumar, “Accurate periocular recognition under less constrained environment using semantics-assisted convolutional neural network,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 5, pp. 1017–1030, 2016.
[11] Veeru Talreja, Nasser M Nasrabadi, and Matthew C Valenti, “Attribute-based deep periocular recognition: Leveraging soft biometrics to improve periocular recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 4041–4050.
[12] Luiz A Zanlorensi, Rayson Laroca, Diego R Lucio, Lucas R Santos, Alceu S Britto Jr, and David Menotti, “A new periocular dataset collected by mobile devices in unconstrained scenarios,” Scientific Reports, vol. 12, no. 1, pp. 17989, 2022.
[13] Zhaoxiang Liu, Huan Hu, Jinqiang Bai, Shaohua Li, and Shiguo Lian, “Feature aggregation network for video face recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
[14] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1701–1708.
[15] Xuan Qi, Chen Liu, and Stephanie Schuckers, “CNN based key frame extraction for face in video recognition,” in 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA). IEEE, 2018, pp. 1–8.
[16] Xuan Qi, Chen Liu, and Stephanie Schuckers, “Boosting face in video recognition via CNN based key frame extraction,” in 2018 International Conference on Biometrics (ICB). IEEE, 2018, pp. 132–139.
[17] Zhiwu Huang, Shiguang Shan, Ruiping Wang, Haihong Zhang, Shihong Lao, Alifu Kuerban, and Xilin Chen, “A benchmark and comparative study of video-based face recognition on COX Face database,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5967–5981, 2015.
[18] Kuang-Chih Lee, Jeffrey Ho, Ming-Hsuan Yang, and David Kriegman, “Video-based face recognition using probabilistic appearance manifolds,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings. IEEE, 2003, vol. 1, pp. I–I.
[19] Ognjen Arandjelović, Gregory Shakhnarovich, John Fisher, Roberto Cipolla, and Trevor Darrell, “Face recognition with image sets using manifold density divergence,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). IEEE, 2005, vol. 1, pp. 581–588.
[20] Tae-Kyun Kim, Ognjen Arandjelović, and Roberto Cipolla, “Boosted manifold principal angles for image set-based recognition,” Pattern Recognition, vol. 40, no. 9, pp. 2475–2484, 2007.
[21] Ruiping Wang, Shiguang Shan, Xilin Chen, and Wen Gao, “Manifold-manifold distance with application to face recognition based on image set,” in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[22] Florian Schroff, Dmitry Kalenichenko, and James Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[23] Luiz Guilherme Fonseca Carreira, David Menotti, and William Robson Schwartz, “Dpr-v2s: A deep framework for periocular recognition in surveillance environments,” in 2024 37th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). IEEE, 2024, pp. 1–6.
[24] Jiaolong Yang, Peiran Ren, Dongqing Zhang, Dong Chen, Fang Wen, Hongdong Li, and Gang Hua, “Neural aggregation network for video face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4362–4371.
[25] Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018). IEEE, 2018, pp. 67–74.
