Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion

Cunrong Li; Xiaoyun Wang; Yabo Luo

arxiv: 2604.27353 · v1 · submitted 2026-04-30 · 💻 cs.CV

Pith reviewed 2026-05-07 08:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords gait recognition · skeleton-based biometrics · multi-branch feature fusion · residual networks · pose estimation · CASIA-B benchmark · viewpoint variation · clothing invariance

The pith

A multi-branch residual network fuses body shape, speed, and joint motion features from skeletal poses to improve gait recognition under clothing and view changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that gait recognition improves when three complementary streams—body proportions, walking velocity, and skeletal joint dynamics—are extracted from pose sequences and then combined by a fusion module that learns how much weight to give each stream. The streams come from HRNet pose estimates fed into a ResNet-50 backbone, and the fusion uses channel-wise attention to blend them. A sympathetic reader would care because gait offers a way to identify people at a distance without their cooperation and despite changes in clothing or carried items, which defeats many other biometrics. The authors test the approach on the CASIA-B dataset that includes normal walking, coat wearing, and bag carrying across multiple camera angles, reporting the highest Rank-1 score among skeleton-only methods in the coat condition.

Core claim

The authors claim that constructing three separate feature branches for body proportion, gait velocity, and skeletal motion, processing them with a 50-layer residual network, and integrating them through a Multi-Branch Feature Fusion module that learns branch contributions via activation parameters produces more discriminative gait representations than prior skeleton-based pipelines, yielding 94.52% Rank-1 accuracy on normal walking sequences and the strongest coat-wearing results on the CASIA-B cross-view benchmark.

What carries the argument

The Multi-Branch Feature Fusion (MFF) module, which applies channel-wise attention to learn and apply dynamic weights that combine the outputs of the body-proportion, gait-velocity, and skeletal-motion branches.
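
To make the mechanism concrete, here is a minimal sketch of squeeze-and-excitation style gating applied across branches. It is not the authors' MFF implementation: the function `fuse_branches`, the weight matrices `W1` and `W2`, and the choice of one gate per branch (rather than one per channel) are all assumptions made for illustration. The point it shows is what "input-dependent" means: the gates are recomputed from each sample's features.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_branches(branches, W1, W2):
    """Fuse per-branch feature vectors with input-dependent gates.

    branches: list of B feature vectors, each of length C.
    W1: H x B matrix, W2: B x H matrix (hypothetical learned weights).
    Returns (fused_vector, gates).
    """
    # Squeeze: summarize each branch by its mean activation.
    summary = [sum(f) / len(f) for f in branches]
    # Excite: a small bottleneck maps the summary to one gate per branch.
    hidden = [max(0.0, h) for h in matvec(W1, summary)]  # ReLU
    gates = [sigmoid(g) for g in matvec(W2, hidden)]
    # Reweight each branch by its gate and sum element-wise.
    channels = len(branches[0])
    fused = [sum(g * f[c] for g, f in zip(gates, branches))
             for c in range(channels)]
    return fused, gates


# Toy weights and two toy samples (three branches, two channels each).
W1 = [[1.0, -1.0, 0.5], [0.5, 0.5, -1.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
sample_a = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
sample_b = [[0.0, 0.0], [4.0, 2.0], [1.0, 1.0]]

_, gates_a = fuse_branches(sample_a, W1, W2)
_, gates_b = fuse_branches(sample_b, W1, W2)
print(gates_a)
print(gates_b)  # differs from gates_a because the gates depend on the input
```

With the same weights, the two samples receive different gates. A module whose branch weights were plain learned scalars would return the same weights for every sample.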

If this is right

The three-branch design extracts both static shape cues and dynamic motion cues that remain useful when clothing changes hide surface appearance.
The residual backbone supplies hierarchical features that capture fine spatial detail from the initial HRNet keypoint maps even at lower image resolutions.
Cross-view performance improves because the fusion step lets the network emphasize whichever branch is least affected by the current camera angle.
The reported 94.52% normal-walking accuracy and leading coat-wearing score establish a new reference point for skeleton-only gait systems on CASIA-B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same branch-and-fuse pattern could be tried on other time-varying skeletal tasks such as action classification or fall detection.
If the learned weights prove stable, the method might reduce reliance on hand-crafted gait descriptors in favor of end-to-end learned combinations.
Pairing the skeleton pipeline with low-resolution RGB input could test whether the fusion module still adds value when richer appearance cues are already available.
Re-training the fusion parameters on a larger, more diverse collection of walking videos would indicate how much the current numbers depend on CASIA-B specifics.

Load-bearing premise

The learned weights inside the fusion module will continue to assign useful contributions to each branch when the walking patterns differ from those seen during training on the CASIA-B collection.

What would settle it

Running the same trained model on a fresh gait dataset recorded with different cameras, subjects, or lighting and finding that its Rank-1 accuracy falls below that of the best competing skeleton-based method on the same new data.
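
For readers unfamiliar with the metric, the following is a minimal sketch of how Rank-1 accuracy is typically computed from embeddings. The function name `rank1_accuracy`, the Euclidean distance, and the toy data are assumptions for illustration. The sketch also omits the per-view probe/gallery breakdown that CASIA-B evaluations commonly use.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank1_accuracy(probe, gallery):
    """Fraction of probes whose nearest gallery embedding has the same identity.

    probe, gallery: lists of (subject_id, embedding) pairs.
    """
    if not probe or not gallery:
        raise ValueError("probe and gallery must be non-empty")
    hits = 0
    for subject_id, embedding in probe:
        nearest_id, _ = min(gallery,
                            key=lambda item: euclidean(embedding, item[1]))
        hits += int(nearest_id == subject_id)
    return hits / len(probe)


# Toy data, not taken from the paper.
gallery = [("A", [0.0, 0.0]), ("B", [10.0, 0.0])]
probe = [("A", [1.0, 0.0]), ("B", [9.0, 1.0]), ("B", [2.0, 0.0])]
print(rank1_accuracy(probe, gallery))  # two of the three probes match
```

The test proposed above amounts to running this computation for each method on the new dataset and comparing the resulting scores.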

Figures

Figures reproduced from arXiv: 2604.27353 by Cunrong Li, Xiaoyun Wang, Yabo Luo.

**Figure 1.** Overview of the proposed gait recognition framework. Raw walking videos are first processed by HRNet for …

Original abstract

Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches (body proportion, gait velocity, and skeletal motion) from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, with the best recognition performance among skeleton-based methods for the coat-wearing condition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a competent incremental assembly of existing components for skeleton-based gait recognition that posts strong CASIA-B numbers but leaves the dynamic fusion claim underspecified.

This paper is an incremental improvement on skeleton-based gait recognition. It reports strong numbers on CASIA-B, but it is built entirely from prior components and leaves its key fusion claim underspecified.

The authors use HRNet to extract skeletal keypoints, then build three branches for body proportion, gait velocity, and skeletal motion. Each branch passes through ResNet-50, and a Multi-Branch Feature Fusion module combines them using a form of channel attention with learned parameters. The headline result is 94.52% Rank-1 accuracy for normal walking, and the method beats other skeleton-based methods on the coat-wearing condition.

The strengths are the concrete benchmark results and the practical focus on clothing and carrying variations. The abstract describes the architecture clearly enough to follow the data flow.

The main weakness is that the abstract does not establish whether the fusion is truly dynamic. It says the module "dynamically allocates contribution weights across branches through learned activation parameters." If those parameters are fixed after training rather than computed from each input sample, the module does not adapt to individual gait sequences, and the performance edge would come from the branch design or the backbone rather than from the claimed synergistic fusion. The stress test note flags exactly this, and the abstract alone does not resolve it. No ablation studies or variance figures are visible here, so it is hard to judge whether the gains are reliable or the product of careful tuning on this one dataset.

All the pieces (HRNet, ResNet-50, channel attention) come from earlier work, so the contribution lies in their specific combination for gait. That is useful for applied researchers who need high accuracy on public gait benchmarks for surveillance applications; readers looking for novel architectures or theoretical insight will find little.

It is worth sending for peer review because the empirical claims are testable and the setup is standard enough that referees can evaluate the implementation details once the full paper is available.

Referee Report

3 major / 2 minor

Summary. The paper proposes a skeleton-based gait recognition framework that first uses HRNet to extract keypoints from video, derives three complementary branches (body proportion, gait velocity, skeletal motion) from the pose sequences, extracts features with a ResNet-50 backbone, and fuses them via a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention. The MFF is described as dynamically allocating branch weights through learned activation parameters. On the CASIA-B benchmark the method reports 94.52% Rank-1 accuracy under normal walking and the highest accuracy among skeleton-based approaches for the coat-wearing condition.

Significance. If the MFF module indeed supplies input-dependent, condition-adaptive fusion that measurably improves robustness to clothing and carrying covariates, the work would strengthen the case for multi-branch skeleton representations in gait biometrics. The reported CASIA-B numbers are competitive with recent skeleton methods, but the significance is currently limited by the absence of ablation evidence isolating the fusion contribution and by the lack of verification that the activation parameters are truly dynamic rather than globally learned scalars.

major comments (3)

[MFF module description] MFF module (description in abstract and corresponding architecture section): the claim that the module 'dynamically allocates contribution weights across branches through learned activation parameters' is load-bearing for the central attribution of coat-wearing gains to synergistic fusion. It is unclear whether the activation parameters are computed per-sample from the branch features (as in standard squeeze-excitation or cross-attention) or are a fixed set of scalars learned once during training. If the latter, the allocation is static and the superiority on covariate conditions cannot be credited to the claimed dynamic fusion mechanism.
[Experiments and ablation tables] Experimental section and tables: no ablation results are referenced that compare the full MFF-equipped model against (i) simple concatenation of the three branches, (ii) the ResNet-50 backbone alone, or (iii) the three branches with fixed (non-attention) weighting. Without these controls it is impossible to determine whether the reported 94.52% normal-walking Rank-1 and leading coat-wearing result are driven by the fusion module or by the branch design and backbone. In addition, the manuscript provides no error bars, multiple-run statistics, or significance tests for the cross-condition numbers.
[Results tables] Table reporting coat-wearing results: the abstract asserts 'best recognition performance among skeleton-based methods' for the coat condition, yet the manuscript does not supply the full comparative table with exact numbers and standard deviations from the competing skeleton methods. This makes the 'best' claim unverifiable from the given information and weakens the cross-condition robustness argument.
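
The controls requested in the second comment could be set up along the following lines. This is a sketch under stated assumptions, not the authors' experimental code: `concat_fusion`, `fixed_weight_fusion`, and `summarize_runs` are hypothetical helper names, and the scores in the usage lines are placeholders rather than results from the paper.

```python
import statistics


def concat_fusion(branches):
    """Control (i): concatenate branch features without any weighting."""
    return [x for features in branches for x in features]


def fixed_weight_fusion(branches, weights):
    """Control (iii): weighted sum with scalars that ignore the input."""
    if len(weights) != len(branches):
        raise ValueError("one weight per branch is required")
    channels = len(branches[0])
    return [sum(w * f[c] for w, f in zip(weights, branches))
            for c in range(channels)]


def summarize_runs(rank1_scores):
    """Mean and sample standard deviation over repeated training runs.

    Needs at least two scores, since a single run has no spread.
    """
    return statistics.mean(rank1_scores), statistics.stdev(rank1_scores)


# Toy features for three branches with two channels each.
branches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(concat_fusion(branches))
print(fixed_weight_fusion(branches, [0.5, 0.25, 0.25]))

# Placeholder scores from three hypothetical runs, not reported numbers.
mean, std = summarize_runs([0.90, 0.92, 0.94])
print(f"{mean:.3f} +/- {std:.3f}")
```

Reporting each fusion variant as mean and standard deviation over several seeds would show whether the gap between the full model and these controls exceeds run-to-run noise.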

minor comments (2)

[Branch construction] The precise mathematical definitions of the three input branches (body proportion, gait velocity, skeletal motion) are only sketched; explicit equations showing how each is derived from the HRNet keypoint sequences would improve reproducibility.
[Implementation details] The manuscript should state the training protocol (optimizer, learning-rate schedule, data augmentation, batch size) and whether any condition-specific hyper-parameter search was performed on the CASIA-B validation split.
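
To illustrate the kind of explicit definitions the branch-construction comment asks for, here is one plausible way to derive the three streams from 2D keypoints. These definitions are assumptions for illustration only. The paper's actual formulas are not given in the material reviewed here, and the four-joint skeleton, the joint indices, and the function names are hypothetical.

```python
import math

# Hypothetical joint indices in a simplified four-joint skeleton.
NECK, HIP, KNEE, ANKLE = 0, 1, 2, 3


def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def body_proportion(frame):
    """Thigh and shin lengths relative to torso length for one frame."""
    torso = dist(frame[NECK], frame[HIP])
    if torso == 0:
        raise ValueError("degenerate skeleton: zero torso length")
    return [dist(frame[HIP], frame[KNEE]) / torso,
            dist(frame[KNEE], frame[ANKLE]) / torso]


def gait_velocity(sequence, fps):
    """Speed of the hip joint between consecutive frames (pixels per second)."""
    return [dist(nxt[HIP], cur[HIP]) * fps
            for cur, nxt in zip(sequence, sequence[1:])]


def skeletal_motion(sequence):
    """Frame-to-frame displacement (dx, dy) of every joint."""
    return [[(q[0] - p[0], q[1] - p[1]) for p, q in zip(cur, nxt)]
            for cur, nxt in zip(sequence, sequence[1:])]


# Two toy frames of (x, y) keypoints, not real HRNet output.
frame0 = [(0.0, 0.0), (0.0, 4.0), (0.0, 6.0), (0.0, 8.0)]
frame1 = [(3.0, 0.0), (3.0, 4.0), (4.0, 6.0), (3.0, 8.0)]

print(body_proportion(frame0))
print(gait_velocity([frame0, frame1], fps=10))
print(skeletal_motion([frame0, frame1]))
```

Dividing by torso length makes the proportion features insensitive to the subject's distance from the camera, which is one reason ratio-based shape cues are attractive for cross-view settings.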

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the valuable comments and suggestions. Below, we provide a point-by-point response to the major comments. We have revised the manuscript to incorporate the necessary changes and clarifications.

Referee: [MFF module description] MFF module (description in abstract and corresponding architecture section): the claim that the module 'dynamically allocates contribution weights across branches through learned activation parameters' is load-bearing for the central attribution of coat-wearing gains to synergistic fusion. It is unclear whether the activation parameters are computed per-sample from the branch features (as in standard squeeze-excitation or cross-attention) or are a fixed set of scalars learned once during training. If the latter, the allocation is static and the superiority on covariate conditions cannot be credited to the claimed dynamic fusion mechanism.

Authors: We thank the referee for this important observation regarding the MFF module. The design of the MFF is intended to provide dynamic, input-dependent weighting, as the activation parameters are generated from the input branch features rather than being fixed scalars. To resolve any ambiguity, we have updated the manuscript's architecture description in Section 3 to include explicit details on how the weights are computed dynamically for each sample, along with the relevant equations. This revision ensures that the dynamic nature of the fusion is clearly documented and supports the attribution of performance gains to the fusion mechanism. revision: yes
Referee: [Experiments and ablation tables] Experimental section and tables: no ablation results are referenced that compare the full MFF-equipped model against (i) simple concatenation of the three branches, (ii) the ResNet-50 backbone alone, or (iii) the three branches with fixed (non-attention) weighting. Without these controls it is impossible to determine whether the reported 94.52% normal-walking Rank-1 and leading coat-wearing result are driven by the fusion module or by the branch design and backbone. In addition, the manuscript provides no error bars, multiple-run statistics, or significance tests for the cross-condition numbers.

Authors: We agree with the referee that additional ablation studies would strengthen the experimental validation. In the revised manuscript, we will include a new table presenting results for the full model versus (i) simple concatenation of branches, (ii) ResNet-50 on a single combined feature stream, and (iii) fixed weighting. Furthermore, we will report mean and standard deviation from multiple runs to provide statistical context for the results. revision: yes
Referee: [Results tables] Table reporting coat-wearing results: the abstract asserts 'best recognition performance among skeleton-based methods' for the coat condition, yet the manuscript does not supply the full comparative table with exact numbers and standard deviations from the competing skeleton methods. This makes the 'best' claim unverifiable from the given information and weakens the cross-condition robustness argument.

Authors: We agree that providing the full comparative data would make the claim more verifiable. We have added a comprehensive table in the revised version that includes the exact Rank-1 accuracies and, where available, standard deviations from all referenced skeleton-based methods on the coat-wearing condition. This allows readers to confirm our method's leading performance. revision: yes

Circularity Check

0 steps flagged

Empirical neural architecture with benchmark evaluation exhibits no circularity

The paper describes a standard deep learning pipeline: HRNet pose estimation followed by three hand-crafted feature branches, ResNet-50 extraction, and an MFF fusion module whose weights are learned during training. All performance claims are Rank-1 accuracies measured on held-out splits of the public CASIA-B benchmark. No mathematical derivation, uniqueness theorem, or first-principles prediction is offered; therefore no step can reduce to its own inputs by construction. The MFF description (“dynamically allocates … through learned activation parameters”) is an architectural claim whose correctness is tested empirically rather than assumed. No self-citation load-bearing steps appear in the abstract or described method.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions (i.i.d. train/test splits, cross-entropy loss, SGD optimization) plus the unstated premise that skeletal keypoints extracted by HRNet remain reliable under the covariate conditions tested. No new physical or mathematical axioms are introduced.

free parameters (2)

ResNet-50 and fusion-module weights
All network parameters are learned from CASIA-B training data; their values are not derived from first principles.
MFF attention scaling parameters
Learned activation parameters that control branch weighting are fitted during training.

axioms (2)

domain assumption: HRNet produces sufficiently accurate keypoints on low-resolution gait videos
Invoked when the paper states that HRNet preserves fine-grained spatial information even under low-resolution inputs.
domain assumption: The three chosen feature streams (body proportion, gait velocity, skeletal motion) are complementary and sufficient
The multi-branch design presupposes that these three streams together capture the necessary biometric cues.

pith-pipeline@v0.9.0 · 5553 in / 1524 out tokens · 51908 ms · 2026-05-07T08:27:18.436425+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 3 canonical work pages

[1]

Realtime multi-person 2D pose estimation using part affin- ity fields

Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affin- ity fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7291– 7299, 2017

2017
[2]

GaitSet: cross-view gait recognition through utilizing gait as a deep set.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3467–3478, 2022

Hanqing Chao, Kun Wang, Yiwei He, Jianping Zhang, and Jianfeng Feng. GaitSet: cross-view gait recognition through utilizing gait as a deep set.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3467–3478, 2022

2022
[3]

GaitGCI: generative counterfactual interven- tion for gait recognition

Huanzhang Dou, Pengyi Zhang, Wei Su, Yunlong Yu, Yiying Lin, and Xi Li. GaitGCI: generative counterfactual interven- tion for gait recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5765–5774, 2022

2022
[4]

OpenGait: revisiting gait recognition toward better practicality

Chao Fan, Junhao Liang, Chuanfu Shen, Saihui Hou, Yongzhen Huang, and Shiqi Yu. OpenGait: revisiting gait recognition toward better practicality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9707–9716, 2023

2023
[5]

SkeletonGait: gait recognition using skeleton maps

Chao Fan, Junzhe Ma, Dongyang Jin, Chuanfu Shen, and Shiqi Yu. SkeletonGait: gait recognition using skeleton maps. InProceedings of the 38th AAAI Conference on Arti- ficial Intelligence, pages 1662–1669, 2024

2024
[6]

GaitPart: temporal part-based model for gait recognition

Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. GaitPart: temporal part-based model for gait recognition. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 14213–14221, 2020

2020
[7]

DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile doc- ument understanding.Science China Information Sciences, 2024

Hao Feng, Qi Liu, Hao Liu, Jingqun Tang, Wei Zhou, Hao Li, and Can Huang. DocPedia: unleashing the power of large multimodal model in the frequency domain for versatile doc- ument understanding.Science China Information Sciences, 2024

2024
[8]

UniDoc: A universal large multimodal model for simultaneous text detection, recognition, spotting and understanding,

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wei Zhou, Hao Li, and Can Huang. UniDoc: a universal large multimodal model for simultaneous text detection, recognition, spotting and understanding. InarXiv preprint arXiv:2308.11592, 2023

work page arXiv 2023
[9]

Dolphin: document image parsing via heterogeneous anchor prompting

Hao Feng, Shuai Wei, Xiang Fei, Wenhui Shi, Yi Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, Jingqun Tang, et al. Dolphin: document image parsing via heterogeneous anchor prompting. InFindings of the As- sociation for Computational Linguistics: ACL 2025, pages 21919–21936, 2025

2025
[10]

GPGait: generalized pose-based gait recognition

Yang Fu, Shibei Meng, Saihui Hou, Xuecai Hu, and Yongzhen Huang. GPGait: generalized pose-based gait recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19538–19547, 2023

2023
[11]

Per- son re-identification method based on multi-branch fusion attention mechanism.Computer Engineering and Design, 43(8):2260–2267, 2022

Tong Guo, Qian Zhao, Yan Zhao, and Chenglong Wang. Per- son re-identification method based on multi-branch fusion attention mechanism.Computer Engineering and Design, 43(8):2260–2267, 2022

2022
[12]

Deep residual learning for image recognition.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016

2016
[13]

GaitDAN: cross-view gait recognition via adversarial domain adaptation.IEEE Transactions on Circuits and Systems for Video Technology, 34(9):8026–8040, 2024

Tianhao Huang, Xianye Ben, Chen Gong, Wenzheng Xu, Qiang Wu, and Huicheng Zhou. GaitDAN: cross-view gait recognition via adversarial domain adaptation.IEEE Transactions on Circuits and Systems for Video Technology, 34(9):8026–8040, 2024

2024
[14]

Context- sensitive temporal feature learning for gait recognition

Xiaohu Huang, Duowang Zhu, Hao Wang, Xinggang Wang, Bo Yang, Botao He, Wenyu Liu, and Bin Feng. Context- sensitive temporal feature learning for gait recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12909–12918, 2021

2021
[15]

Deep learn- ing based two-dimension human pose estimation: a critical analysis.Journal of Image and Graphics, 28(7):1965–1989, 2023

Yinghui Kong, Yinfeng Qin, and Ke Zhang. Deep learn- ing based two-dimension human pose estimation: a critical analysis.Journal of Image and Graphics, 28(7):1965–1989, 2023

1965
[16]

Hierarchical feature fusion attention network for im- age super-resolution reconstruction.Journal of Image and Graphics, 25(9):1773–1786, 2020

Pengcheng Lei, Cong Liu, Jiangang Tang, and Dunlu Peng. Hierarchical feature fusion attention network for im- age super-resolution reconstruction.Journal of Image and Graphics, 25(9):1773–1786, 2020

2020
[17]

Gait recognition via semi-supervised disen- tangled representation learning to identity and covariate fea- tures

Xiang Li, Yasushi Makihara, Chi Xu, Yasushi Yagi, and Mingwu Ren. Gait recognition via semi-supervised disen- tangled representation learning to identity and covariate fea- tures. pages 13309–13319, 2020

2020
[18]

Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing vari- ations

Rijun Liao, Chunshui Cao, Edel B Garcia, Shiqi Yu, and Yongzhen Huang. Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing vari- ations. InProceedings of the 12th Chinese Conference on Biometric Recognition, pages 474–483, 2017

2017
[19]

Gait recognition via effective global-local feature representation and local tem- poral aggregation

Beibei Lin, Shunli Zhang, and Xin Yu. Gait recognition via effective global-local feature representation and local tem- poral aggregation. InProceedings of the IEEE/CVF Interna- tional Conference on Computer Vision, pages 14648–14656, 2021

2021
[20]

SPTS v2: single-point scene text spotting.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):15477–15493, 2023

Yuliang Liu, Jiaxin Zhang, Dezhi Peng, Mingxin Huang, Xinyu Wang, Jingqun Tang, Can Huang, Dahua Lin, Chun- hua Shen, Xiang Bai, et al. SPTS v2: single-point scene text spotting.IEEE Transactions on Pattern Analysis and Ma- chine Intelligence, 45(12):15477–15493, 2023

2023
[21]

A bounding box is worth one token: interleav- ing layout and text in a large language model for document understanding

Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, et al. A bounding box is worth one token: interleav- ing layout and text in a large language model for document understanding. InFindings of the Association for Computa- tional Linguistics: ACL 2025, pages 7252–7273, 2025

2025
[22]

Gait recognition: a comprehensive survey on methods, datasets and evaluation metrics.IEEE Access, 11:83098–83120, 2023

Jashila Nair Mogan, Chin Poo Lee, and Kian Ming Lim. Gait recognition: a comprehensive survey on methods, datasets and evaluation metrics.IEEE Access, 11:83098–83120, 2023

2023
[23]

Learn- ing rich features for gait recognition by integrating skeletons and silhouettes

Yunjie Peng, Chao Fan, Chuanfu Shen, and Shiqi Yu. Learn- ing rich features for gait recognition by integrating skeletons and silhouettes. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4559–4567, 2024

2024
[24]

Deep gait recognition: a survey.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(1):264–284, 2023

Alireza Sepas-Moghaddam and Ali Etemad. Deep gait recognition: a survey.IEEE Transactions on Pattern Analy- sis and Machine Intelligence, 45(1):264–284, 2023

2023
[25]

MCTBench: Multi- modal cognition towards text-rich visual scenes bench- mark,

Biluo Shan, Xiang Fei, Wenhui Shi, Aolin Wang, Guozhi Tang, Lei Liao, Jingqun Tang, Xiang Bai, and Can Huang. MCTBench: multimodal cognition towards text-rich visual scenes benchmark.arXiv preprint arXiv:2410.11538, 2024. 10

work page arXiv 2024
[26]

LidarGait: benchmarking 3D gait recognition with point clouds

Chuanfu Shen, Chao Fan, Wei Wu, Rui Wang, George Q Huang, and Shiqi Yu. LidarGait: benchmarking 3D gait recognition with point clouds. pages 1054–1063, 2023

2023
[27]

GaitNet: an end-to-end network for gait based human identification.Pattern Recognition, 96:106988, 2019

Chunfeng Song, Yongzhen Huang, Wanli Ouyang, and Liang Wang. GaitNet: an end-to-end network for gait based human identification.Pattern Recognition, 96:106988, 2019

2019
[28]

Deep high-resolution representation learning for human pose es- timation

Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose es- timation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5686– 5696, 2019

2019
[29]

TextSquare: Scaling up text-centric visual instruction tuning,

Jingqun Tang, Chunhui Lin, Zhen Zhao, Shuai Wei, Binghong Wu, Qi Liu, Yuang He, Kaixuan Lu, Hao Feng, Yu Li, et al. TextSquare: scaling up text-centric visual in- struction tuning.arXiv preprint arXiv:2404.12803, 2024

work page arXiv 2024
[30]

MTVQA: benchmarking multilingual text-centric vi- sual question answering

Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shuai Wei, Aolin Wang, Chunhui Lin, Hao Feng, Zhen Zhao, et al. MTVQA: benchmarking multilingual text-centric vi- sual question answering. InFindings of the Association for Computational Linguistics: ACL 2025, pages 7748–7763, 2025

2025
[31]

Optimal boxes: boosting end-to- end scene text recognition by adjusting annotated bound- ing boxes via reinforcement learning

Jingqun Tang, Wenqing Qian, Lei Song, Xiaolong Dong, Lan Li, and Xiang Bai. Optimal boxes: boosting end-to- end scene text recognition by adjusting annotated bound- ing boxes via reinforcement learning. InProceedings of the European Conference on Computer Vision, pages 233–248, 2022

2022
[32]

Few could be better than all: Feature sampling and grouping for scene text detection

Jingqun Tang, Wenqing Zhang, Hao Liu, Min-Kuan Yang, Bo Jiang, Guangliang Hu, and Xiang Bai. Few could be better than all: Feature sampling and grouping for scene text detection. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4563– 4572, 2022

2022
[33]

Towards a deeper understand- ing of skeleton-based gait recognition

Torben Teepe, Johannes Gilg, Fabian Herzog, Stefan H¨ormann, and Gerhard Rigoll. Towards a deeper understand- ing of skeleton-based gait recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 1569–1577, 2022

2022
[34]

GaitGraph: graph con- volutional network for skeleton-based gait recognition

Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Ste- fan H ¨ormann, and Gerhard Rigoll. GaitGraph: graph con- volutional network for skeleton-based gait recognition. In Proceedings of the IEEE International Conference on Image Processing, pages 2314–2318, 2021

2021
[35]

PARGO: bridging vision-language with partial and global views

Aolin Wang, Biluo Shan, Wenhui Shi, Kevin Yi Lin, Xiang Fei, Guozhi Tang, Lei Liao, Jingqun Tang, Can Huang, et al. PARGO: bridging vision-language with partial and global views. InProceedings of the 38th AAAI Conference on Arti- ficial Intelligence, 2025

2025
[36]

WildDoc: how far are we from achieving comprehensive and robust document understanding in the wild? 2025

Aolin Wang, Jingqun Tang, Lei Liao, Hao Feng, Qi Liu, Xi- ang Fei, Jinghui Lu, Han Wang, Hao Liu, Yuliang Liu, et al. WildDoc: how far are we from achieving comprehensive and robust document understanding in the wild? 2025

2025
[37]

Learning discriminative features with multiple gran- ularities for person re-identification

Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple gran- ularities for person re-identification. InProceedings of the 26th ACM International Conference on Multimedia, pages 274–282, 2018

2018
[38]

DyGait: exploiting dynamic representations for high-performance gait recognition

Ming Wang, Xianda Guo, Beibei Lin, Tian Yang, Xin Yu, Shunli Zhang, and Xin Yu. DyGait: exploiting dynamic representations for high-performance gait recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13424–13433, 2023

2023
[39]

BigGait: learning gait representation you want by large vision models

Dingqiang Ye, Chao Fan, Junzhe Ma, Xiaoming Liu, and Shiqi Yu. BigGait: learning gait representation you want by large vision models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 200–210, 2024

2024
[40]

A Hough transform based method for gait feature extraction.Journal of Image and Graphics, 10(10):1304–1309, 2005

Jing Yu, Juan Duan, and Kaina Su. A Hough transform based method for gait feature extraction.Journal of Image and Graphics, 10(10):1304–1309, 2005

[41] Shiqi Yu, Daoliang Tan, and Tieniu Tan. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In Proceedings of the 18th International Conference on Pattern Recognition, pages 441–444, 2006

[42] Weichao Zhao, Hao Feng, Qi Liu, Jingqun Tang, Shuai Wei, Binghong Wu, Lei Liao, Yongjie Ye, Hao Liu, Wei Zhou, et al. TabPedia: towards comprehensive visual table understanding with concept synergy. In Advances in Neural Information Processing Systems, 2024

[43] Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Hao Liu, Zeming Zhang, Xin Tan, Can Huang, and Yuan Xie. Multi-modal in-context learning makes an ego-evolving scene text recognizer. pages 15756–15766, 2023

[44] Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shuai Wei, Hao Liu, Xin Tan, Zeming Zhang, Can Huang, and Yuan Xie. Harmonizing visual text comprehension and generation. In Advances in Neural Information Processing Systems, 2024

[45] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. Gait recognition in the wild with multi-hop temporal switch. pages 3111–3119, 2022

[46] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. Parsing is all you need for accurate gait recognition in the wild. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3603–3612, 2023

[47] Zheng Zhu, Xianda Guo, Tian Yang, Junge Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, and Jie Zhou. Gait recognition in the wild: A large-scale benchmark and NAS-based baseline. pages 14789–14798, 2021
