arxiv: 2604.27353 · v1 · submitted 2026-04-30 · 💻 cs.CV

Gait Recognition via Deep Residual Networks and Multi-Branch Feature Fusion

Pith reviewed 2026-05-07 08:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords gait recognition · skeleton-based biometrics · multi-branch feature fusion · residual networks · pose estimation · CASIA-B benchmark · viewpoint variation · clothing invariance

The pith

A multi-branch residual network fuses body shape, speed, and joint motion features from skeletal poses to improve gait recognition under clothing and view changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that gait recognition improves when three complementary streams (body proportions, walking velocity, and skeletal joint dynamics) are extracted from pose sequences and then combined by a fusion module that learns how much weight to give each stream. The streams are derived from HRNet pose estimates and passed through a ResNet-50 backbone, and the fusion module, inspired by channel-wise attention, blends them. A sympathetic reader would care because gait offers a way to identify people at a distance, without their cooperation, and despite changes in clothing or carried items that defeat many other biometrics. The authors test the approach on the CASIA-B dataset, which includes normal walking, coat wearing, and bag carrying across multiple camera angles, and report the highest Rank-1 score among skeleton-only methods in the coat condition.

Core claim

The authors claim that constructing three separate feature branches for body proportion, gait velocity, and skeletal motion, processing them with a 50-layer residual network, and integrating them through a Multi-Branch Feature Fusion module that learns branch contributions via activation parameters produces more discriminative gait representations than prior skeleton-based pipelines, yielding 94.52% Rank-1 accuracy on normal walking sequences and the strongest coat-wearing results on the CASIA-B cross-view benchmark.

What carries the argument

The Multi-Branch Feature Fusion (MFF) module, which applies channel-wise attention to learn and apply dynamic weights that combine the outputs of the body-proportion, gait-velocity, and skeletal-motion branches.
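
The fusion idea described here can be illustrated with a minimal squeeze-and-excitation style sketch in plain Python. This is not the authors' implementation: the function name `mff_fuse`, the gate weights `w1` and `w2`, the choice of one gate per branch (rather than per channel), and all toy values are assumptions made for illustration, since the summary does not give the module's equations.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mff_fuse(branches, w1, w2):
    """Fuse equal-length branch feature vectors with per-sample gates.

    branches: list of B feature vectors (lists of floats).
    w1 (H x B) and w2 (B x H): hypothetical gate weights, normally learned.
    """
    # Squeeze: summarize each branch by its mean activation.
    squeezed = [sum(f) / len(f) for f in branches]
    # Excite: small bottleneck (ReLU, then sigmoid) yielding one gate per branch.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    # Scale each branch by its gate and sum element-wise.
    dim = len(branches[0])
    fused = [sum(g * f[i] for g, f in zip(gates, branches)) for i in range(dim)]
    return fused, gates

# Toy inputs (illustrative values only): proportion, velocity, motion features.
branches = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]
w1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
w2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
fused, gates = mff_fuse(branches, w1, w2)
print(gates)
```

Because the gates are computed from the branch features themselves, a different input sequence yields different branch weights, which is the property the word "dynamic" would require.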

If this is right

  • The three-branch design extracts both static shape cues and dynamic motion cues that remain useful when clothing changes hide surface appearance.
  • The residual backbone supplies hierarchical features that capture fine spatial detail from the initial HRNet keypoint maps even at lower image resolutions.
  • Cross-view performance improves because the fusion step lets the network emphasize whichever branch is least affected by the current camera angle.
  • The reported 94.52% normal-walking accuracy and leading coat-wearing score establish a new reference point for skeleton-only gait systems on CASIA-B.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same branch-and-fuse pattern could be tried on other time-varying skeletal tasks such as action classification or fall detection.
  • If the learned weights prove stable, the method might reduce reliance on hand-crafted gait descriptors in favor of end-to-end learned combinations.
  • Pairing the skeleton pipeline with low-resolution RGB input could test whether the fusion module still adds value when richer appearance cues are already available.
  • Re-training the fusion parameters on a larger, more diverse collection of walking videos would indicate how much the current numbers depend on CASIA-B specifics.

Load-bearing premise

The learned weights inside the fusion module will continue to assign useful contributions to each branch when the walking patterns differ from those seen during training on the CASIA-B collection.

What would settle it

Running the same trained model on a fresh gait dataset recorded with different cameras, subjects, or lighting and finding that its Rank-1 accuracy falls below that of the best competing skeleton-based method on the same new data.
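
For readers unfamiliar with the metric, Rank-1 accuracy can be sketched as nearest-neighbour matching between probe and gallery embeddings. The snippet below is a simplified illustration, assuming cosine distance and toy two-dimensional embeddings; the paper's actual distance metric and the view-exclusion rules of the CASIA-B protocol are not reproduced here.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank1_accuracy(probe, gallery):
    """probe and gallery are lists of (subject_id, embedding) pairs."""
    hits = 0
    for subject_id, embedding in probe:
        # The single closest gallery entry decides the predicted identity.
        nearest_id, _ = min(gallery, key=lambda item: cosine_distance(embedding, item[1]))
        hits += int(nearest_id == subject_id)
    return hits / len(probe)

# Hypothetical embeddings: two enrolled subjects, three probe sequences.
gallery = [("A", [1.0, 0.0]), ("B", [0.0, 1.0])]
probe = [("A", [0.9, 0.1]), ("B", [0.2, 0.8]), ("A", [0.4, 0.6])]
accuracy = rank1_accuracy(probe, gallery)
print(f"Rank-1 accuracy: {accuracy:.2%}")
```

The proposed test amounts to computing this number for each competing method on the same unseen gallery and probe sets and comparing the results.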

Figures

Figures reproduced from arXiv: 2604.27353 by Cunrong Li, Xiaoyun Wang, Yabo Luo.

Figure 1: Overview of the proposed gait recognition framework. Raw walking videos are first processed by HRNet for …

Original abstract

Gait recognition has emerged as a compelling biometric modality for surveillance and security applications, offering inherent advantages such as non-intrusiveness, resistance to disguise, and long-range identification capability. However, prevailing approaches struggle to comprehensively capture and exploit the rich biometric cues embedded in human locomotion, particularly under covariate interference including viewpoint variation, clothing change, and carrying conditions. In this paper, we present a high-precision gait recognition framework that deeply extracts and synergistically fuses gait dynamics with body shape characteristics through a multi-branch architecture grounded in deep residual learning. Specifically, we first employ the High-Resolution Network (HRNet) to perform robust skeletal keypoint estimation, preserving fine-grained spatial information even under low-resolution inputs. We then construct three complementary feature branches (body proportion, gait velocity, and skeletal motion) from the extracted pose sequences. A 50-layer Residual Network (ResNet-50) backbone is leveraged within a deep feature extraction module to capture hierarchically rich and discriminative representations. To effectively integrate heterogeneous feature streams, we design a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention mechanisms, which dynamically allocates contribution weights across branches through learned activation parameters. Extensive experiments on the cross-view multi-condition CASIA-B benchmark demonstrate that our method achieves a Rank-1 accuracy of 94.52% under normal walking, with the best recognition performance among skeleton-based methods for the coat-wearing condition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a skeleton-based gait recognition framework that first uses HRNet to extract keypoints from video, derives three complementary branches (body proportion, gait velocity, skeletal motion) from the pose sequences, extracts features with a ResNet-50 backbone, and fuses them via a Multi-Branch Feature Fusion (MFF) module inspired by channel-wise attention. The MFF is described as dynamically allocating branch weights through learned activation parameters. On the CASIA-B benchmark the method reports 94.52% Rank-1 accuracy under normal walking and the highest accuracy among skeleton-based approaches for the coat-wearing condition.

Significance. If the MFF module indeed supplies input-dependent, condition-adaptive fusion that measurably improves robustness to clothing and carrying covariates, the work would strengthen the case for multi-branch skeleton representations in gait biometrics. The reported CASIA-B numbers are competitive with recent skeleton methods, but the significance is currently limited by the absence of ablation evidence isolating the fusion contribution and by the lack of verification that the activation parameters are truly dynamic rather than globally learned scalars.

major comments (3)
  1. [MFF module description] MFF module (description in abstract and corresponding architecture section): the claim that the module 'dynamically allocates contribution weights across branches through learned activation parameters' is load-bearing for the central attribution of coat-wearing gains to synergistic fusion. It is unclear whether the activation parameters are computed per-sample from the branch features (as in standard squeeze-excitation or cross-attention) or are a fixed set of scalars learned once during training. If the latter, the allocation is static and the superiority on covariate conditions cannot be credited to the claimed dynamic fusion mechanism.
  2. [Experiments and ablation tables] Experimental section and tables: no ablation results are referenced that compare the full MFF-equipped model against (i) simple concatenation of the three branches, (ii) the ResNet-50 backbone alone, or (iii) the three branches with fixed (non-attention) weighting. Without these controls it is impossible to determine whether the reported 94.52% normal-walking Rank-1 and leading coat-wearing result are driven by the fusion module or by the branch design and backbone. In addition, the manuscript provides no error bars, multiple-run statistics, or significance tests for the cross-condition numbers.
  3. [Results tables] Table reporting coat-wearing results: the abstract asserts 'best recognition performance among skeleton-based methods' for the coat condition, yet the manuscript does not supply the full comparative table with exact numbers and standard deviations from the competing skeleton methods. This makes the 'best' claim unverifiable from the given information and weakens the cross-condition robustness argument.
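
The distinction drawn in major comments 1 and 2 can be made concrete with a small sketch contrasting the requested controls. All function names and values below are hypothetical; the softmax gate is a stand-in for whatever per-sample mechanism the paper uses, which the report says is not specified.

```python
import math

def fuse_concat(branches):
    # Control (i): plain concatenation, no weighting at all.
    return [x for features in branches for x in features]

def fuse_fixed(branches, weights):
    # Control (iii): one scalar per branch, identical for every input.
    dim = len(branches[0])
    return [sum(w * f[i] for w, f in zip(weights, branches)) for i in range(dim)]

def dynamic_gates(branches):
    # Per-sample gates: softmax over each branch's mean activation.
    squeezed = [sum(f) / len(f) for f in branches]
    peak = max(squeezed)
    exps = [math.exp(s - peak) for s in squeezed]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_dynamic(branches):
    # Same weighted sum as fuse_fixed, but the weights depend on the input.
    return fuse_fixed(branches, dynamic_gates(branches))

# Two toy samples in which a different branch carries the signal.
sample_1 = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
sample_2 = [[0.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
fixed_weights = [0.5, 0.3, 0.2]  # hypothetical scalars learned once

gates_1 = dynamic_gates(sample_1)
gates_2 = dynamic_gates(sample_2)
print("fixed  :", fixed_weights, fixed_weights)
print("dynamic:", gates_1, gates_2)
```

With fixed scalars, both samples are weighted identically; with input-dependent gates, the dominant branch shifts from the first to the third. An ablation along these lines would show whether that shift, rather than the branch design alone, accounts for the coat-wearing gains.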
minor comments (2)
  1. [Branch construction] The precise mathematical definitions of the three input branches (body proportion, gait velocity, skeletal motion) are only sketched; explicit equations showing how each is derived from the HRNet keypoint sequences would improve reproducibility.
  2. [Implementation details] The manuscript should state the training protocol (optimizer, learning-rate schedule, data augmentation, batch size) and whether any condition-specific hyper-parameter search was performed on the CASIA-B validation split.
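
To show the kind of explicit definition minor comment 1 asks for, here is one plausible way the three branches might be computed from 2D keypoint sequences. These formulas are assumptions for illustration only (bone-length ratios, root-joint speed, and per-joint frame differences); the paper's own definitions are not given in this summary, and the joint indices and `fps` value are hypothetical.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def body_proportion(frame, bones, ref_bone):
    # Assumed definition: each bone length divided by a reference length.
    ref = dist(frame[ref_bone[0]], frame[ref_bone[1]])
    return [dist(frame[a], frame[b]) / ref for a, b in bones]

def gait_velocity(seq, root, fps):
    # Assumed definition: root-joint speed between consecutive frames.
    return [dist(seq[t + 1][root], seq[t][root]) * fps for t in range(len(seq) - 1)]

def skeletal_motion(seq):
    # Assumed definition: per-joint displacement between consecutive frames.
    return [
        [(b[0] - a[0], b[1] - a[1]) for a, b in zip(prev, curr)]
        for prev, curr in zip(seq, seq[1:])
    ]

# Toy sequence: joints 0 = hip, 1 = knee, 2 = ankle, in pixel coordinates.
frame_0 = [(0.0, 0.0), (0.0, 3.0), (0.0, 7.0)]
frame_1 = [(3.0, 4.0), (3.0, 7.0), (3.0, 11.0)]
seq = [frame_0, frame_1]

proportions = body_proportion(frame_0, bones=[(0, 1), (1, 2)], ref_bone=(0, 2))
velocity = gait_velocity(seq, root=0, fps=2)
motion = skeletal_motion(seq)
print(proportions, velocity, motion)
```

Stating the branches at this level of precision, whatever the actual formulas are, would let others reproduce the inputs to the backbone.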

Simulated Author's Rebuttal

3 responses · 0 unresolved

We sincerely thank the referee for the valuable comments and suggestions. Below, we provide a point-by-point response to the major comments. We have revised the manuscript to incorporate the necessary changes and clarifications.

  1. Referee: [MFF module description] MFF module (description in abstract and corresponding architecture section): the claim that the module 'dynamically allocates contribution weights across branches through learned activation parameters' is load-bearing for the central attribution of coat-wearing gains to synergistic fusion. It is unclear whether the activation parameters are computed per-sample from the branch features (as in standard squeeze-excitation or cross-attention) or are a fixed set of scalars learned once during training. If the latter, the allocation is static and the superiority on covariate conditions cannot be credited to the claimed dynamic fusion mechanism.

    Authors: We thank the referee for this important observation regarding the MFF module. The design of the MFF is intended to provide dynamic, input-dependent weighting, as the activation parameters are generated from the input branch features rather than being fixed scalars. To resolve any ambiguity, we have updated the manuscript's architecture description in Section 3 to include explicit details on how the weights are computed dynamically for each sample, along with the relevant equations. This revision ensures that the dynamic nature of the fusion is clearly documented and supports the attribution of performance gains to the fusion mechanism.
    revision: yes

  2. Referee: [Experiments and ablation tables] Experimental section and tables: no ablation results are referenced that compare the full MFF-equipped model against (i) simple concatenation of the three branches, (ii) the ResNet-50 backbone alone, or (iii) the three branches with fixed (non-attention) weighting. Without these controls it is impossible to determine whether the reported 94.52% normal-walking Rank-1 and leading coat-wearing result are driven by the fusion module or by the branch design and backbone. In addition, the manuscript provides no error bars, multiple-run statistics, or significance tests for the cross-condition numbers.

    Authors: We agree with the referee that additional ablation studies would strengthen the experimental validation. In the revised manuscript, we will include a new table presenting results for the full model versus (i) simple concatenation of branches, (ii) ResNet-50 on a single combined feature stream, and (iii) fixed weighting. Furthermore, we will report mean and standard deviation from multiple runs to provide statistical context for the results.
    revision: yes

  3. Referee: [Results tables] Table reporting coat-wearing results: the abstract asserts 'best recognition performance among skeleton-based methods' for the coat condition, yet the manuscript does not supply the full comparative table with exact numbers and standard deviations from the competing skeleton methods. This makes the 'best' claim unverifiable from the given information and weakens the cross-condition robustness argument.

    Authors: We agree that providing the full comparative data would make the claim more verifiable. We have added a comprehensive table in the revised version that includes the exact Rank-1 accuracies and, where available, standard deviations from all referenced skeleton-based methods on the coat-wearing condition. This allows readers to confirm our method's leading performance.
    revision: yes

Circularity Check

0 steps flagged

Empirical neural architecture with benchmark evaluation exhibits no circularity

The paper describes a standard deep learning pipeline: HRNet pose estimation followed by three hand-crafted feature branches, ResNet-50 extraction, and an MFF fusion module whose weights are learned during training. All performance claims are Rank-1 accuracies measured on held-out splits of the public CASIA-B benchmark. No mathematical derivation, uniqueness theorem, or first-principles prediction is offered; therefore no step can reduce to its own inputs by construction. The MFF description (“dynamically allocates … through learned activation parameters”) is an architectural claim whose correctness is tested empirically rather than assumed. No self-citation load-bearing steps appear in the abstract or described method.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions (i.i.d. train/test splits, cross-entropy loss, SGD optimization) plus the unstated premise that skeletal keypoints extracted by HRNet remain reliable under the covariate conditions tested. No new physical or mathematical axioms are introduced.

free parameters (2)
  • ResNet-50 and fusion-module weights
    All network parameters are learned from CASIA-B training data; their values are not derived from first principles.
  • MFF attention scaling parameters
    Learned activation parameters that control branch weighting are fitted during training.
axioms (2)
  • domain assumption: HRNet produces sufficiently accurate keypoints on low-resolution gait videos
    Invoked when the paper states that HRNet preserves fine-grained spatial information even under low-resolution inputs.
  • domain assumption: The three chosen feature streams (body proportion, gait velocity, skeletal motion) are complementary and sufficient
    The multi-branch design presupposes that these three streams together capture the necessary biometric cues.
