arxiv: 2605.18192 · v1 · pith:3UYDIHQWnew · submitted 2026-05-18 · 💻 cs.CV

View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification

Pith reviewed 2026-05-20 11:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords aerial-ground person re-identification, view-aware semantic alignment, expert-driven token generation, dual-branch local fusion, cross-view consistency, viewpoint variations, graph reasoning, person re-identification

The pith

View-aware semantic alignment using expert queries and graph fusion improves cross-view person re-identification over view-invariant methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard view-invariant alignment in aerial-ground person re-identification discards useful viewpoint-specific details that help distinguish identities. It introduces ViSA to maintain cross-view semantic consistency instead, by generating adaptive queries from view-aware experts and fusing local regions through graph reasoning. This design is meant to keep more discriminative identity information intact despite large viewpoint shifts between drones and fixed cameras. Experiments on three benchmarks show consistent gains, including a 10.06 percent mAP rise on the hard CARGO cross-view setting. The result matters for applications like surveillance where the same person must be matched across aerial and ground footage.

Core claim

The central claim is that a view-aware framework achieves better cross-view semantic consistency than view-invariant paradigms by constructing view-aware experts to generate adaptive semantic queries that capture viewpoint-specific patterns, then using graph reasoning in a dual-branch module to extract and align responsive local regions, thereby retaining more discriminative identity information.

What carries the argument

ViSA framework with an Expert-driven Token Generation Module that builds view-aware experts for adaptive semantic queries and a Dual-branch Local Fusion Module that applies graph reasoning to align local regions.
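
As a rough illustration of what "view-aware experts generating adaptive semantic queries" could look like, the sketch below routes a global image feature to the k best-scoring of E experts and blends their learnable queries. The names E and k follow the paper's Figure 8 caption; everything else (the dot-product gate, the softmax blend, the function and argument names) is an assumption made for illustration and is not taken from the ViSA code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def route_query(feature, gate_weights, expert_queries, k=1):
    """Blend the k highest-scoring expert queries into one adaptive query.

    feature:        global image feature, length d
    gate_weights:   one gating vector per expert (hypothetical), each length d
    expert_queries: one learnable query per expert (hypothetical), each length d
    Returns the blended query and the indices of the selected experts.
    """
    # Score every expert against the input feature (assumed dot-product gate).
    logits = [dot(feature, w) for w in gate_weights]
    # Keep only the k experts with the largest scores.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalise the scores of the selected experts.
    weights = softmax([logits[i] for i in top])
    d = len(expert_queries[0])
    query = [0.0] * d
    for w, i in zip(weights, top):
        for j in range(d):
            query[j] += w * expert_queries[i][j]
    return query, top
```

With k = 1, the setting the paper's parameter study fixes when varying E, the blend reduces to picking a single expert's query. An aerial and a ground image of the same person may then be routed to different experts, which is the sense in which the queries are view-aware.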

If this is right

  • Superior matching accuracy on AG-ReID.v2, CARGO, and LAGPeR benchmarks for aerial-ground person pairs.
  • A 10.06 percent mAP gain on the challenging CARGO cross-view evaluation protocol.
  • Improved extraction of local identity cues that respond differently to drone versus ground viewpoints.
  • Cross-view semantic consistency that still respects viewpoint-specific patterns rather than discarding them.
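
For readers unfamiliar with the metric behind the headline number, here is a minimal sketch of mean average precision (mAP) over ranked retrieval lists. It assumes each query comes with a gallery already sorted by descending similarity, and it omits the protocol-specific gallery filtering that individual benchmarks apply. The function names are illustrative only.

```python
def average_precision(ranked_ids, query_id):
    """AP for one query: mean precision at each rank where a true match appears."""
    hits = 0
    precisions = []
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / rank)
    # A query with no true match in the gallery contributes 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings):
    """rankings: list of (query_id, ranked_gallery_ids) pairs."""
    aps = [average_precision(ranked, qid) for qid, ranked in rankings]
    return sum(aps) / len(aps) if aps else 0.0
```

Because mAP rewards placing every true match early in the list, not just the first one, a gain on a cross-view protocol suggests that correct aerial or ground counterparts moved up throughout the ranking.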

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expert-query and graph-fusion pattern could be tested on vehicle re-identification across satellite and street views.
  • If the modules generalize, they might reduce the need for massive viewpoint-augmented training sets in other multi-camera settings.
  • Scaling the number of view-aware experts could be checked for diminishing returns when more than two camera platforms are involved.

Load-bearing premise

Preserving view-specific cues through adaptive expert queries and graph-based local alignment will retain more discriminative identity information than enforcing view-invariant alignment, without creating new inconsistencies in identity matching.

What would settle it

If a carefully tuned view-invariant baseline matches or exceeds ViSA's mAP on the CARGO cross-view protocol while using similar compute, the advantage of retaining view-specific cues would be called into question.

Figures

Figures reproduced from arXiv: 2605.18192 by Cailun Wu, Hongbo Chen, Jianhuang Lai, Jingze Wu, Peiming Zhao, Quan Zhang, Zeqiang Cai.

Figure 1: Comparison of view-invariant and view-aware learning.
Figure 2: Overview of the proposed ViSA. The transformer backbone first separates intrinsic identity semantics from extrinsic viewpoint …
Figure 3: Modules detail of the Token Expert Module and Graph Learning Module. The Token Expert Module generates view-aware expert …
Figure 4: Analysis of hyper-parameters under ‘ALL’ protocol on the CARGO dataset. Rank-1 and mAP and their average are reported (%).
Figure 5: Comparison of several retrieval visualizations on AG …
Figure 6: Comparison of several retrieval visualizations on …
Figure 7: t-SNE visualization. Left: Baseline. Right: ViSA. Colors indicate different classes and shapes denote different views.
Figure 8: Analysis of E, k, and λ across different protocols on the CARGO dataset. Rank-1 and mAP and their average are reported (%). E represents the number of experts. k represents number of selected experts. λ represents the balancing coefficient.
Abstract

Aerial-Ground Person Re-Identification (AGPReID) remains highly challenging due to drastic viewpoint variations between drones and fixed cameras. Existing methods typically follow a view-invariant paradigm, aligning shared features across views to achieve robustness. However, the view-invariant paradigm inherently enforces part-level alignment, which ignores view-specific cues and discriminative identity information. To this end, this work proposes ViSA (View-aware Semantic Alignment), a view-aware framework that achieves cross-view semantic consistency and comprises an Expert-driven Token Generation Module (ETGM) and a Dual-branch Local Fusion Module (DLFM). Technically, the former constructs a set of view-aware experts to generate adaptive semantic queries that perceive viewpoint-specific patterns, while the latter leverages graph reasoning to extract and align local regions responsive to different experts. Extensive experiments on three AGPReID benchmarks, AG-ReID.v2, CARGO, and LAGPeR, demonstrate that ViSA consistently achieves superior performance, with a notable 10.06% mAP improvement on the challenging CARGO cross-view protocol. The code is available at https://github.com/Cat-Zero/ViSA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ViSA, a view-aware semantic alignment framework for Aerial-Ground Person Re-Identification (AGPReID). It consists of an Expert-driven Token Generation Module (ETGM) that uses view-aware experts to generate adaptive semantic queries for viewpoint-specific patterns, and a Dual-branch Local Fusion Module (DLFM) that employs graph reasoning to extract and align local regions. The method is evaluated on three benchmarks (AG-ReID.v2, CARGO, LAGPeR), reporting superior performance including a 10.06% mAP improvement on the CARGO cross-view protocol.

Significance. If the empirical results hold, this work challenges the conventional view-invariant paradigm in cross-view ReID by demonstrating that preserving view-specific cues can lead to better identity discriminability. The code release supports reproducibility. This could influence future designs in handling drastic viewpoint variations in person re-identification tasks.

major comments (2)
  1. [§3.1 (ETGM)] The construction of view-aware experts and adaptive queries does not include an explicit mechanism to enforce that the generated tokens correspond to consistent identity features across aerial and ground views. This is load-bearing for the central claim, as the performance gains might arise from modeling view-specific variations rather than achieving true semantic alignment.
  2. [§3.2 (DLFM)] The graph-based local fusion aligns regions responsive to different experts, but lacks a cross-view identity consistency constraint (such as a matching loss between corresponding local features of the same identity). Without this, the alignment may connect locally salient but identity-inconsistent patterns, undermining the claim of improved semantic consistency over view-invariant methods.
minor comments (2)
  1. [Abstract] The abstract mentions empirical improvements but does not specify the baselines used or report statistical significance of the results.
  2. [Experiments] Details on ablation studies for the number of view-aware experts and their impact on performance should be expanded for clarity.
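
To make the second major comment concrete, the sketch below shows one possible form of the "matching loss between corresponding local features of the same identity" that the referee mentions. It is a hypothetical construction, not something the paper defines: it assumes the local features of an aerial and a ground image of one identity are already paired index by index, and it penalises each pair by one minus its cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def local_matching_loss(aerial_parts, ground_parts):
    """Mean (1 - cosine) over index-paired local features of one identity.

    aerial_parts, ground_parts: lists of local feature vectors, assumed to be
    in correspondence (part i in one view matches part i in the other).
    """
    if len(aerial_parts) != len(ground_parts):
        raise ValueError("both views must supply the same number of parts")
    if not aerial_parts:
        return 0.0
    total = sum(1.0 - cosine(a, g) for a, g in zip(aerial_parts, ground_parts))
    return total / len(aerial_parts)
```

The loss is 0 when every paired part points in the same direction and 1 when the pairs are orthogonal. How the correspondence between parts would be established across such different viewpoints is the hard part, and this sketch leaves it open.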

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below with clarifications on the implicit mechanisms in ViSA and note planned revisions to strengthen the presentation of identity consistency.

Point-by-point responses
  1. Referee: [§3.1 (ETGM)] The construction of view-aware experts and adaptive queries does not include an explicit mechanism to enforce that the generated tokens correspond to consistent identity features across aerial and ground views. This is load-bearing for the central claim, as the performance gains might arise from modeling view-specific variations rather than achieving true semantic alignment.

    Authors: We appreciate the referee highlighting this aspect of the ETGM. The view-aware experts share parameters across aerial and ground inputs and generate adaptive queries conditioned on the input features; end-to-end training with identity classification and triplet losses on cross-view pairs encourages the resulting tokens to emphasize identity-discriminative patterns that remain consistent despite viewpoint differences. The reported gains on cross-view protocols (e.g., 10.06% mAP on CARGO) are consistent with semantic alignment rather than isolated view-specific modeling. To make the consistency argument more explicit, we will add a clarifying paragraph in §3.1 and an ablation measuring token similarity for matched identities across views.

    Revision: partial

  2. Referee: [§3.2 (DLFM)] The graph-based local fusion aligns regions responsive to different experts, but lacks a cross-view identity consistency constraint (such as a matching loss between corresponding local features of the same identity). Without this, the alignment may connect locally salient but identity-inconsistent patterns, undermining the claim of improved semantic consistency over view-invariant methods.

    Authors: We thank the referee for this observation on the DLFM. Graph edges are formed between local patches from the two branches that exhibit high feature similarity under the same expert; the global identity and metric losses then back-propagate consistency signals to these locally aligned regions. While an explicit local matching loss is not present, the current design ties local alignment directly to identity supervision. We agree an additional local consistency term could further reinforce the claim and will incorporate either a lightweight local matching loss or a quantitative analysis of local feature correspondence in the revised manuscript and supplementary material.

    Revision: partial
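
The edge-construction rule described in the second response can be sketched as follows. This is a simplified reading, not the DLFM itself: it assumes each branch supplies a list of local patch features for one expert, links patches whose cosine similarity reaches a hypothetical threshold, and then runs a single round of mean aggregation in place of the paper's graph reasoning.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def build_edges(branch_a, branch_b, threshold=0.5):
    """Return (i, j) index pairs whose cosine similarity reaches the threshold."""
    return [
        (i, j)
        for i, a in enumerate(branch_a)
        for j, b in enumerate(branch_b)
        if cosine(a, b) >= threshold
    ]


def aggregate(branch_a, branch_b, edges):
    """One round of mean aggregation over the cross-branch edges.

    Each branch-A patch is replaced by the average of itself and every
    branch-B patch it is linked to; unlinked patches are left unchanged.
    """
    fused = []
    for i, a in enumerate(branch_a):
        group = [a] + [branch_b[j] for (p, j) in edges if p == i]
        fused.append([sum(col) / len(group) for col in zip(*group)])
    return fused
```

Nothing in this rule checks identity: two patches are linked purely because they look alike under the same expert. That is the gap the referee points to, and it is why the authors fall back on global identity supervision to keep the linked regions consistent.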

Circularity Check

0 steps flagged

No circularity: empirical performance on external benchmarks

Full rationale

The paper introduces ViSA with two new modules (ETGM for view-aware expert queries and DLFM for graph-based local fusion) and reports mAP gains on public AGPReID datasets (AG-ReID.v2, CARGO, LAGPeR). These gains are measured against external test protocols rather than being algebraically forced by any fitted parameter or self-referential definition inside the method. No derivation chain reduces a claimed result to an input by construction, and no load-bearing step relies on a self-citation that itself lacks independent verification. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that view-specific cues carry useful identity signal and on the modeling choice to implement that signal via expert tokens and graph reasoning; no explicit free parameters or invented physical entities are named in the abstract.

free parameters (1)
  • number of view-aware experts
    Hyperparameter controlling how many adaptive semantic queries are generated; value not stated in abstract.
axioms (1)
  • Domain assumption: View-specific cues contain discriminative identity information that should be preserved rather than discarded by invariance
    Explicitly stated as motivation for moving away from view-invariant paradigm.
invented entities (1)
  • View-aware experts (no independent evidence)
    purpose: Generate adaptive semantic queries that perceive viewpoint-specific patterns
    New component introduced in the ETGM module

pith-pipeline@v0.9.0 · 5755 in / 1229 out tokens · 34029 ms · 2026-05-20T11:11:46.236364+00:00 · methodology

