
arXiv: 2605.05054 · v1 · submitted 2026-05-06 · 💻 cs.CV · cs.AI · cs.LG

Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Pith reviewed 2026-05-08 17:29 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords flow matching · few-shot adaptation · vision-language models · manifold geometry · radial angular decoupling · cross-modal alignment · warped product manifold

The pith

Decoupling radial and angular dynamics in flow matching resolves geometric issues in vision-language model adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing flow matching methods for few-shot adaptation of vision-language models suffer from angular dynamics distortion, neglect of radial dynamics, and loss of dataset-specific context due to incompatible geometric priors in pre-trained features. The paper proposes direct product flow matching as a way to model alignment on a decoupled cylindrical manifold where radial and angular components evolve independently. This approach uses a constant-warping metric on a warped product manifold to enable constant-speed angular transport and independent radial evolution. By also conditioning the flow on hidden states from the pre-trained model, it recovers missing information and achieves improved performance on multi-step adaptation tasks.
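
The difference between coupled and decoupled paths can be checked numerically. The sketch below is illustrative only and is not the authors' code: it assumes the coupled baseline is a straight Euclidean chord between normalized features (renormalized to read off its direction), and it assumes the decoupled path interpolates the radius linearly while moving the direction along a spherical geodesic. The function names and the random 8-dimensional features are hypothetical.

```python
import numpy as np

def polar_decompose(x):
    """Split a feature vector into radius r = ||x|| and unit direction u = x / r."""
    r = np.linalg.norm(x)
    return r, x / r

def slerp(u0, u1, t):
    """Constant-speed geodesic between two non-parallel unit vectors on the sphere."""
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    return (np.sin((1 - t) * omega) * u0 + np.sin(t * omega) * u1) / np.sin(omega)

def decoupled_path(x0, x1, t):
    """Assumed direct-product interpolant: linear radius, geodesic direction."""
    r0, u0 = polar_decompose(x0)
    r1, u1 = polar_decompose(x1)
    return (1 - t) * r0 + t * r1, slerp(u0, u1, t)

def angular_steps(directions):
    """Angle travelled between consecutive unit directions."""
    cos = np.sum(directions[:-1] * directions[1:], axis=1)
    return np.arccos(np.clip(cos, -1.0, 1.0))

rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=8), 3.0 * rng.normal(size=8)  # toy stand-ins for features
ts = np.linspace(0.0, 1.0, 11)

# Coupled baseline (assumption): chord between normalized endpoints, renormalized.
_, u0 = polar_decompose(x0)
_, u1 = polar_decompose(x1)
chord = np.stack([(1 - t) * u0 + t * u1 for t in ts])
chord /= np.linalg.norm(chord, axis=1, keepdims=True)

decoupled = [decoupled_path(x0, x1, t) for t in ts]
radii = np.array([r for r, _ in decoupled])
geodesic = np.stack([u for _, u in decoupled])

chord_steps = angular_steps(chord)
geodesic_steps = angular_steps(geodesic)
print("chord steps:   ", np.round(chord_steps, 4))
print("geodesic steps:", np.round(geodesic_steps, 4))
```

Under these assumptions the geodesic steps are equal across time, while the chord covers more angle per step near its midpoint, which is the kind of non-uniform angular speed the summary describes.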

Core claim

The authors claim that by reformulating the alignment process on a warped product manifold and deriving a direct product manifold via a constant-warping metric, direct product flow matching allows independent radial evolution and constant-speed angular geodesic transport. This eliminates the angular dynamics distortion present in prior methods while preserving radial consistency, and when augmented with classifier-free guidance from model hidden states, it leads to state-of-the-art results in multi-step few-shot adaptation across multiple benchmarks.
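
For readers unfamiliar with classifier-free guidance, the following is a minimal sketch of the commonly used combination rule applied to a velocity field. It is an assumption that DP-FM combines conditional and unconditional predictions in this form; the excerpt does not give the paper's exact formula. `toy_velocity`, the `scale` parameter, and the placeholder embedding `h` are hypothetical.

```python
import numpy as np

def guided_velocity(velocity_fn, x, t, cond, scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    v_uncond = velocity_fn(x, t, None)   # condition dropped
    v_cond = velocity_fn(x, t, cond)     # condition supplied
    return v_uncond + scale * (v_cond - v_uncond)

def toy_velocity(x, t, cond):
    """Stand-in for a trained network; cond=None plays the role of the null condition."""
    base = -x
    return base if cond is None else base + cond

x = np.array([1.0, -2.0])
h = np.array([0.5, 0.5])  # placeholder for a hidden-state embedding (hypothetical)
v0 = guided_velocity(toy_velocity, x, 0.3, h, scale=0.0)  # purely unconditional
v1 = guided_velocity(toy_velocity, x, 0.3, h, scale=1.0)  # purely conditional
v2 = guided_velocity(toy_velocity, x, 0.3, h, scale=2.0)  # extrapolated past conditional
```

A scale of 0 recovers the unconditional field and a scale of 1 the conditional one; larger values push further in the direction the condition suggests.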

What carries the argument

The direct product manifold obtained from a constant-warping metric on the warped product manifold, enabling decoupled radial and angular flows.
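
A short worked form of this statement may help. The truncated Figure 4 text quotes the metric as ds2 = dr2 + ϕ(r) 2dθ…; the block below completes it in the standard warped-product form, which is an assumption about the paper's notation. Here the norm of the angular velocity is taken on the unit sphere, L denotes the conserved angular momentum of a geodesic, and c is a hypothetical name for the constant warping value.

```latex
% Warped product metric on (r, \theta)
ds^2 = dr^2 + \phi(r)^2 \, d\theta^2

% Along a geodesic, angular momentum is conserved, so angular speed depends on r(t)
\phi(r(t))^2 \, \lVert \dot{\theta}(t) \rVert = L
\quad\Longrightarrow\quad
\lVert \dot{\theta}(t) \rVert = \frac{L}{\phi(r(t))^2}

% Constant warping \phi(r) \equiv c gives a direct product (cylinder)
ds^2 = dr^2 + c^2 \, d\theta^2 ,
\qquad
\lVert \dot{\theta}(t) \rVert = \frac{L}{c^2} = \text{const}
```

With φ(r) = r (Euclidean polar coordinates) or φ(r) = sinh r (hyperbolic space of curvature −1), the angular speed changes whenever the radius changes; only a constant φ removes that dependence.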

If this is right

  • Angular movement occurs at constant speed, reducing training difficulty and truncation errors.
  • Radial dynamics are preserved to distinguish distribution shifts and modality confidence.
  • Dataset-specific information is recovered through conditioning on pre-trained hidden states.
  • DP-FM reports state-of-the-art results for multi-step few-shot adaptation across 11 benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar decoupling strategies could improve flow matching in other domains like image generation or time-series modeling.
  • The emphasis on manifold geometry suggests that choosing appropriate Riemannian structures may be key to advancing continuous generative models.
  • Future work might explore variable warping metrics beyond the constant case for more flexible adaptations.

Load-bearing premise

Pre-trained cross-modal features have incompatible geometric priors that lead to the three specific limitations, and a constant-warping metric fully fixes them without new trade-offs.

What would settle it

Experiments showing that DP-FM does not reduce angular-speed variation relative to FMA and WP-FM, or that it fails to outperform those baselines on the 11 benchmarks, would undermine the central claim.
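
One way such a check could be run is to integrate a velocity field with a first-order solver and measure how much the per-step angle varies. The sketch below is a hypothetical diagnostic, not the paper's evaluation protocol: `euler_trajectory`, `angular_speed_cv`, and the two toy velocity fields are illustrative names, and the coefficient of variation is one possible summary statistic among several.

```python
import numpy as np

def euler_trajectory(velocity_fn, x0, num_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with explicit Euler steps."""
    xs = [np.asarray(x0, dtype=float)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        xs.append(xs[-1] + dt * velocity_fn(xs[-1], k * dt))
    return np.stack(xs)

def angular_speed_cv(trajectory):
    """Coefficient of variation (std / mean) of the angle covered per step."""
    units = trajectory / np.linalg.norm(trajectory, axis=1, keepdims=True)
    cos = np.sum(units[:-1] * units[1:], axis=1)
    steps = np.arccos(np.clip(cos, -1.0, 1.0))
    return steps.std() / steps.mean()

x0 = np.array([1.0, 0.0])
x1 = np.array([0.0, 1.0])
omega = np.pi / 2
J = np.array([[0.0, -1.0], [1.0, 0.0]])  # generator of planar rotations

# Toy field 1: constant Euclidean velocity, so the path is the chord from x0 to x1.
chord_traj = euler_trajectory(lambda x, t: x1 - x0, x0, 20)
# Toy field 2: pure rotation, so every Euler step turns by the same angle.
rot_traj = euler_trajectory(lambda x, t: omega * (J @ x), x0, 20)

cv_chord = angular_speed_cv(chord_traj)
cv_rotation = angular_speed_cv(rot_traj)
print(f"chord CV: {cv_chord:.4f}, rotation CV: {cv_rotation:.2e}")
```

A value near zero indicates near-constant angular speed. Applied to trajectories sampled from trained FMA, WP-FM, and DP-FM models, the same statistic would give a direct comparison of the kind this section asks for.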

Figures

Figures reproduced from arXiv: 2605.05054 by Bowei Zhu, Hongxiang Li, Hongxu Chen, Lin Li, Long Chen, Rui Liu, Yanghao Wang, Zhen Wang, Ziqi Jiang.

Figure 1: (a) Single-step parameter-efficient fine-tuning (PEFT) mostly performs cross-modal alignment in a single-step manner. (b) Multi-step flow matching (FM) methods model continuous and multi-step alignment dynamics. During the training stage, (c) FMA undergoes a non-uniform angular speed induced by radial–angular coupling. However, (d) DP-FM follows a constant-speed angular geodesic due to decoupled radial…
Figure 2: Comparison between FMA [10], WP-FM (Hyperbolic) [11], and DP-FM on the Aircraft dataset at 100 epochs. DP-FM shows enhanced accuracy and more uniform angular speed across time step t at inference.
Specifically, FMA first normalizes the pre-trained cross-modal features (e.g., CLIP image and text features x_0, x_1) into unit-length pairs (x̄_0, x̄_1), where x̄_i = x_i/‖x_i‖_2 for i ∈ {0, 1}. It then trains a velocity n…
Figure 3: Illustration of DP-FM.
Velocity Decomposition. Standard neural networks typically output velocity predictions in the ambient Euclidean space, denoted as v_ψ(x_t, t) ∈ ℝ^d. To adhere to our warped product manifold geometry, this output is explicitly projected onto the tangent spaces T M_r and T M_θ. Specifically, as shown in…
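
The projection step described in this passage can be sketched as follows. The excerpt is truncated before it gives the paper's formula, so this uses the standard orthogonal split of an ambient vector into a radial part along x and a tangential part orthogonal to x; treating that as the paper's projection is an assumption, and `decompose_velocity` is a hypothetical name.

```python
import numpy as np

def decompose_velocity(x, v):
    """Split an ambient velocity v at point x into radial and tangential parts."""
    u = x / np.linalg.norm(x)        # unit direction of x
    v_radial = np.dot(v, u)          # scalar rate of change of the radius
    v_tangent = v - v_radial * u     # component orthogonal to x, tangent to the sphere
    return v_radial, v_tangent

x = np.array([3.0, 0.0, 4.0])
v = np.array([1.0, 2.0, -1.0])
v_r, v_t = decompose_velocity(x, v)
print(v_r, v_t, np.dot(v_t, x))
```

The two parts add back to the original vector, and the tangential part has zero inner product with x, so each can be supervised or integrated separately.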
Figure 4: Comparison between WP-FM (Euclidean [10], Hyperbolic [11]) and DP-FM on (a) Aircraft, (b) DTD, and (c) UCF datasets at 20 epochs.
Metric-Aware Loss Objective. Unlike standard FM, which minimizes mean squared error (MSE) uniformly across Euclidean dimensions, WP-FM utilizes a specific optimization objective function corresponding to the Riemannian manifold. Guided by the metric tensor ds² = dr² + ϕ(r)² dθ…
Figure 5: Radial Magnitude Distribution across Datasets. As illustrated in…
Original abstract

Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models, by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations in them: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information loss during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs' hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks have demonstrated that DP-FM achieves a new state-of-the-art for multi-step few-shot adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing flow matching methods for few-shot adaptation of vision-language models suffer from three geometric limitations (angular dynamics distortion from radial-angular coupling, radial dynamics neglect due to normalization, and loss of dataset-specific context). It proposes a Riemannian framework called warped product flow matching (WP-FM) on a warped product manifold; by introducing a constant-warping metric, it derives direct product flow matching (DP-FM) on a decoupled cylindrical direct product manifold that enables independent radial evolution and constant-speed angular geodesic transport. Classifier-free guidance is added by conditioning on pre-trained VLM hidden states. Experiments across 11 benchmarks are reported to show that DP-FM achieves new state-of-the-art performance for multi-step few-shot adaptation.

Significance. If the constant-warping metric indeed produces a manifold on which the learned flow matching velocity field preserves exact constant-speed angular geodesics without new truncation or fitting artifacts, and if the reported gains are attributable to this decoupling, the work would supply a principled geometric prior for continuous cross-modal alignment that addresses overlooked incompatibilities in pre-trained features. The polar decomposition analysis itself offers a reusable lens for diagnosing limitations in other flow-based adaptation methods.

major comments (2)
  1. The central derivation (abstract and the section introducing DP-FM) asserts that a constant-warping metric on the warped product manifold yields a direct product manifold with exactly constant-speed angular geodesics that eliminate angular dynamics distortion. However, flow matching optimizes a velocity field by regression to a target vector field; the manuscript does not provide a proof or bound showing that the optimal learned field on this manifold remains geodesic (or that discretization error remains negligible) once the metric is fixed. This leaves open whether the claimed elimination of non-uniform angular speed survives training and sampling.
  2. The experimental section reports new SOTA results across 11 benchmarks, but the abstract and available description provide no controls isolating the contribution of the constant-warping metric versus the added classifier-free guidance or other implementation choices. Without ablations that vary only the manifold metric while holding the velocity network and integrator fixed, it is difficult to confirm that performance gains stem from the geometric decoupling rather than ancillary factors.
minor comments (2)
  1. The abstract would be strengthened by including the explicit definition of the constant-warping metric and the resulting metric tensor on the direct product manifold.
  2. Notation for the radial and angular sub-manifolds should be introduced with a brief equation or diagram early in the geometric analysis section to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important aspects of the theoretical grounding and experimental validation of the constant-warping metric and its contribution to performance. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: The central derivation (abstract and the section introducing DP-FM) asserts that a constant-warping metric on the warped product manifold yields a direct product manifold with exactly constant-speed angular geodesics that eliminate angular dynamics distortion. However, flow matching optimizes a velocity field by regression to a target vector field; the manuscript does not provide a proof or bound showing that the optimal learned field on this manifold remains geodesic (or that discretization error remains negligible) once the metric is fixed. This leaves open whether the claimed elimination of non-uniform angular speed survives training and sampling.

    Authors: We agree that a formal guarantee would strengthen the presentation. By construction, the constant-warping metric defines the direct-product manifold such that the target vector field for flow matching corresponds exactly to constant-speed angular geodesics plus independent radial motion; the regression objective therefore learns a velocity field whose angular component is geodesic on that manifold. We will add a dedicated theoretical subsection that (i) recalls the metric-induced decomposition, (ii) shows that any velocity field obtained by exact regression to the target inherits the constant-speed property, and (iii) provides a first-order bound on the deviation introduced by finite discretization and network approximation error, supported by additional diagnostic plots of angular speed during sampling. revision: yes

  2. Referee: The experimental section reports new SOTA results across 11 benchmarks, but the abstract and available description provide no controls isolating the contribution of the constant-warping metric versus the added classifier-free guidance or other implementation choices. Without ablations that vary only the manifold metric while holding the velocity network and integrator fixed, it is difficult to confirm that performance gains stem from the geometric decoupling rather than ancillary factors.

    Authors: We concur that isolating the geometric contribution is essential. In the revised version we will insert a new ablation table that compares three controlled settings while keeping the velocity network architecture, optimizer, number of steps, and classifier-free guidance identical: (1) standard FM on the original coupled manifold, (2) WP-FM with the learned warping function, and (3) DP-FM with the fixed constant-warping metric. This isolates the effect of the metric choice itself and will be reported on a representative subset of the 11 benchmarks together with the original results. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation is self-contained from geometric analysis.

Full rationale

The abstract and provided text derive DP-FM by first analyzing existing FM methods via polar decomposition to identify three limitations, then proposing a warped product manifold with constant-warping metric to yield a decoupled cylindrical manifold enabling independent radial evolution and constant-speed angular geodesics. No equations, self-citations, or steps are exhibited that reduce any prediction or result to fitted inputs, self-definitions, or prior author work by construction. The central claims rest on manifold geometry and conditioning rather than tautological renaming or load-bearing self-references. The derivation chain is therefore independent and self-contained against external geometric priors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on Riemannian manifold assumptions for feature geometry and the introduction of new manifold constructs without external validation beyond performance claims.

axioms (2)
  • domain assumption: Pre-trained cross-modal features admit a polar decomposition into radial and angular sub-manifolds.
    Invoked to identify the three limitations in existing FM methods.
  • ad hoc to paper: A constant-warping metric yields a decoupled cylindrical direct product manifold.
    Introduced to enable independent radial and angular evolution.
invented entities (2)
  • warped product manifold · no independent evidence
    purpose: Reformulate cross-modal alignment to resolve geometric constraints.
    New framework proposed to address angular distortion and radial neglect.
  • direct product manifold (cylindrical) · no independent evidence
    purpose: Enable decoupled radial evolution and constant-speed angular transport.
    Derived via constant-warping metric for DP-FM.



Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

  [1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
  [2] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
  [3] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023.
  [4] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  [5] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900. PMLR, 2022.
  [6] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
  [7] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022.
  [8] Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.
  [9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
  [10] Ziqi Jiang, Yanghao Wang, and Long Chen. Exploring cross-modal flows for few-shot learning. In ICLR, 2026.
  [11] Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li, Jun Xiao, Kwang-Ting Cheng, and Long Chen. Path-decoupled hyperbolic flow matching for few-shot adaptation. arXiv preprint, 2026.
  [12] Minsoo Jo, Dongyoon Yang, and Taesup Kim. Angular gradient sign method: Uncovering vulnerabilities in hyperbolic networks. In AAAI, volume 40, pages 5566–5574, 2026.
  [13] Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In ICML, pages 7694–7731. PMLR, 2023.
  [14] Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, and Andrew Beng Jin Teoh. Understanding the feature norm for out-of-distribution detection. In ICCV, pages 1557–1567, 2023.
  [15] Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. arXiv preprint, 2023.
  [16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022.
  [17] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, pages 18082–18091, 2022.
  [18] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
  [19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021.
  [20] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint, 2024.
  [21] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, pages 2790–2799. PMLR, 2019.
  [22] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. IJCV, 132(2):581–595, 2024.
  [23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint, 2022.
  [24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint, 2022.
  [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  [26] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, pages 23–40. Springer, 2024.
  [27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
  [28] Hongxu Chen, Hongxiang Li, Zhen Wang, and Long Chen. Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint, 2026.
  [29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
  [30] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint, 2020.
  [31] Barrett O'Neill. Semi-Riemannian geometry with applications to relativity, volume 103. Academic Press, 1983.
  [32] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
  [33] Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint, 2023.
  [34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, pages 13213–13232. PMLR, 2023.
  [35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint, 2013.
  [36] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. J-STARS, 12(7):2217–2226, 2019.
  [37] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
  [38] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.
  [39] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV, pages 554–561, 2013.
  [40] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012.
  [41] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint, 2012.
  [42] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008.
  [43] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR, pages 178–178. IEEE, 2004.
  [44] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014.
  [45] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  [46] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, pages 493–510. Springer, 2022.
  [47] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models. arXiv preprint, 2022.
  [48] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.
  [49] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In ICCV, pages 15659–15669, 2023.
  [50] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. NeurIPS, 37:56424–56445, 2024.
  [51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017.

A Angular Truncation Error of WP-FM

In this section, we provide a detailed derivation of the angular truncation error for first-order Riemannian ODE solvers under the warped product geometry introduced in Sec. 3. Proposition 1 (Angular Truncation Error). Let ...