Direct Product Flow Matching: Decoupling Radial and Angular Dynamics for Few-Shot Adaptation

Bowei Zhu; Hongxiang Li; Hongxu Chen; Lin Li; Long Chen; Rui Liu; Yanghao Wang; Zhen Wang; Ziqi Jiang

arXiv: 2605.05054 · v1 · submitted 2026-05-06 · cs.CV · cs.AI · cs.LG


Pith reviewed 2026-05-08 17:29 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG

keywords flow matching · few-shot adaptation · vision-language models · manifold geometry · radial angular decoupling · cross-modal alignment · warped product manifold


The pith

Decoupling radial and angular dynamics in flow matching resolves geometric issues in vision-language model adaptation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing flow matching methods for few-shot adaptation of vision-language models suffer from angular dynamics distortion, neglect of radial dynamics, and loss of dataset-specific context due to incompatible geometric priors in pre-trained features. The paper proposes direct product flow matching as a way to model alignment on a decoupled cylindrical manifold where radial and angular components evolve independently. This approach uses a constant-warping metric on a warped product manifold to enable constant-speed angular transport and independent radial evolution. By also conditioning the flow on hidden states from the pre-trained model, it recovers missing information and achieves improved performance on multi-step adaptation tasks.

Core claim

The authors claim that by reformulating the alignment process on a warped product manifold and deriving a direct product manifold via a constant-warping metric, direct product flow matching allows independent radial evolution and constant-speed angular geodesic transport. This eliminates the angular dynamics distortion present in prior methods while preserving radial consistency, and when augmented with classifier-free guidance from model hidden states, it leads to state-of-the-art results in multi-step few-shot adaptation across multiple benchmarks.

What carries the argument

The direct product manifold obtained from a constant-warping metric on the warped product manifold, enabling decoupled radial and angular flows.
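As a reading aid, the geometry described above can be written out explicitly. This is a sketch under stated assumptions rather than the paper's own derivation: it takes the warped product metric quoted in the Figure 4 text, reads $d\theta^2$ as the round metric on the unit sphere $S^{d-1}$, and uses the symbol $c$ for the constant warping value (the symbol is ours).

```latex
% Warped product metric on (0, \infty) \times S^{d-1};
% d\theta^2 is assumed to denote the round metric on the unit sphere.
\begin{align}
  ds^2 &= dr^2 + \phi(r)^2 \, d\theta^2
       && \text{(general warped product)} \\
  \phi(r) = r: \quad
  ds^2 &= dr^2 + r^2 \, d\theta^2
       && \text{(Euclidean space in polar coordinates)} \\
  \phi(r) \equiv c > 0: \quad
  ds^2 &= dr^2 + c^2 \, d\theta^2
       && \text{(direct product, i.e. a cylinder)}
\end{align}
% With constant warping the metric is a product metric, so a geodesic
% from (r_0, \theta_0) to (r_1, \theta_1) factorizes into
%   r(t) = (1 - t)\, r_0 + t\, r_1
% and a constant-speed great-circle arc \theta(t) on the sphere.
```

The last comment is the standard fact that geodesics of a product metric are products of geodesics, which is what lets the radial and angular flows evolve independently.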

If this is right

Angular movement occurs at constant speed, reducing training difficulty and truncation errors.
Radial dynamics are preserved to distinguish distribution shifts and modality confidence.
Dataset-specific information is recovered through conditioning on pre-trained hidden states.
Superior performance on 11 benchmarks for few-shot adaptation.
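The first consequence can be illustrated numerically. The sketch below compares a straight Euclidean chord between two unit vectors with a great-circle arc (slerp) and measures the angle covered in each of ten equal time steps. Treating the coupled path as a straight chord is an illustrative assumption, not a statement of how FMA builds its paths; `slerp`, `angular_steps`, and the toy endpoints are hypothetical names and values.

```python
import numpy as np

def slerp(u0, u1, t):
    """Constant-speed great-circle interpolation between unit vectors."""
    omega = np.arccos(np.clip(np.dot(u0, u1), -1.0, 1.0))
    if omega < 1e-8:
        return u0.copy()
    return (np.sin((1 - t) * omega) * u0 + np.sin(t * omega) * u1) / np.sin(omega)

def angle_between(a, b):
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angular_steps(path):
    """Angle swept between consecutive points of a path."""
    return np.array([angle_between(path[i], path[i + 1])
                     for i in range(len(path) - 1)])

# Toy endpoints: two orthogonal unit vectors (assumed, for illustration only).
u0 = np.array([1.0, 0.0, 0.0])
u1 = np.array([0.0, 1.0, 0.0])
ts = np.linspace(0.0, 1.0, 11)

chord_path = [(1 - t) * u0 + t * u1 for t in ts]  # straight line, leaves the sphere
arc_path = [slerp(u0, u1, t) for t in ts]         # stays on the sphere

chord_steps = angular_steps(chord_path)
arc_steps = angular_steps(arc_path)
print("chord angular steps:", np.round(chord_steps, 4))
print("arc angular steps:  ", np.round(arc_steps, 4))
```

Both paths sweep the same total angle, but the chord moves slowly in angle near the endpoints and quickly in the middle, while the arc covers the same angle in every step. A regression target with uniform angular speed is the easier of the two to fit and to integrate with a fixed step size.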

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar decoupling strategies could improve flow matching in other domains like image generation or time-series modeling.
The emphasis on manifold geometry suggests that choosing appropriate Riemannian structures may be key to advancing continuous generative models.
Future work might explore variable warping metrics beyond the constant case for more flexible adaptations.

Load-bearing premise

Pre-trained cross-modal features have incompatible geometric priors that lead to the three specific limitations, and a constant-warping metric fully fixes them without new trade-offs.

What would settle it

Experiments demonstrating that direct product flow matching (DP-FM) does not reduce angular speed variation, or that it fails to outperform baselines on the benchmarks, would disprove the central claim.

Figures

Figures reproduced from arXiv: 2605.05054 by Bowei Zhu, Hongxiang Li, Hongxu Chen, Lin Li, Long Chen, Rui Liu, Yanghao Wang, Zhen Wang, Ziqi Jiang.

**Figure 1.** (a) Single-step parameter-efficient fine-tuning (PEFT) mostly performs cross-modal alignment in a single-step manner. (b) Multi-step flow matching (FM) methods model continuous and multi-step alignment dynamics. During the training stage, (c) FMA undergoes a non-uniform angular speed induced by radial–angular coupling. However, (d) DP-FM follows a constant-speed angular geodesic due to decoupled radial…

**Figure 2.** Comparison between FMA [10], WP-FM (Hyperbolic) [11], and DP-FM on the Aircraft dataset at 100 epochs. DP-FM shows enhanced accuracy and more uniform angular speed across time step $t$ at inference. Specifically, FMA first normalizes the pretrained cross-modal features (e.g., CLIP image and text features $x_0, x_1$) into unit-length pairs $(\bar{x}_0, \bar{x}_1)$, where $\bar{x}_i = x_i / \|x_i\|_2$ for $i \in \{0, 1\}$. It then trains a velocity n…

**Figure 3.** Illustration of DP-FM. Velocity Decomposition. Standard neural networks typically output velocity predictions in the ambient Euclidean space, denoted as $v_\psi(x_t, t) \in \mathbb{R}^d$. To adhere to our warped product manifold geometry, this output is explicitly projected onto the tangent spaces $T\mathcal{M}_r$ and $T\mathcal{M}_\theta$. Specifically, as shown in…

**Figure 4.** Comparison between WP-FM (Euclidean [10], Hyperbolic [11]) and DP-FM on the (a) Aircraft, (b) DTD, and (c) UCF datasets at 20 epochs. Metric-Aware Loss Objective. Unlike standard FM, which minimizes mean squared error (MSE) uniformly across Euclidean dimensions, WP-FM utilizes a specific optimization objective function corresponding to the Riemannian manifold. Guided by the metric tensor $ds^2 = dr^2 + \phi(r)^2 \, d\theta$…

**Figure 5.** Radial Magnitude Distribution across Datasets. As illustrated in…
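The text attached to Figures 3 and 4 describes two steps: projecting an ambient velocity onto the radial and angular tangent spaces, and weighting the regression error by the metric. A minimal NumPy sketch of both follows. The paper's actual projection and loss are truncated in this excerpt, so the function names `split_velocity` and `metric_loss`, the argument `phi`, and the choice to express the angular part as the velocity of the unit direction are all assumptions made for illustration.

```python
import numpy as np

def split_velocity(x, v):
    """Split an ambient velocity v at point x into radial and angular parts.

    Writing x = r * u with r = ||x|| and u on the unit sphere gives
    dx/dt = (dr/dt) * u + r * (du/dt), hence
      dr/dt = <v, u>               (scalar radial speed)
      du/dt = (v - <v, u> u) / r   (tangent to the sphere at u)
    """
    r = np.linalg.norm(x)
    u = x / r
    v_r = float(np.dot(v, u))
    v_theta = (v - v_r * u) / r
    return v_r, v_theta

def metric_loss(x, v_pred, v_target, phi=lambda r: 1.0):
    """Squared error measured with ds^2 = dr^2 + phi(r)^2 dtheta^2.

    The default constant phi corresponds to a direct product (cylinder);
    phi(r) = r recovers the ordinary Euclidean squared error.
    """
    r = np.linalg.norm(x)
    pred_r, pred_theta = split_velocity(x, v_pred)
    tgt_r, tgt_theta = split_velocity(x, v_target)
    radial_term = (pred_r - tgt_r) ** 2
    angular_term = phi(r) ** 2 * np.sum((pred_theta - tgt_theta) ** 2)
    return radial_term + angular_term

x = np.array([3.0, 4.0])
v_pred = np.array([1.0, 2.0])
v_target = np.zeros(2)
print("cylinder loss: ", metric_loss(x, v_pred, v_target))
print("Euclidean loss:", metric_loss(x, v_pred, v_target, phi=lambda r: r))
```

With `phi(r) = r` the two terms add up to the plain squared Euclidean error, which is a quick sanity check that the decomposition loses nothing. With a constant `phi` the angular term no longer scales with the feature norm, so radial and angular errors are penalized independently.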

Original abstract

Recent flow matching (FM) methods improve the few-shot adaptation of vision-language models, by modeling cross-modal alignment as a continuous multi-step flow. In this paper, we argue that existing FM methods are inherently constrained by incompatible geometric priors on pre-trained cross-modal features, resulting in suboptimal adaptation performance. We first analyze these methods from a polar decomposition perspective (i.e., radial and angular sub-manifolds). Under this new geometric view, we identify three overlooked limitations in them: 1) Angular dynamics distortion: The radial-angular coupling induces non-uniform speed on the angular sub-manifold, leading to regression training difficulty and extra truncation errors. 2) Radial dynamics neglect: Feature normalization discards modality confidence, failing to distinguish out-of-distribution and in-distribution data, and abandoning crucial radial dynamics. 3) Context-agnostic unconditional flow: Dataset-specific information loss during pre-trained cross-modal feature extraction remains unrecovered. To resolve these issues, we propose warped product flow matching (WP-FM), a unified Riemannian framework that reformulates alignment on a warped product manifold. Within this framework, we derive direct product flow matching (DP-FM) by introducing a constant-warping metric, which yields a decoupled cylindrical manifold (i.e., direct product manifold). DP-FM enables independent radial evolution and constant-speed angular geodesic transport, effectively eliminating angular dynamics distortion while preserving radial consistency. Meanwhile, we incorporate classifier-free guidance by conditioning the flow on the pre-trained VLMs' hidden states to inject missing dataset-specific information. Extensive results across 11 benchmarks have demonstrated that DP-FM achieves a new state-of-the-art for multi-step few-shot adaptation.
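The abstract's last ingredient, classifier-free guidance, can be sketched in a few lines. The abstract does not say how the conditional and unconditional branches are combined, so the linear extrapolation below is the standard rule from classifier-free guidance applied to a velocity field, used here as an assumption. `velocity_fn`, `toy_velocity`, and the vector `cond` (standing in for a VLM hidden state) are hypothetical.

```python
import numpy as np

def guided_velocity(velocity_fn, x, t, cond, guidance_scale):
    """Extrapolate from the unconditional branch toward the conditional one.

    velocity_fn(x, t, cond) is a hypothetical model; cond=None selects the
    unconditional branch. A scale of 0 gives the unconditional velocity,
    1 gives the conditional velocity, and values above 1 amplify the condition.
    """
    v_uncond = velocity_fn(x, t, None)
    v_cond = velocity_fn(x, t, cond)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

def toy_velocity(x, t, cond):
    """Toy stand-in for a trained network: drift to the origin plus a conditional shift."""
    base = -x
    return base if cond is None else base + cond

x = np.array([1.0, 2.0])
cond = np.array([0.5, -0.5])  # stand-in for a dataset-specific hidden state
for scale in (0.0, 1.0, 2.0):
    print(scale, guided_velocity(toy_velocity, x, 0.3, cond, scale))
```

In the usual recipe a single network serves both branches because the condition is dropped at random during training; whether DP-FM follows that recipe exactly is not stated in this excerpt.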

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DP-FM introduces a constant-warping metric on a warped product manifold to decouple radial and angular flows in few-shot VLM adaptation, but the claim that this yields exact constant-speed angular geodesics under learned velocity fields needs tighter verification.


The paper's main move is to reframe flow matching for cross-modal alignment through a polar decomposition into radial and angular sub-manifolds, then build a warped product Riemannian structure that specializes to a direct product (cylindrical) manifold when the warping function is constant. From there they derive DP-FM, which runs independent radial evolution while keeping angular transport on constant-speed geodesics, and they add classifier-free guidance conditioned on the VLM's hidden states to put dataset-specific information back in. That geometric derivation and the explicit handling of the three listed limitations in prior FM work are the genuinely new pieces; the rest of the setup follows standard flow matching training but on this new manifold.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing flow matching methods for few-shot adaptation of vision-language models suffer from three geometric limitations (angular dynamics distortion from radial-angular coupling, radial dynamics neglect due to normalization, and loss of dataset-specific context). It proposes a Riemannian framework called warped product flow matching (WP-FM) on a warped product manifold; by introducing a constant-warping metric, it derives direct product flow matching (DP-FM) on a decoupled cylindrical direct product manifold that enables independent radial evolution and constant-speed angular geodesic transport. Classifier-free guidance is added by conditioning on pre-trained VLM hidden states. Experiments across 11 benchmarks are reported to show that DP-FM achieves new state-of-the-art performance for multi-step few-shot adaptation.

Significance. If the constant-warping metric indeed produces a manifold on which the learned flow matching velocity field preserves exact constant-speed angular geodesics without new truncation or fitting artifacts, and if the reported gains are attributable to this decoupling, the work would supply a principled geometric prior for continuous cross-modal alignment that addresses overlooked incompatibilities in pre-trained features. The polar decomposition analysis itself offers a reusable lens for diagnosing limitations in other flow-based adaptation methods.

major comments (2)

The central derivation (abstract and the section introducing DP-FM) asserts that a constant-warping metric on the warped product manifold yields a direct product manifold with exactly constant-speed angular geodesics that eliminate angular dynamics distortion. However, flow matching optimizes a velocity field by regression to a target vector field; the manuscript does not provide a proof or bound showing that the optimal learned field on this manifold remains geodesic (or that discretization error remains negligible) once the metric is fixed. This leaves open whether the claimed elimination of non-uniform angular speed survives training and sampling.
The experimental section reports new SOTA results across 11 benchmarks, but the abstract and available description provide no controls isolating the contribution of the constant-warping metric versus the added classifier-free guidance or other implementation choices. Without ablations that vary only the manifold metric while holding the velocity network and integrator fixed, it is difficult to confirm that performance gains stem from the geometric decoupling rather than ancillary factors.

minor comments (2)

The abstract would be strengthened by including the explicit definition of the constant-warping metric and the resulting metric tensor on the direct product manifold.
Notation for the radial and angular sub-manifolds should be introduced with a brief equation or diagram early in the geometric analysis section to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The two major comments highlight important aspects of the theoretical grounding and experimental validation of the constant-warping metric and its contribution to performance. We address each point below and commit to revisions that strengthen the manuscript without altering its core claims.


Referee: The central derivation (abstract and the section introducing DP-FM) asserts that a constant-warping metric on the warped product manifold yields a direct product manifold with exactly constant-speed angular geodesics that eliminate angular dynamics distortion. However, flow matching optimizes a velocity field by regression to a target vector field; the manuscript does not provide a proof or bound showing that the optimal learned field on this manifold remains geodesic (or that discretization error remains negligible) once the metric is fixed. This leaves open whether the claimed elimination of non-uniform angular speed survives training and sampling.

Authors: We agree that a formal guarantee would strengthen the presentation. By construction, the constant-warping metric defines the direct-product manifold such that the target vector field for flow matching corresponds exactly to constant-speed angular geodesics plus independent radial motion; the regression objective therefore learns a velocity field whose angular component is geodesic on that manifold. We will add a dedicated theoretical subsection that (i) recalls the metric-induced decomposition, (ii) shows that any velocity field obtained by exact regression to the target inherits the constant-speed property, and (iii) provides a first-order bound on the deviation introduced by finite discretization and network approximation error, supported by additional diagnostic plots of angular speed during sampling. revision: yes
Referee: The experimental section reports new SOTA results across 11 benchmarks, but the abstract and available description provide no controls isolating the contribution of the constant-warping metric versus the added classifier-free guidance or other implementation choices. Without ablations that vary only the manifold metric while holding the velocity network and integrator fixed, it is difficult to confirm that performance gains stem from the geometric decoupling rather than ancillary factors.

Authors: We concur that isolating the geometric contribution is essential. In the revised version we will insert a new ablation table that compares three controlled settings while keeping the velocity network architecture, optimizer, number of steps, and classifier-free guidance identical: (1) standard FM on the original coupled manifold, (2) WP-FM with the learned warping function, and (3) DP-FM with the fixed constant-warping metric. This isolates the effect of the metric choice itself and will be reported on a representative subset of the 11 benchmarks together with the original results. revision: yes
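The angular-speed diagnostic mentioned in the first response can be prototyped without a trained model. The sketch below integrates a toy velocity field on the cylinder with first-order steps, using the sphere's exponential map for the angular part, and records the angle covered per step. The field is the exact geodesic field, not a learned network, so this only shows what the diagnostic measures in the ideal case; all function names and the toy endpoints are assumptions.

```python
import numpy as np

def sphere_log(u, target):
    """Tangent vector at u along the great circle to target, with length equal to the angle."""
    cos_theta = np.clip(np.dot(u, target), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros_like(u)
    return theta * (target - cos_theta * u) / np.sin(theta)

def sphere_exp(u, w):
    """Move from u along tangent vector w on the unit sphere."""
    norm = np.linalg.norm(w)
    if norm < 1e-12:
        return u.copy()
    return np.cos(norm) * u + np.sin(norm) * w / norm

def integrate(r0, u0, r1, u1, num_steps):
    """First-order stepping on the cylinder, driven by the exact geodesic field."""
    h = 1.0 / num_steps
    r, u = r0, u0.copy()
    angular_steps = []
    for k in range(num_steps):
        t = k * h
        v_r = (r1 - r) / (1.0 - t)         # radial speed toward r1
        w = sphere_log(u, u1) / (1.0 - t)  # angular velocity toward u1
        r = r + h * v_r                    # Euler step on the radial line
        u_next = sphere_exp(u, h * w)      # exponential-map step on the sphere
        angular_steps.append(np.arccos(np.clip(np.dot(u, u_next), -1.0, 1.0)))
        u = u_next
    return r, u, np.array(angular_steps)

r_end, u_end, steps = integrate(2.0, np.array([1.0, 0.0, 0.0]),
                                5.0, np.array([0.0, 1.0, 0.0]), num_steps=8)
print("final radius:", r_end)
print("angle per step:", np.round(steps, 4))
```

For this ideal field every step covers the same angle and the integrator lands on the target. Replacing the toy field with a trained network's output and plotting `steps` against time would show how far approximation error moves the learned flow from that ideal, which is the question the referee raises.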

Circularity Check

0 steps flagged

No circularity; derivation is self-contained from geometric analysis.


The abstract and provided text derive DP-FM by first analyzing existing FM methods via polar decomposition to identify three limitations, then proposing a warped product manifold with constant-warping metric to yield a decoupled cylindrical manifold enabling independent radial evolution and constant-speed angular geodesics. No equations, self-citations, or steps are exhibited that reduce any prediction or result to fitted inputs, self-definitions, or prior author work by construction. The central claims rest on manifold geometry and conditioning rather than tautological renaming or load-bearing self-references. The derivation chain is therefore independent and self-contained against external geometric priors.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on Riemannian manifold assumptions for feature geometry and the introduction of new manifold constructs without external validation beyond performance claims.

axioms (2)

domain assumption · Pre-trained cross-modal features admit a polar decomposition into radial and angular sub-manifolds.
Invoked to identify the three limitations in existing FM methods.
ad hoc to paper · A constant-warping metric yields a decoupled cylindrical direct product manifold.
Introduced to enable independent radial and angular evolution.

invented entities (2)

warped product manifold · no independent evidence
purpose: Reformulate cross-modal alignment to resolve geometric constraints.
New framework proposed to address angular distortion and radial neglect.
direct product manifold (cylindrical) · no independent evidence
purpose: Enable decoupled radial evolution and constant-speed angular transport.
Derived via constant-warping metric for DP-FM.


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. NeurIPS, 35:23716–23736, 2022.
[2] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742. PMLR, 2023.
[3] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. NeurIPS, 36:34892–34916, 2023.
[4] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
[5] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900. PMLR, 2022.
[6] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.
[7] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022.
[8] Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.
[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[10] Ziqi Jiang, Yanghao Wang, and Long Chen. Exploring cross-modal flows for few-shot learning. In ICLR, 2026.
[11] Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li, Jun Xiao, Kwang-Ting Cheng, and Long Chen. Path-decoupled hyperbolic flow matching for few-shot adaptation. arXiv preprint, 2026.
[12] Minsoo Jo, Dongyoon Yang, and Taesup Kim. Angular gradient sign method: Uncovering vulnerabilities in hyperbolic networks. In AAAI, volume 40, pages 5566–5574, 2026.
[13] Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In ICML, pages 7694–7731. PMLR, 2023.
[14] Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, and Andrew Beng Jin Teoh. Understanding the feature norm for out-of-distribution detection. In ICCV, pages 1557–1567, 2023.
[15] Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. arXiv preprint, 2023.
[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022.
[17] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, pages 18082–18091, 2022.
[18] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.
[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021.
[20] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint, 2024.
[21] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, pages 2790–2799. PMLR, 2019.
[22] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. IJCV, 132(2):581–595, 2024.
[23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint, 2022.
[24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint, 2022.
[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[26] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, pages 23–40. Springer, 2024.
[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.
[28] Hongxu Chen, Hongxiang Li, Zhen Wang, and Long Chen. Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint, 2026.
[29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
[30] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint, 2020.
[31] Barrett O’Neill. Semi-Riemannian geometry with applications to relativity, volume 103. Academic Press, 1983.
[32] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[33] Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint, 2023.
[34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, pages 13213–13232. PMLR, 2023.
[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint, 2013.
[36] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. J-STARS, 12(7):2217–2226, 2019.
[37] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.
[38] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.
[39] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV, pages 554–561, 2013.
[40] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012.
[41] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint, 2012.
[42] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008.
[43] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR, pages 178–178. IEEE, 2004.
[44] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014.
[45] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
[46] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, pages 493–510. Springer, 2022.
[47] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models. arXiv preprint, 2022.
[48] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.
[49] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In ICCV, pages 15659–15669, 2023.
[50] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. NeurIPS, 37:56424–56445, 2024.
[51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017.

A Angular Truncation Error of WP-FM

In this section, we provide a detailed derivation of the angular truncation error for first-order Riemannian ODE solvers under the warped product geometry introduced in Sec. 3. Proposition 1 (Angular Truncation Error). Let ...

[1] [1]

Flamingo: a visual language model for few-shot learning.NeurIPS, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.NeurIPS, 35:23716–23736, 2022

work page 2022

[2] [2]

Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language models. InICML, pages 19730–19742. PMLR, 2023

work page 2023

[3] [3]

Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.NeurIPS, 36:34892–34916, 2023

work page 2023

[4] [4]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763. PMLR, 2021

work page 2021

[5] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900. PMLR, 2022.

[6] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022.

[7] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022.

[8] Maxime Zanella and Ismail Ben Ayed. Low-rank few-shot adaptation of vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1593–1603, 2024.

[9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.

[10] Ziqi Jiang, Yanghao Wang, and Long Chen. Exploring cross-modal flows for few-shot learning. In ICLR, 2026.

[11] Lin Li, Ziqi Jiang, Gefan Ye, Zhenqi He, Jiahui Li, Jun Xiao, Kwang-Ting Cheng, and Long Chen. Path-decoupled hyperbolic flow matching for few-shot adaptation. arXiv preprint, 2026.

[12] Minsoo Jo, Dongyoon Yang, and Taesup Kim. Angular gradient sign method: Uncovering vulnerabilities in hyperbolic networks. In AAAI, volume 40, pages 5566–5574, 2026.

[13] Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Shanmukha Ramakrishna Vedantam. Hyperbolic image-text representations. In ICML, pages 7694–7731. PMLR, 2023.

[14] Jaewoo Park, Jacky Chen Long Chai, Jaeho Yoon, and Andrew Beng Jin Teoh. Understanding the feature norm for out-of-distribution detection. In ICCV, pages 1557–1567, 2023.

[15] Ricky TQ Chen and Yaron Lipman. Flow matching on general geometries. arXiv preprint, 2023.

[16] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint, 2022.

[17] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, pages 18082–18091, 2022.

[18] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, pages 11975–11986, 2023.

[19] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021.

[20] Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, and Sai Qian Zhang. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv preprint, 2024.

[21] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, pages 2790–2799. PMLR, 2019.

[22] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. IJCV, 132(2):581–595, 2024.

[23] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint, 2022.

[24] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint, 2022.

[25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

[26] Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, pages 23–40. Springer, 2024.

[27] William Peebles and Saining Xie. Scalable diffusion models with transformers. In ICCV, pages 4195–4205, 2023.

[28] Hongxu Chen, Hongxiang Li, Zhen Wang, and Long Chen. Bi-anchor interpolation solver for accelerating generative modeling. arXiv preprint, 2026.

[29] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

[30] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint, 2020.

[31] Barrett O’Neill. Semi-Riemannian geometry with applications to relativity, volume 103. Academic Press, 1983.

[32] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.

[33] Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, and Jie Tang. Relay diffusion: Unifying diffusion process across resolutions for image synthesis. arXiv preprint, 2023.

[34] Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In ICML, pages 13213–13232. PMLR, 2023.

[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint, 2013.

[36] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. J-STARS, 12(7):2217–2226, 2019.

[37] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014.

[38] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, pages 3485–3492. IEEE, 2010.

[39] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV, pages 554–561, 2013.

[40] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012.

[41] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint, 2012.

[42] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008.

[43] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR, pages 178–178. IEEE, 2004.

[44] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014.

[45] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.

[46] Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, and Hongsheng Li. Tip-adapter: Training-free adaption of clip for few-shot classification. In ECCV, pages 493–510. Springer, 2022.

[47] Guangyi Chen, Weiran Yao, Xiangchen Song, Xinyue Li, Yongming Rao, and Kun Zhang. Plot: Prompt learning with optimal transport for vision-language models. arXiv preprint, 2022.

[48] Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In CVPR, pages 6757–6767, 2023.

[49] Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In ICCV, pages 15659–15669, 2023.

[50] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. NeurIPS, 37:56424–56445, 2024.

[51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint, 2017.