
arxiv: 2510.14543 · v4 · submitted 2025-10-16 · 💻 cs.CV

Exploring Cross-Modal Flows for Few-Shot Learning

Pith reviewed 2026-05-18 06:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords few-shot learning · cross-modal alignment · flow matching · parameter-efficient fine-tuning · vision-language models · feature alignment · multi-step adjustment

The pith

Multi-step flow matching aligns image and text features more precisely than one-step fine-tuning in few-shot cross-modal tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that all current parameter-efficient fine-tuning (PEFT) methods for vision-language models adjust features in a single step, and that one step is inadequate on complex datasets where visual and textual representations are highly entangled. The authors introduce Flow Matching Alignment (FMA), which learns a velocity field that carries image features toward text features over repeated steps, using a fixed coupling to preserve category correspondence and noise augmentation to mitigate data scarcity. If correct, this would allow more accurate alignment without adding many parameters, with the largest benefit on challenging few-shot benchmarks.

Core claim

Existing PEFT methods always perform one-step adjustment, which is insufficient for complex datasets where features of different modalities are highly entangled. We propose Flow Matching Alignment (FMA) as the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field. With a fixed coupling strategy for category correspondence, noise augmentation to address data scarcity, and an early-stopping solver, FMA achieves more precise and robust alignment and yields significant performance gains across benchmarks and backbones.
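Two of the three ingredients named above, fixed coupling and noise augmentation, concern how training pairs are built. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `couple_and_augment`, the parameters `noise_std` and `copies`, and the choice of Gaussian perturbation of the image feature are all hypothetical, since this page does not say how the paper injects noise.

```python
import random

def couple_and_augment(image_feats, labels, text_feats,
                       noise_std=0.05, copies=4, seed=0):
    """Build (source, target) pairs for training a cross-modal velocity field.

    Fixed coupling: each image feature is paired with the text feature of its
    own class, never with a randomly drawn target.
    Noise augmentation (assumed form): every image feature is replicated
    `copies` times with Gaussian noise to stretch a few-shot training set.
    """
    rng = random.Random(seed)
    pairs = []
    for feat, label in zip(image_feats, labels):
        target = text_feats[label]  # own-class text feature
        for _ in range(copies):
            noisy = [v + rng.gauss(0.0, noise_std) for v in feat]
            pairs.append((noisy, target))
    return pairs
```

With two labelled image features and `copies=4`, the sketch returns eight pairs, and every pair's target is the text feature of the source's own class.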

What carries the argument

The cross-modal velocity field learned through flow matching, which models gradual multi-step transformations between image and text features.
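For orientation, the standard flow-matching formulation from the literature can be written as below. This is a sketch of the usual linear-interpolant form, not a transcription of the paper's equations; treating x_0 as an image feature and x_1 as the text feature of the same class, and the exact form of the loss, are assumptions here.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1] \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,(x_0, x_1)}
  \left\| u_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \\
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= u_\theta(x_t, t), \qquad
x_{k+1} = x_k + \Delta t \, u_\theta(x_k, t_k)
\end{align}
```

The first line defines the straight path between a coupled pair, the second regresses the network onto that path's constant velocity, and the third is the ODE together with its Euler discretization. Taking several Euler steps with a small Δt is what "multi-step adjustment" amounts to in this formulation.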

Load-bearing premise

That the primary reason existing PEFT methods underperform on complex datasets is their restriction to a single adjustment step.

What would settle it

A controlled test where a standard one-step PEFT method is extended with repeated or iterative updates and then compared directly against FMA on the same entangled cross-modal few-shot datasets.

Figures

Figures reproduced from arXiv: 2510.14543 by Long Chen, Yanghao Wang, Ziqi Jiang.

Figure 1: Comparisons of the cross-modal alignment process of different methods. (a) The overview pipeline of CLIP (Radford et al., 2021) for zero-shot cross-modal alignment and classification. Some image features and their corresponding text features are not well-aligned. (b-d) The alignment process of three typical types of state-of-the-art PEFT approaches, which adjust image or text features in one single step. T…
Figure 2: Performance of CoOp and linear probing. We experimented with the 16-shot setting and chose CLIP RN50 as the backbone. It is widely acknowledged that PEFT typically outperforms linear probing. However, we observed that their advantages compared with LP are not consistent across different datasets. To better illustrate this, we introduce the concept of dataset difficulty: A dataset with lower CLIP zero-sho…
Figure 3: (a), it may transform image features from one class to text features of another class, resulting in wrong classification. Therefore, the first challenge is to figure out how to train a velocity field, which can transfer image features from source distribution, not only to the target distribution (near text features), but also close to their right class embeddings. Secondly, in the inference stage, the goal…
Figure 4: Overview of Flow Matching Alignment (FMA). The main idea of the training stage is to learn a velocity field, which can transform image features to corresponding text features. Two designs are proposed: coupling enhancement and noise augmentation. During the inference, FMA applies an early-stopping solver that can output intermediate features for classification. CLIP. Specifically, CLIP contains an image en…
Figure 5: (a) Red line: Average distance to target figure at different timesteps. Black line: Accuracy using features at different timesteps for classification. (b) At different timesteps, the distance between the intermediate features and the correct/incorrect text feature. After learning a velocity field u_t^θ(x_t), we use it to transform any image feature x_0 for classification. Vanilla flow matching inference pr…
Figure 6: (a) Performance of different CLIP backbones on difficult datasets. (b) Ablation on different inference strategies. (c) Ablation on noise augmentation strategy. The average performance on 11 datasets is reported. across all approaches, improvements on difficult datasets consistently surpass those on easy datasets. This further supports our conclusion that for difficult datasets with complex cross-modal dist…
Original abstract

Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. It is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
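The abstract takes for granted how an aligned image feature becomes a class prediction. In CLIP-style models this is nearest-text-feature classification under cosine similarity, which can be sketched as follows. The helper names `cosine` and `classify` are illustrative and are not taken from the paper or from any CLIP library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(feature, text_feats):
    """Return the index of the class whose text feature is most similar."""
    scores = [cosine(feature, t) for t in text_feats]
    return max(range(len(scores)), key=scores.__getitem__)
```

Any feature-adjustment method, one-step or multi-step, changes only the `feature` (or the `text_feats`) passed to this rule; the decision rule itself stays the same.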

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that existing parameter-efficient fine-tuning (PEFT) methods for cross-modal alignment in vision-language models are limited to one-step adjustments, which is insufficient for complex datasets with highly entangled features. It proposes Flow Matching Alignment (FMA), the first model-agnostic multi-step approach that learns a cross-modal velocity field via flow matching. The method incorporates a fixed coupling strategy to preserve category correspondence, a noise augmentation strategy to address data scarcity in few-shot settings, and an early-stopping solver for efficiency and accuracy. The authors report that FMA yields consistent and significant performance gains over one-step PEFT baselines across benchmarks and backbones, especially on challenging datasets.

Significance. If the performance gains can be rigorously attributed to the multi-step rectification property of the learned velocity field (rather than the auxiliary coupling, augmentation, or solver choices), the work would introduce a new paradigm for iterative cross-modal alignment in few-shot learning. This could improve robustness on entangled feature distributions and extend to other vision-language tasks, provided the method remains model-agnostic and computationally practical.
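Attributing gains to the multi-step property presupposes a solver in which the step count is a controllable knob. A minimal sketch of such a solver, assuming plain Euler integration, is shown below. The function `euler_integrate`, its `stop_after` parameter, and the toy velocity field are hypothetical illustrations, not the paper's early-stopping solver.

```python
def euler_integrate(velocity, x0, num_steps, stop_after=None):
    """Integrate dx/dt = velocity(x, t) from t = 0 toward t = 1 with Euler steps.

    num_steps=1 gives a single-step control; stop_after < num_steps halts the
    trajectory early and returns the intermediate feature.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be >= 1")
    steps = num_steps if stop_after is None else min(stop_after, num_steps)
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy field for illustration only: it moves every point along a straight
# line so that it reaches `target` at t = 1.
target = [1.0, 0.0]

def toy_velocity(x, t):
    return [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
```

For a perfectly straight field like this toy one, one Euler step and ten Euler steps land on the same endpoint. The step count can only matter when the learned field is curved, which is why a one-step versus multi-step comparison of the same trained field is an informative control.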

major comments (2)
  1. [§3] §3 (Method): The central claim that multi-step rectification via the velocity field provides more precise alignment than one-step PEFT rests on the introduction of fixed coupling, noise augmentation, and early-stopping solver, yet no ablation isolates the iterative flow-matching component. A single-step application of the same velocity field (or multi-step application of a baseline PEFT method) is required to establish that the gains arise from the multi-step property rather than the new auxiliary strategies.
  2. [§4] §4 (Experiments): Results claim consistent superiority on challenging datasets, but without controls that apply the velocity field in one step or ablate the coupling/noise/early-stopping elements individually, attribution of improvements to multi-step rectification remains unestablished. This directly affects the load-bearing assertion that one-step limitation is the primary reason for underperformance on entangled features.
minor comments (2)
  1. [Abstract] The abstract and introduction repeatedly state that FMA is 'the first' multi-step approach; a brief related-work discussion of prior flow-matching or iterative alignment techniques in cross-modal settings would clarify novelty.
  2. [§3] Notation for the velocity field and coupling is introduced without an explicit equation reference in the method overview; adding a numbered equation for the flow-matching objective would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We appreciate the emphasis on rigorously attributing performance gains to the multi-step rectification property of the learned velocity field. We agree that targeted ablations are necessary to strengthen this attribution and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The central claim that multi-step rectification via the velocity field provides more precise alignment than one-step PEFT rests on the introduction of fixed coupling, noise augmentation, and early-stopping solver, yet no ablation isolates the iterative flow-matching component. A single-step application of the same velocity field (or multi-step application of a baseline PEFT method) is required to establish that the gains arise from the multi-step property rather than the new auxiliary strategies.

    Authors: We agree that the current manuscript does not include an explicit ablation isolating the multi-step aspect of the velocity field from the auxiliary components. In the revised manuscript, we will add a single-step application of the learned velocity field (by setting the number of integration steps to 1) and compare it directly against the multi-step version. We will also report results from applying a baseline one-step PEFT method in a multi-step manner where feasible. These additions will clarify that the iterative rectification enabled by flow matching is the primary driver of improved alignment on entangled features, while the fixed coupling, noise augmentation, and early-stopping solver serve as enabling techniques for the few-shot cross-modal setting. revision: yes

  2. Referee: [§4] §4 (Experiments): Results claim consistent superiority on challenging datasets, but without controls that apply the velocity field in one step or ablate the coupling/noise/early-stopping elements individually, attribution of improvements to multi-step rectification remains unestablished. This directly affects the load-bearing assertion that one-step limitation is the primary reason for underperformance on entangled features.

    Authors: We acknowledge that the experimental section would benefit from additional controls to isolate the contribution of multi-step rectification. In revision, we will include one-step velocity field evaluations across the reported benchmarks and backbones, along with individual ablations of the fixed coupling strategy, noise augmentation, and early-stopping solver. These results will be presented in a new table or figure to demonstrate that the performance gains, particularly on challenging datasets with highly entangled features, are attributable to the multi-step property rather than the auxiliary design choices alone. We believe this will directly support the central claim regarding the limitations of existing one-step PEFT methods. revision: yes
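The controls promised in the two responses can be laid out as a grid of configurations. The sketch below enumerates one; the component names, the step counts `(1, 10)`, and the function `ablation_grid` are assumptions for illustration, not settings reported by the authors.

```python
from itertools import product

COMPONENTS = ("fixed_coupling", "noise_augmentation", "early_stopping")

def ablation_grid(step_counts=(1, 10)):
    """Enumerate on/off combinations of the components for each step count."""
    configs = []
    for steps in step_counts:
        for flags in product((False, True), repeat=len(COMPONENTS)):
            cfg = dict(zip(COMPONENTS, flags))
            if steps == 1 and cfg["early_stopping"]:
                continue  # nothing to stop early in a one-step run
            cfg["num_steps"] = steps
            configs.append(cfg)
    return configs
```

With the default step counts this yields 12 runs: 4 one-step configurations and 8 multi-step ones. Comparing rows that differ only in `num_steps` isolates the multi-step effect, while rows that differ in a single flag isolate each auxiliary component.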

Circularity Check

0 steps flagged

No significant circularity in FMA derivation chain


The paper introduces an original Flow Matching Alignment (FMA) method built around a learned cross-modal velocity field, fixed coupling, noise augmentation, and early-stopping solver. These elements are presented as novel contributions rather than reductions of prior fitted parameters or self-citations. The central claim of multi-step rectification superiority is framed as an empirical extension of one-step PEFT limitations, without any load-bearing step that equates the output to the input by construction or via author-overlapping citations. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are detailed in the abstract; the velocity field and strategies are presented as novel algorithmic choices rather than derived from prior constants or postulates.


