Exploring Cross-Modal Flows for Few-Shot Learning
Pith reviewed 2026-05-18 06:13 UTC · model grok-4.3
The pith
Multi-step flow matching aligns image and text features more precisely than one-step fine-tuning in few-shot cross-modal tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing PEFT methods always perform one-step adjustment, which is insufficient for complex datasets where features of different modalities are highly entangled. We propose Flow Matching Alignment (FMA) as the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field. With a fixed coupling strategy for category correspondence, noise augmentation to address data scarcity, and an early-stopping solver, FMA achieves more precise and robust alignment and yields significant performance gains across benchmarks and backbones.
What carries the argument
The cross-modal velocity field learned through flow matching, which models gradual multi-step transformations between image and text features.
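For orientation, the standard flow-matching objective with a linear interpolation path can be written as below. This is a sketch of the generic formulation, not the paper's exact objective, which is not reproduced on this page. The symbols are assumptions: x_0 stands for an image feature, x_1 for the text feature of the same category, and v_\theta for the learned velocity field.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \sim \mathcal{U}[0, 1] \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,(x_0, x_1)}
  \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \\
\frac{\mathrm{d}x}{\mathrm{d}t} &= v_\theta(x, t), \qquad x(0) = x_0
\end{align}
```

Under this reading, a multi-step adjustment corresponds to integrating the last equation numerically over several small steps instead of applying one correction.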
Load-bearing premise
That the primary reason existing PEFT methods underperform on complex datasets is their restriction to a single adjustment step.
What would settle it
A controlled test where a standard one-step PEFT method is extended with repeated or iterative updates and then compared directly against FMA on the same entangled cross-modal few-shot datasets.
Original abstract
Aligning features from different modalities is one of the most fundamental challenges for cross-modal tasks. Although pre-trained vision-language models can achieve a general alignment between image and text, they often require parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively fine-tune a subset of parameters, which can slightly adjust either visual or textual features, and avoid overfitting. In this paper, we are the first to highlight that all existing PEFT methods perform one-step adjustment. This is insufficient for complex (or difficult) datasets, where features of different modalities are highly entangled. To this end, we propose the first model-agnostic multi-step adjustment approach by learning a cross-modal velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the correspondence between categories during training, we first utilize a fixed coupling strategy. Then, we propose a noise augmentation strategy to alleviate the data scarcity issue. Finally, we design an early-stopping solver, which terminates the transformation process earlier, improving both efficiency and accuracy. Compared with one-step PEFT methods, FMA has the multi-step rectification ability to achieve more precise and robust alignment. Extensive results have demonstrated that FMA can consistently yield significant performance gains across various benchmarks and backbones, particularly on challenging datasets.
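The training ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is one plausible reading, not the authors' implementation. The function name `make_fm_batch`, the image-to-text direction, the Gaussian form of the noise augmentation, and the linear interpolation path are all assumptions made for illustration.

```python
import numpy as np

def make_fm_batch(img_feats, labels, text_feats, noise_std=0.1, rng=None):
    """Build one flow-matching training batch (hypothetical helper).

    img_feats:  (N, D) image features, used as the source x0
    labels:     (N,)   integer class index of each image
    text_feats: (C, D) one text feature per class, used as the target x1
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noise augmentation (assumed form): jitter the scarce source features.
    x0 = img_feats + noise_std * rng.standard_normal(img_feats.shape)
    # Fixed coupling: each image is paired with its own class's text feature.
    x1 = text_feats[labels]
    # Sample a time per example and interpolate linearly between x0 and x1.
    t = rng.uniform(size=(img_feats.shape[0], 1))
    x_t = (1.0 - t) * x0 + t * x1
    # Regression target for the velocity field along a straight path.
    target_velocity = x1 - x0
    return x_t, t, target_velocity

# Toy data: 8 images, 3 classes, 4-dimensional features.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 4))
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
text = rng.standard_normal((3, 4))
x_t, t, v = make_fm_batch(img, labels, text, noise_std=0.0, rng=rng)
```

A velocity network would then be trained to regress `v` from `(x_t, t)`. With `noise_std=0.0` the augmentation is switched off, which makes the pairing easy to inspect.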
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing parameter-efficient fine-tuning (PEFT) methods for cross-modal alignment in vision-language models are limited to one-step adjustments, which is insufficient for complex datasets with highly entangled features. It proposes Flow Matching Alignment (FMA), the first model-agnostic multi-step approach that learns a cross-modal velocity field via flow matching. The method incorporates a fixed coupling strategy to preserve category correspondence, a noise augmentation strategy to address data scarcity in few-shot settings, and an early-stopping solver for efficiency and accuracy. The authors report that FMA yields consistent and significant performance gains over one-step PEFT baselines across benchmarks and backbones, especially on challenging datasets.
Significance. If the performance gains can be rigorously attributed to the multi-step rectification property of the learned velocity field (rather than the auxiliary coupling, augmentation, or solver choices), the work would introduce a new paradigm for iterative cross-modal alignment in few-shot learning. This could improve robustness on entangled feature distributions and extend to other vision-language tasks, provided the method remains model-agnostic and computationally practical.
major comments (2)
- [§3] §3 (Method): The central claim that multi-step rectification via the velocity field provides more precise alignment than one-step PEFT rests on the introduction of fixed coupling, noise augmentation, and early-stopping solver, yet no ablation isolates the iterative flow-matching component. A single-step application of the same velocity field (or multi-step application of a baseline PEFT method) is required to establish that the gains arise from the multi-step property rather than the new auxiliary strategies.
- [§4] §4 (Experiments): Results claim consistent superiority on challenging datasets, but without controls that apply the velocity field in one step or ablate the coupling/noise/early-stopping elements individually, attribution of improvements to multi-step rectification remains unestablished. This directly affects the load-bearing assertion that one-step limitation is the primary reason for underperformance on entangled features.
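The control requested in the major comments, applying the same velocity field in one step versus many, can be sketched with a fixed-step Euler solver. The rotation field below is a hypothetical toy chosen because its exact solution is known; it is not the paper's learned field, and `euler_integrate` and its `t_end` parameter (a stand-in for stopping the transformation early) are illustrative names.

```python
import numpy as np

def euler_integrate(velocity, x0, n_steps, t_end=1.0):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=t_end with Euler steps."""
    x = np.asarray(x0, dtype=float).copy()
    dt = t_end / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

# Hypothetical toy field: rotation by THETA radians over t in [0, 1].
THETA = np.pi / 2
A = np.array([[0.0, -THETA], [THETA, 0.0]])

def toy_velocity(x, t):
    # Rows of x are points; each point p moves with velocity A @ p.
    return x @ A.T

x0 = np.array([[1.0, 0.0]])
exact = np.array([[np.cos(THETA), np.sin(THETA)]])  # true endpoint at t=1

one_step = euler_integrate(toy_velocity, x0, n_steps=1)
multi_step = euler_integrate(toy_velocity, x0, n_steps=50)
err_one = np.linalg.norm(one_step - exact)
err_multi = np.linalg.norm(multi_step - exact)

# Stopping early: integrate only up to t_end=0.5 with the same step size.
early = euler_integrate(toy_velocity, x0, n_steps=25, t_end=0.5)
```

On this curved toy trajectory the single Euler step lands far from the true endpoint while 50 steps land close to it. Whether the same gap appears for a learned cross-modal field is exactly what the requested ablation would measure.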
minor comments (2)
- [Abstract] The abstract and introduction repeatedly state that FMA is 'the first' multi-step approach; a brief related-work discussion of prior flow-matching or iterative alignment techniques in cross-modal settings would clarify novelty.
- [§3] Notation for the velocity field and coupling is introduced without an explicit equation reference in the method overview; adding a numbered equation for the flow-matching objective would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We appreciate the emphasis on rigorously attributing performance gains to the multi-step rectification property of the learned velocity field. We agree that targeted ablations are necessary to strengthen this attribution and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§3] §3 (Method): The central claim that multi-step rectification via the velocity field provides more precise alignment than one-step PEFT rests on the introduction of fixed coupling, noise augmentation, and early-stopping solver, yet no ablation isolates the iterative flow-matching component. A single-step application of the same velocity field (or multi-step application of a baseline PEFT method) is required to establish that the gains arise from the multi-step property rather than the new auxiliary strategies.
  Authors: We agree that the current manuscript does not include an explicit ablation isolating the multi-step aspect of the velocity field from the auxiliary components. In the revised manuscript, we will add a single-step application of the learned velocity field (by setting the number of integration steps to 1) and compare it directly against the multi-step version. We will also report results from applying a baseline one-step PEFT method in a multi-step manner where feasible. These additions will clarify that the iterative rectification enabled by flow matching is the primary driver of improved alignment on entangled features, while the fixed coupling, noise augmentation, and early-stopping solver serve as enabling techniques for the few-shot cross-modal setting.
  Revision: yes
- Referee: [§4] §4 (Experiments): Results claim consistent superiority on challenging datasets, but without controls that apply the velocity field in one step or ablate the coupling/noise/early-stopping elements individually, attribution of improvements to multi-step rectification remains unestablished. This directly affects the load-bearing assertion that one-step limitation is the primary reason for underperformance on entangled features.
  Authors: We acknowledge that the experimental section would benefit from additional controls to isolate the contribution of multi-step rectification. In revision, we will include one-step velocity field evaluations across the reported benchmarks and backbones, along with individual ablations of the fixed coupling strategy, noise augmentation, and early-stopping solver. These results will be presented in a new table or figure to demonstrate that the performance gains, particularly on challenging datasets with highly entangled features, are attributable to the multi-step property rather than the auxiliary design choices alone. We believe this will directly support the central claim regarding the limitations of existing one-step PEFT methods.
  Revision: yes
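The individual ablations promised in the responses amount to a small grid of configurations. The sketch below enumerates such a grid; the component names, the step counts, and the `ablation_grid` helper are hypothetical and only illustrate how the controls could be organized.

```python
from itertools import product

# Auxiliary components named in the report, toggled independently.
COMPONENTS = ("fixed_coupling", "noise_augmentation", "early_stopping")
# Hypothetical step counts: 1 is the one-step control, 10 a multi-step setting.
STEP_COUNTS = (1, 10)

def ablation_grid():
    """Return every on/off combination of the components for each step count."""
    configs = []
    for n_steps in STEP_COUNTS:
        for flags in product((True, False), repeat=len(COMPONENTS)):
            cfg = dict(zip(COMPONENTS, flags))
            cfg["n_steps"] = n_steps
            configs.append(cfg)
    return configs

grid = ablation_grid()
```

Some cells may be degenerate in practice, for example early stopping combined with a single integration step, and could be dropped from the reported table.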
Circularity Check
No significant circularity in FMA derivation chain
The paper introduces an original Flow Matching Alignment (FMA) method built around a learned cross-modal velocity field, fixed coupling, noise augmentation, and early-stopping solver. These elements are presented as novel contributions rather than reductions of prior fitted parameters or self-citations. The central claim of multi-step rectification superiority is framed as an empirical extension of one-step PEFT limitations, without any load-bearing step that equates the output to the input by construction or via author-overlapping citations. The derivation remains self-contained against external benchmarks.