Generative Modeling of Class Probability for Multi-Modal Representation Learning

Bumsoo Kim; Eunwoo Kim; Jungkyoo Shin

arxiv: 2503.17417 · v2 · submitted 2025-03-21 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 22:36 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords multi-modal representation learning · class probability distributions · generative modeling · variational autoencoder · cross-modal alignment · class anchors · out-of-domain generalization

The pith

Encoding class anchors as prompts generates aligned class probability distributions that improve multi-modal alignment over contrastive methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generative approach to multi-modal representation learning that addresses modality discrepancies by producing class probability distributions for each input type. Class anchors are encoded as prompts to create these distributions, which are then aligned across modalities. A cross-modal probabilistic variational autoencoder is added to capture uncertainty during alignment. If the method works as described, it would yield stronger generalization, especially when test data comes from domains different from training data. Experiments on four benchmarks are presented to support the gains.

Core claim

The central claim is that conventional contrastive learning struggles with modality discrepancies, but encoding class anchors as prompts to generate and align class probability distributions for each modality produces more effective cross-modal representations; adding a cross-modal probabilistic variational autoencoder further models uncertainty to capture deeper modality relationships and data variations.
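To make the idea of "class probability distributions generated from class anchors" concrete, here is a minimal sketch. It is not the paper's formulation, which the abstract does not spell out. It assumes cosine similarity between a modality feature and each anchor embedding, a softmax with a hypothetical temperature of 0.1, and a symmetric KL divergence as the alignment term. All function names and the toy vectors are illustrative.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw similarity scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def class_distribution(feature, anchors, temperature=0.1):
    """Probability of each class anchor given one modality's feature.

    The temperature value is an assumption made for this sketch.
    """
    return softmax([cosine(feature, a) for a in anchors], temperature)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def alignment_loss(p_a, p_b):
    """Symmetric KL between two modalities' class distributions (assumed loss)."""
    return 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))

# Toy, hand-written embeddings (hypothetical; real ones would come from encoders).
anchors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
video_feature = [0.9, 0.1, 0.0]
text_feature = [0.8, 0.2, 0.1]

p_video = class_distribution(video_feature, anchors)
p_text = class_distribution(text_feature, anchors)
loss = alignment_loss(p_video, p_text)
```

Under these assumptions the two modalities are compared in a shared space of class probabilities instead of raw embedding space, which is the contrast with direct contrastive alignment that the claim turns on.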

What carries the argument

Class-anchor-ALigned generative Modeling (CALM), which turns class anchors into prompts that generate per-modality class probability distributions, together with a cross-modal probabilistic variational autoencoder that models alignment uncertainty.
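The variational component can be illustrated in the same spirit. The sketch below assumes a standard Gaussian VAE with the reparameterization trick, a closed-form KL term against a standard normal prior, and a decoder that reconstructs the other modality's class distribution. The direction (video to text), the linear encoder and decoder, the weight values, and the `beta` weighting are all assumptions for illustration; the source does not describe the paper's actual architecture or loss.

```python
import math
import random

def linear(weights, vector):
    """Matrix-vector product; each row of `weights` yields one output."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def cross_modal_vae_loss(p_source, p_target, enc_mu, enc_log_var, dec, rng, beta=1.0):
    """Encode one modality's class distribution, decode the other's."""
    mu = linear(enc_mu, p_source)
    log_var = linear(enc_log_var, p_source)
    z = reparameterize(mu, log_var, rng)
    p_reconstructed = softmax(linear(dec, z))
    recon = cross_entropy(p_target, p_reconstructed)
    kl = kl_to_standard_normal(mu, log_var)
    return recon + beta * kl, recon, kl

# Hypothetical toy weights: 3 classes, 2 latent dimensions.
enc_mu = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4]]
enc_log_var = [[-0.1, 0.0, 0.2], [0.1, -0.3, 0.0]]
dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

p_video = [0.7, 0.2, 0.1]
p_text = [0.6, 0.3, 0.1]
total, recon, kl = cross_modal_vae_loss(
    p_video, p_text, enc_mu, enc_log_var, dec, random.Random(0)
)
```

The sampled latent `z` is where uncertainty enters: the predicted variance controls how widely the reconstruction can vary for a single input.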

If this is right

Superior performance over state-of-the-art methods on four benchmark datasets.
Particularly strong gains in out-of-domain evaluations.
Improved capture of deeper relationships between modalities and data variations.
Better handling of modality discrepancies through probability distribution alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The probability-based alignment might prove useful in settings where one modality contains high noise or missing data.
Anchor choice could be studied further to see whether different selection strategies change the quality of generated distributions.
The uncertainty modeling component may transfer to other multi-modal tasks that require calibrated confidence estimates.
The generative framing could be extended to additional modalities such as video or sensor streams.

Load-bearing premise

Class probability distributions generated from class anchors will produce superior cross-modal alignment compared to contrastive baselines without introducing new misalignment modes.

What would settle it

An experiment in which the proposed method shows no statistically significant improvement over standard contrastive baselines on out-of-domain multi-modal benchmarks.
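One way such a check could be run is a paired permutation test on per-query retrieval outcomes. The sketch below assumes that each method yields one score per test query (for example a 0/1 Recall@1 hit) on the same queries. The function name, the resample count, and the toy scores are hypothetical; none of the numbers are results from the paper.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-query scores.

    Returns an estimated p-value for the null hypothesis that the two
    methods have the same mean score.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)

# Made-up per-query hits for illustration only.
method_hits = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
baseline_hits = [1, 0, 0, 1, 0, 1, 0, 1, 0, 1]
p_value = paired_permutation_test(method_hits, baseline_hits)
```

A large p-value on the out-of-domain benchmarks would be the outcome that counts against the claim. Pairing by query matters because both methods are evaluated on the same test items.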

Figures

Figures reproduced from arXiv: 2503.17417 by Bumsoo Kim, Eunwoo Kim, Jungkyoo Shin.

**Figure 2.** An overview of our framework. We employ class labels from an independent dataset, transform them into prompts, and …

**Figure 3.** Qualitative video retrieval results on the MSR-VTT dataset. Selected anchors capture distinct semantic cues, either aligning …

Original abstract

Multi-modal understanding plays a crucial role in artificial intelligence by enabling models to jointly interpret inputs from different modalities. However, conventional approaches such as contrastive learning often struggle with modality discrepancies, leading to potential misalignments. In this paper, we propose a novel class anchor alignment approach that leverages class probability distributions for multi-modal representation learning. Our method, Class-anchor-ALigned generative Modeling (CALM), encodes class anchors as prompts to generate and align class probability distributions for each modality, enabling more effective alignment. Furthermore, we introduce a cross-modal probabilistic variational autoencoder to model uncertainty in the alignment, enhancing the ability to capture deeper relationships between modalities and data variations. Extensive experiments on four benchmark datasets demonstrate that our approach significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations. This highlights its superior generalization capabilities in multi-modal representation learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

CALM frames multi-modal alignment around generating and matching class probability distributions from anchor prompts plus a cross-modal probabilistic VAE, but the provided abstract supplies no numbers, ablations, or implementation details to check whether it actually improves on contrastive baselines.

The paper's main move is to replace direct contrastive embedding alignment with per-modality class probability distributions that are generated from class-anchor prompts and then aligned, plus a cross-modal probabilistic VAE to capture uncertainty. That combination is presented as the novel piece on top of existing contrastive and generative ideas. It does address a real practical issue (modality gaps that standard contrastive losses can leave unhandled), and the OOD emphasis in the abstract is a reasonable place to look for gains if the distributions turn out better calibrated. The abstract claims clear wins on four benchmarks with stronger out-of-domain results, which would matter if the numbers hold.

The soft spots are exactly where the stress-test note flags them: nothing is said about how the anchors are selected or encoded, whether the generative step avoids prompt artifacts or label leakage, or why the VAE modeling prevents rather than adds misalignment modes such as overconfident wrong alignments. Without any quantitative results, baselines, error bars, or ablation evidence in the text we have, it is impossible to tell whether the central claim is supported or whether the method reduces to fitting on the same data used for evaluation.

This is the kind of incremental extension that could be useful to the multi-modal representation community if the experiments are solid and reproducible, but the current write-up leaves the key assumptions untested in the summary. I would bring it to a reading group to see the full experimental section and check the equations for circularity. It is worth sending to peer review so referees can examine the actual numbers and design choices rather than desk-rejecting on the abstract alone.

Referee Report

3 major / 2 minor

Summary. The paper proposes Class-anchor-ALigned generative Modeling (CALM) for multi-modal representation learning. Class anchors are encoded as prompts to generate per-modality class probability distributions that are aligned across modalities; a cross-modal probabilistic variational autoencoder is added to model uncertainty. The abstract asserts that the method significantly outperforms state-of-the-art approaches on four benchmark datasets, with particular gains in out-of-domain settings.

Significance. If the empirical superiority and the absence of new misalignment modes can be demonstrated with rigorous controls, the generative class-probability approach could offer an alternative to standard contrastive objectives for handling modality gaps. The probabilistic VAE component might additionally provide calibrated uncertainty estimates useful for downstream tasks. At present the significance cannot be assessed because the abstract supplies no quantitative results, baselines, or ablations.

major comments (3)

[Abstract] Abstract: the central claim that CALM 'significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations' is unsupported by any numbers, tables, baselines, error bars, or ablation evidence. Without these data the claim that the generative alignment is superior cannot be evaluated.
[Method] Method description (class-anchor encoding): the paper states that class anchors are 'encoded as prompts to generate and align class probability distributions' but provides no specification of anchor selection, prompt construction, or proof that the resulting distributions reflect true cross-modal class correspondence rather than prompt-induced artifacts. This choice is load-bearing for the superiority claim over contrastive baselines.
[Method] Method description (cross-modal probabilistic VAE): no derivation or argument is given showing that the VAE latent modeling prevents rather than amplifies misalignment modes (e.g., mode collapse or overconfident incorrect alignments) in probability space. The abstract's generalization claim rests on this unexamined assumption.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., accuracy delta on the strongest OOD benchmark) to allow readers to gauge the magnitude of improvement.
[Method] Notation for the generated class probability distributions and the VAE latent variables should be introduced explicitly with consistent symbols.
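
As an illustration of what explicit, consistent notation could look like, the block below writes out one possible set of symbols. Everything in it is an assumption made for this report and is not taken from the paper: $f^{m}$ is a feature from modality $m$ (video $v$ or text $t$), $a_k$ is the embedding of the $k$-th class anchor, $\tau$ is a temperature, $\mathbf{p}^{m}$ is the resulting class probability distribution, $z$ is the VAE latent, and $\beta$ weights the KL term. The video-to-text direction of the reconstruction is likewise only a placeholder.

```latex
% Hypothetical notation; symbols are illustrative, not the paper's.
\begin{align}
  \mathbf{p}^{m}_{k} &= \frac{\exp\!\left(\mathrm{sim}(f^{m}, a_k)/\tau\right)}
                             {\sum_{j=1}^{K} \exp\!\left(\mathrm{sim}(f^{m}, a_j)/\tau\right)},
  \qquad m \in \{v, t\},\quad k = 1, \dots, K, \\
  q_\phi(z \mid \mathbf{p}^{v}) &= \mathcal{N}\!\left(z;\, \mu_\phi(\mathbf{p}^{v}),\,
      \operatorname{diag}\,\sigma^{2}_\phi(\mathbf{p}^{v})\right), \\
  \mathcal{L} &= -\,\mathbb{E}_{q_\phi(z \mid \mathbf{p}^{v})}
      \!\left[\log p_\theta(\mathbf{p}^{t} \mid z)\right]
      + \beta\, \mathrm{KL}\!\left(q_\phi(z \mid \mathbf{p}^{v}) \,\middle\|\, \mathcal{N}(0, I)\right).
\end{align}
```

Fixing symbols of this kind once, and reusing them in every equation, would let a reader check whether the alignment and variational terms operate on the same distributions.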

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below. We agree that the abstract requires quantitative support and that the method descriptions need expanded specifications and justifications. Revisions will be made to incorporate these elements.

Referee: [Abstract] Abstract: the central claim that CALM 'significantly outperforms state-of-the-art methods, especially in out-of-domain evaluations' is unsupported by any numbers, tables, baselines, error bars, or ablation evidence. Without these data the claim that the generative alignment is superior cannot be evaluated.

Authors: We agree that the abstract lacks quantitative backing for its claims. In the revised version, we will incorporate specific performance metrics from the experiments on the four benchmark datasets, including accuracy improvements over baselines in out-of-domain settings, references to error bars, and key ablation results. This will enable direct evaluation of the superiority claim.

revision: yes

Referee: [Method] Method description (class-anchor encoding): the paper states that class anchors are 'encoded as prompts to generate and align class probability distributions' but provides no specification of anchor selection, prompt construction, or proof that the resulting distributions reflect true cross-modal class correspondence rather than prompt-induced artifacts. This choice is load-bearing for the superiority claim over contrastive baselines.

Authors: The manuscript provides a high-level description but lacks the requested level of detail on anchor selection and prompt construction. We will revise Section 3 to explicitly specify the anchor selection process (using dataset class labels), the prompt templates employed, and include an ablation study comparing different prompt formulations to demonstrate that alignments reflect semantic correspondence rather than artifacts.

revision: yes

Referee: [Method] Method description (cross-modal probabilistic VAE): no derivation or argument is given showing that the VAE latent modeling prevents rather than amplifies misalignment modes (e.g., mode collapse or overconfident incorrect alignments) in probability space. The abstract's generalization claim rests on this unexamined assumption.

Authors: We acknowledge the absence of a formal argument or derivation addressing potential misalignment amplification in the VAE component. We will add a dedicated discussion in the method section (with supporting ablations in the appendix) explaining how the cross-modal probabilistic modeling and latent regularization mitigate risks such as mode collapse, including empirical evidence from controlled experiments.

revision: yes

Circularity Check

0 steps flagged

No circularity: method and claims rest on experimental validation, not self-referential derivation

The provided abstract and description introduce CALM as a new generative alignment method using class-anchor prompts and a cross-modal VAE, with performance claims supported by experiments on four benchmarks. No equations, derivation steps, fitted parameters presented as predictions, or self-citations are visible that would reduce any central result to its own inputs by construction. The derivation chain is not self-contained in a mathematical sense that triggers the enumerated patterns; claims depend on external empirical outcomes rather than internal redefinition or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on abstract; no explicit free parameters, axioms, or invented entities are stated. The method implicitly relies on the existence of meaningful class anchors and the validity of probabilistic modeling for alignment, but these are not enumerated.

pith-pipeline@v0.9.0 · 5672 in / 1052 out tokens · 37999 ms · 2026-05-22T22:36:54.040677+00:00 · methodology

[1] [1]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac et al. Flamingo: a visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022. 3

work page 2022

[2] [2]

METEOR: An auto- matic metric for MT evaluation with improved correlation with human judgments

Satanjeev Banerjee and Alon Lavie. METEOR: An auto- matic metric for MT evaluation with improved correlation with human judgments. InProceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, June 2005. 5

work page 2005

[3] [3]

One transformer fits all distributions in multi- modal diffusion at scale

Fan Bao et al. One transformer fits all distributions in multi- modal diffusion at scale. In Proceedings of the 40th Inter- national Conference on Machine Learning, ICML’23, 2023. 3

work page 2023

[4] [4]

Collecting highly parallel data for paraphrase evaluation

David Chen and William Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Lin- guistics: Human Language Technologies , pages 190–200, June 2011. 2, 5

work page 2011

[5] [5]

Uatvr: Uncertainty-adaptive text-video re- trieval

Bo Fang et al. Uatvr: Uncertainty-adaptive text-video re- trieval. In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 13723–13733, 2023. 2, 3, 5, 6

work page 2023

[6] [6]

Learning semantic relationship among instances for image- text matching

Zheren Fu, Zhendong Mao, Yan Song, and Yongdong Zhang. Learning semantic relationship among instances for image- text matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15159– 15168, 2023. 1

work page 2023

[7] [7]

Multi-modal transformer for video retrieval

Vincent Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal transformer for video retrieval. In European Conference on Computer Vision, pages 214–229. Springer, 2020. 1

work page 2020

[8] [8]

Generative adversarial networks

Ian Goodfellow et al. Generative adversarial networks. Com- mun. ACM, 63(11):139–144, Oct. 2020. 3

work page 2020

[9] [9]

Mismatch quest: Visual and textual feed- back for image-text misalignment

Brian Gordon et al. Mismatch quest: Visual and textual feed- back for image-text misalignment. In18th European Confer- ence on Computer Vision, page 310–328, 2024. 1

work page 2024

[10] [10]

X-pool: Cross-modal language- video attention for text-video retrieval

Satya Krishna Gorti et al. X-pool: Cross-modal language- video attention for text-video retrieval. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5006–5015, 2022. 2, 3

work page 2022

[11] [11]

Text with Knowledge Graph Augmented Transformer for Video Captioning

Xin Gu, Guang Chen, Yufei Wang, Libo Zhang, Tiejian Luo, and Longyin Wen. Text with Knowledge Graph Augmented Transformer for Video Captioning . In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18941–18951, 2023. 5, 7

work page 2023

[12] [12]

Hollywood in homes: Crowd- sourcing data collection for activity understanding

Sigurdsson Gunnar A et al. Hollywood in homes: Crowd- sourcing data collection for activity understanding. In Euro- pean Conference on Computer Vision, 2016. 5

work page 2016

[13] [13]

Localizing moments in video with temporal language

Lisa Anne Hendricks et al. Localizing moments in video with temporal language. In Proceedings of the 2018 Confer- ence on Empirical Methods in Natural Language Processing, pages 1380–1390, Oct.-Nov. 2018. 2, 5

work page 2018

[14] [14]

VideoCLIP: Contrastive pre-training for zero- shot video-text understanding

Xu Hu et al. VideoCLIP: Contrastive pre-training for zero- shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Pro- cessing, pages 6787–6800, Nov. 2021. 2, 3

work page 2021

[15] [15]

Diffdis: Empowering generative dif- fusion model with cross-modal discrimination capability

Runhui Huang et al. Diffdis: Empowering generative dif- fusion model with cross-modal discrimination capability. In 2023 IEEE/CVF International Conference on Computer Vi- sion (ICCV), pages 15667–15677, 2023. 3

work page 2023

[16] [16]

Machine vision therapy: Multi- modal large language models can enhance visual robustness via denoising in-context learning

Zhuo Huang, Chang Liu, Yinpeng Dong, Hang Su, Shibao Zheng, and Tongliang Liu. Machine vision therapy: Multi- modal large language models can enhance visual robustness via denoising in-context learning. InForty-first International Conference on Machine Learning, 2024. 3

work page 2024

[17] [17]

Understanding and constructing latent modality structures in multi-modal representation learning

Qian Jiang et al. Understanding and constructing latent modality structures in multi-modal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7661–7671, 2023. 2

work page 2023

[18] [18]

Expectation-maximization contrastive learning for compact video-and-language representations

Peng Jin et al. Expectation-maximization contrastive learning for compact video-and-language representations. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022. 2, 3, 5, 6

work page 2022

[19] [19]

Diffusionret: Generative text-video retrieval with diffusion model

Peng Jin et al. Diffusionret: Generative text-video retrieval with diffusion model. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2470–2481, Oct

work page 2023

[20] [20]

Auto-encoding variational bayes

Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, 2014. 3, 4

work page 2014

[21] [21]

Image-text embedding learning via visual and textual semantic reasoning

Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Image-text embedding learning via visual and textual semantic reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):641–656, 2023. 1

work page 2023

[22] [22]

A universal model for cross modality mapping by relational reasoning

Zun Li et al. A universal model for cross modality mapping by relational reasoning. CoRR, abs/2102.13360, 2021. 2

work page arXiv 2021

[23] [23]

Gmmseg: Gaussian mixture based generative semantic segmentation models

Chen Liang, Wenguan Wang, Jiaxu Miao, and Yi Yang. Gmmseg: Gaussian mixture based generative semantic segmentation models. Advances in Neural Information Processing Systems, 35:31360–31375, 2022. 2

work page 2022

[24] [24]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics, 2004. 5

work page 2004

[25] [25]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014. 7

work page 2014

[26] [26]

Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing

Pengfei Liu et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9), Jan. 2023. 3

work page 2023

[27] [27]

Ts2-net: Token shift and selection transformer for text-video retrieval

Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, and Qin Jin. Ts2-net: Token shift and selection transformer for text-video retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), 2022. 2, 3

work page 2022

[28] [28]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017. 5

work page 2017

[29] [29]

Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning

Huaishao Luo et al. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing, 508:293–304, 2022. 1, 2, 3, 5, 6

work page 2022

[30] [30]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318,

work page

[31] [31]

Improving video captioning with temporal composition of a visual-syntactic embedding

Jesus Perez-Martin, Benjamin Bustos, and Jorge Pérez. Improving video captioning with temporal composition of a visual-syntactic embedding. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3039–3049, 2021. 5, 7

work page 2021

[32] [32]

Learning transferable visual models from natural language supervision

Alec Radford et al. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, pages 8748–8763, 18–24 Jul 2021. 1, 2, 5

work page 2021

[33] [33]

Movie description

Anna Rohrbach et al. Movie description. International Journal of Computer Vision, 123:94–120, 2017. 2, 5

work page 2017

[34] [34]

Accurate and Fast Compressed Video Captioning

Yaojie Shen et al. Accurate and Fast Compressed Video Captioning. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15512–15521, Oct. 2023. 5, 7

work page 2023

[35] [35]

Clip4caption: Clip for video caption

Mingkang Tang et al. Clip4caption: Clip for video caption. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4858–4862, 2021. 1, 5, 7

work page 2021

[36] [36]

Cross-modal variational alignment of latent spaces

Thomas Theodoridis et al. Cross-modal variational alignment of latent spaces. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4127–4136, 2020. 3

work page 2020

[37] [37]

CIDEr: Consensus-based image description evaluation

Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566–4575, 2015. 5

work page 2015

[38] [38]

Omnivl: one foundation model for image-language and video-language tasks

Junke Wang et al. Omnivl: one foundation model for image-language and video-language tasks. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022. 3

work page 2022

[39] [39]

Text is mass: Modeling as stochastic embedding for text-video retrieval

Jiamian Wang et al. Text is mass: Modeling as stochastic embedding for text-video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16551–16560, 2024. 2, 3, 5, 6

work page 2024

[40] [40]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, 2020. 2

work page 2020

[41] [41]

Msr-vtt: A large video description dataset for bridging video and language

Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5288–5296, 2016. 2, 5

work page 2016

[42] [42]

CLIP-vip: Adapting pre-trained image-text model to video-language alignment

Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. CLIP-vip: Adapting pre-trained image-text model to video-language alignment. In The Eleventh International Conference on Learning Representations, 2023. 3

work page 2023

[43] [43]

Advances in variational inference

Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2008–2026, 2017. 5

work page 2017