
arxiv: 2605.15723 · v1 · pith:NEITGKFTnew · submitted 2026-05-15 · 💻 cs.LG · cs.CV

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective

Pith reviewed 2026-05-20 19:38 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords multimodal alignment · graph signal smoothing · multimodal attributed graphs · frozen embeddings · retrieval · structure-driven · post-alignment

The pith

Graph structure from multimodal attributed graphs refines frozen vision-language embeddings for improved retrieval.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that relational context in multimodal attributed graphs can refine embeddings from frozen models like CLIP without retraining the encoders. It treats those embeddings as graph signals and applies a controlled smoothing process that respects the different modality neighborhoods and stops before representations lose their distinct meaning. A sympathetic reader would care because this turns existing corpus structure into a lightweight post-processing step that, according to the authors, raises retrieval accuracy and stability on seven MAG benchmarks. If the claim holds, it matters because current multimodal systems largely ignore relations among entities and train only on isolated pairs.

Core claim

GOMA is a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals on MAGs. It learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before semantic boundaries collapse. This unified retrieval-oriented design lets the graph serve as an effective post-encoder in a transductive setting where the graph supplies only unlabeled context and self-pair edges are removed.
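
The coupled smoothing step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's equations: the update rule, the function name `coupled_smoothing`, and the parameters `beta` (cross-modal coupling) and `rho` (restart strength) are hypothetical, and the operators are fixed toy matrices rather than learned ones. The only properties taken from the text are that propagation runs for a finite number of steps, that the cross-modal operator has no diagonal (self-pair) entries, and that the whole trajectory is kept for a later readout.

```python
def row_normalize(A):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    return [[a / sum(row) if sum(row) > 0 else 0.0 for a in row] for row in A]

def matmul(A, X):
    """Apply an n x n operator A to an n x d signal X (lists of rows)."""
    return [[sum(a * x[c] for a, x in zip(row, X)) for c in range(len(X[0]))]
            for row in A]

def coupled_smoothing(V0, T0, A_v, A_t, C, steps=3, beta=0.3, rho=0.2):
    """Return the trajectory [(V_0, T_0), ..., (V_K, T_K)].

    Assumed update (illustrative only):
      V_{k+1} = rho*V_0 + (1-rho)*((1-beta)*A_v V_k + beta*C   T_k)
      T_{k+1} = rho*T_0 + (1-rho)*((1-beta)*A_t T_k + beta*C^T V_k)
    """
    n = len(V0)
    # No diagonal cross-modal shortcuts: node i never reads its own pair.
    C = [[0.0 if i == j else C[i][j] for j in range(n)] for i in range(n)]
    C_vt = row_normalize(C)                               # text -> image
    C_tv = row_normalize([list(col) for col in zip(*C)])  # image -> text
    A_v, A_t = row_normalize(A_v), row_normalize(A_t)

    def mix(X0, intra, cross):
        return [[rho * x0 + (1 - rho) * ((1 - beta) * a + beta * c)
                 for x0, a, c in zip(r0, ra, rc)]
                for r0, ra, rc in zip(X0, intra, cross)]

    V, T = V0, T0
    trajectory = [(V, T)]
    for _ in range(steps):
        # Both modalities are updated from the previous step simultaneously.
        V, T = (mix(V0, matmul(A_v, V), matmul(C_vt, T)),
                mix(T0, matmul(A_t, T), matmul(C_tv, V)))
        trajectory.append((V, T))
    return trajectory

# Toy 3-node graph with 2-dimensional signals (made-up values).
A_v = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]   # image-image neighborhoods
A_t = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]   # text-text neighborhoods
C = [[1, 1, 1], [1, 1, 0], [1, 0, 1]]     # cross-modal links, diagonal is dropped
V0 = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
T0 = [[0.8, 0.2], [1.0, 0.0], [0.1, 0.9]]
trajectory = coupled_smoothing(V0, T0, A_v, A_t, C, steps=3)
```

Because every operator is row-normalized and the mixing weights sum to one, each step is a convex combination of existing signals, so the signals contract toward each other as the depth grows. That is the over-smoothing risk the finite depth is meant to contain.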

What carries the argument

Modality-aware propagation operators combined with finite-step coupled smoothing and adaptive readout of node-specific trajectories.

Load-bearing premise

Finite-step coupled smoothing without diagonal cross-modal shortcuts plus adaptive readout of node-specific trajectories can preserve useful smoothing before semantic boundaries collapse.
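
To make the readout idea in this premise concrete, here is a small sketch in which each node mixes its own smoothing trajectory with a softmax over depths. The per-node depth logits stand in for whatever GOMA actually learns; the names `adaptive_readout` and `depth_logits` are hypothetical, and the trajectory values are made up so that the last depth is fully collapsed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def adaptive_readout(trajectory, depth_logits):
    """Mix each node's trajectory with its own depth weights.

    trajectory:   list of K+1 signals, each an n x d list of rows
    depth_logits: n x (K+1) scores, one row per node (assumed learned)
    """
    n, d = len(trajectory[0]), len(trajectory[0][0])
    out = []
    for i in range(n):
        w = softmax(depth_logits[i])
        out.append([sum(w[k] * trajectory[k][i][c] for k in range(len(trajectory)))
                    for c in range(d)])
    return out

# Two nodes, three depths; depth 2 has collapsed to identical rows.
toy_trajectory = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.2, 0.8]],
    [[0.5, 0.5], [0.5, 0.5]],
]
toy_logits = [
    [0.0, 0.0, 0.0],    # node 0 averages all depths equally
    [0.0, 50.0, 0.0],   # node 1 reads almost only depth 1
]
readout = adaptive_readout(toy_trajectory, toy_logits)
```

The point of the per-node weights is that a node can stop at a shallow depth while its neighbors use deeper smoothing, so the collapsed depth never has to be read by nodes that do not benefit from it.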

What would settle it

On the seven MAG benchmarks, a result where GOMA retrieval accuracy falls below the strongest graph competitor or where its stability advantage disappears would show that the smoothing regime fails to retain informative signals.
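
For readers who want to check such a result themselves, the retrieval metrics named on this page (R@K and MeanR) can be computed from a query-by-candidate similarity matrix as sketched below. The sketch assumes that candidate i is the true match of query i and that ties count against the query; both are illustrative conventions, and the similarity values are made up.

```python
def ranks_of_matches(sim):
    """sim[i][j] is the similarity of query i to candidate j.

    Candidate i is assumed to be the true match of query i. Returns the
    1-based rank of the true match per query; ties count against it.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        better = sum(1 for j, s in enumerate(row) if j != i and s >= target)
        ranks.append(1 + better)
    return ranks

def recall_at_k(ranks, k):
    """Fraction of queries whose true match is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Average rank of the true match (lower is better)."""
    return sum(ranks) / len(ranks)

# Made-up image-to-text similarities for three queries.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = ranks_of_matches(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
mean_r = mean_rank(ranks)
```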

Figures

Figures reproduced from arXiv: 2605.15723 by Guoren Wang, Rong-Hua Li, Xunkai Li, Xu Wang, Yinlin Zhu.

Figure 1. Empirical evidence on Toys and Grocery. (a) Graph support: hard queries in the upper-left quadrant receive low pairwise similarity but high structural support. (b) Topology mismatch: category purity and V/T kNN overlap. (c) Finite-depth smoothing: mean retrieval rank improves then degrades at shallow depths; semantic separation peaks concurrently on Toys. Together they motivate GOMA.
Figure 2. Overview of the proposed GOMA framework. Frozen multimodal embeddings are treated as graph signals, refined over learned modality-aware operators, and read out from finite smoothing trajectories.
Figure 3. Protocol and topology controls. (a) Randomizing the candidate graph sharply reduces R@1, indicating that graph-side gains depend on meaningful neighborhood structure. (b) Self-pair image-text links create direct answer paths; reported GOMA removes them and uses only non-self cross-modal links.
Figure 4. In-depth analysis. Rank quality, seed stability, and training convergence against the strongest pairwise image-text and graph baselines.
Figure 5. Hyperparameter sensitivity on Grocery. R@10 sweeps over propagation depth, cross-modal coupling, and restart strength. MeanR is annotated at each point.
Figure 6. t-SNE visualization on Grocery (referred to in §5.2). Red = image, blue = text. DGF retains local modality offsets (gap = 0.08), while GOMA nearly overlaps paired neighborhoods (gap = 0.03). Panel titles: Raw (modality gap = 0.48), Linear (modality gap = 0.12), DGF (modality gap = 0.08), GOMA (modality gap = 0.03).
Abstract

Multimodal alignment is commonly learned from isolated image-text pairs via CLIP-style dual encoders, leaving the relational context among entities largely unused. Multimodal attributed graphs (MAGs), where nodes carry multimodal attributes and edges encode corpus structure, provide a natural setting for refining frozen vision-language embeddings. This refinement is challenging: visual, textual, and cross-modal relations often induce different neighborhood geometries, while unrestricted graph propagation can quickly over-smooth retrieval representations. Effectively leveraging graph context therefore requires simultaneously breaking modality-specific topological barriers, controlling the smoothing regime, and preserving informative smoothing before semantic boundaries collapse. We propose Graph-Optimized Multimodal Alignment (GOMA), a structure-driven post-alignment framework that views frozen multimodal embeddings as graph signals and addresses these requirements through a unified retrieval-oriented design. GOMA decouples three key design choices: where messages should flow, how multimodal evidence should propagate, and which smoothing depth should be retained. Concretely, it learns modality-aware propagation operators, performs finite-step coupled smoothing without diagonal cross-modal shortcuts, and adaptively reads out node-specific smoothing trajectories to preserve useful smoothing before collapse. All experiments follow a transductive MAG retrieval protocol where the graph serves only as unlabeled context and diagonal self-pair edges are removed. On seven MAG benchmarks, GOMA achieves state-of-the-art or tied state-of-the-art retrieval and remains substantially more stable than the strongest graph competitor, demonstrating that MAG structure can serve as an effective post-encoder for frozen multimodal embeddings.
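
The protocol detail that diagonal self-pair edges are removed is easy to state in code. The sketch below builds a row-normalized cross-modal operator from an edge list and drops every (i, i) link before normalizing. The function name `build_cross_modal_operator`, the dense list-of-lists layout, and the toy edge list are assumptions for illustration; the paper's learned operators are not reproduced here.

```python
def build_cross_modal_operator(n, edges):
    """Build an n x n image-to-text operator from (image_node, text_node) edges.

    Self-pair edges (i, i) are dropped, so a node's own paired item can never
    act as a direct answer path. Rows are normalized to sum to 1; nodes with
    no remaining cross-modal neighbor keep an all-zero row.
    """
    C = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            continue  # remove the diagonal self-pair link
        C[i][j] = 1.0
    for row in C:
        total = sum(row)
        if total > 0:
            row[:] = [c / total for c in row]
    return C

# Toy edge list: node 2 only has its self-pair link, which is removed.
edges = [(0, 0), (0, 1), (0, 2), (1, 1), (1, 0), (2, 2)]
C = build_cross_modal_operator(3, edges)
```

In this toy case node 2 ends up with no cross-modal neighbors at all, which shows why the protocol matters: once the self-pair link is gone, any cross-modal evidence a node receives has to come from other entities in the graph.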

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Graph-Optimized Multimodal Alignment (GOMA), a post-alignment framework that treats frozen multimodal embeddings as graph signals on multimodal attributed graphs (MAGs). It decouples design choices by learning modality-aware propagation operators, applying finite-step coupled smoothing without diagonal cross-modal shortcuts, and using adaptive readout of node-specific smoothing trajectories to control smoothing depth and avoid semantic collapse. In a transductive retrieval protocol on seven MAG benchmarks (with the graph as unlabeled context and self-pair edges removed), GOMA reports state-of-the-art or tied SOTA retrieval performance and substantially higher stability than the strongest graph-based competitor, arguing that MAG structure serves as an effective post-encoder for frozen vision-language embeddings.

Significance. If the empirical claims hold under rigorous controls, the work would be significant for multimodal representation learning: it demonstrates a structure-driven refinement strategy that leverages relational context without retraining dual encoders, while explicitly addressing over-smoothing risks that arise from differing neighborhood geometries across modalities. The unified retrieval-oriented design and the emphasis on finite-step coupled smoothing without shortcuts provide a principled way to retain useful signal before boundaries collapse. The transductive protocol with unlabeled context is a clean evaluation setting that isolates the post-encoder contribution.

major comments (2)
  1. §5 (Experiments) and the abstract: the central SOTA and stability claims rest on retrieval metrics across seven benchmarks, yet no error bars, standard deviations, or statistical significance tests are reported for the performance gains or stability improvements. This is load-bearing because the abstract asserts 'substantially more stable' and 'state-of-the-art or tied state-of-the-art' without quantifying variability or the exact smoothing depths and data exclusion rules used, making it impossible to verify that the gains are attributable to the proposed operators rather than evaluation artifacts.
  2. §3.2–3.3 (Modality-aware operators and adaptive readout): the finite-step coupled smoothing omits diagonal cross-modal shortcuts and relies on adaptive readout of per-node trajectories to 'preserve useful smoothing before collapse.' However, the manuscript does not provide an ablation or theoretical bound showing that this specific combination reliably prevents semantic boundary collapse on retrieval embeddings when neighborhood geometries differ across modalities; without such evidence the attribution of observed gains to the structure-driven mechanism remains unverified.
minor comments (2)
  1. Notation for the modality-aware propagation operators is introduced without an explicit equation reference in the main text; adding a numbered equation for the operator definition would improve clarity.
  2. The description of the transductive protocol (removal of diagonal self-pair edges) is clear in the abstract but should be restated with a citation to the precise experimental subsection for readers who skip the abstract.
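
The variability reporting requested in major comment 1 could look roughly like the following: a mean and sample standard deviation over seeds, plus a paired t-statistic between two methods evaluated on the same seeds. The score lists are placeholder values chosen for illustration and are not results from the paper; the function names are hypothetical. Turning the statistic into a p-value needs the t distribution, for which a library routine such as `scipy.stats.ttest_rel` would normally be used.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(a, b):
    """Paired t-statistic and degrees of freedom for matched runs a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1

# Placeholder R@1 values over 5 seeds (NOT taken from the paper).
scores_a = [0.60, 0.62, 0.61, 0.63, 0.59]
scores_b = [0.58, 0.59, 0.60, 0.60, 0.58]

mean_a, std_a = summarize(scores_a)
mean_b, std_b = summarize(scores_b)
t_stat, dof = paired_t_statistic(scores_a, scores_b)
```

Pairing by seed matters here because both methods share the same frozen features and graph, so seed-level noise is correlated and an unpaired test would understate the evidence.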

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. The comments highlight important aspects for strengthening the empirical rigor and mechanistic understanding of GOMA. We address each major comment in detail below, indicating the revisions we plan to incorporate in the updated version.

Point-by-point responses
  1. Referee: §5 (Experiments) and the abstract: the central SOTA and stability claims rest on retrieval metrics across seven benchmarks, yet no error bars, standard deviations, or statistical significance tests are reported for the performance gains or stability improvements. This is load-bearing because the abstract asserts 'substantially more stable' and 'state-of-the-art or tied state-of-the-art' without quantifying variability or the exact smoothing depths and data exclusion rules used, making it impossible to verify that the gains are attributable to the proposed operators rather than evaluation artifacts.

    Authors: We agree that reporting measures of variability and clarifying experimental details are essential for validating the claims. In the revised manuscript, we will augment Section 5 with error bars representing standard deviations computed over 5 independent runs using different random seeds for the initialization of the modality-aware operators and the adaptive readout parameters. We will also perform statistical significance testing (e.g., paired t-tests) between GOMA and the strongest baselines to quantify the reliability of the performance gains. Furthermore, we will explicitly state the exact smoothing depths (number of propagation steps) selected for each benchmark via the adaptive readout, and reiterate the data exclusion rules (removal of diagonal self-pair edges) in the experimental protocol. These additions will allow readers to better assess that the improvements stem from the proposed structure-driven mechanisms rather than artifacts. revision: yes

  2. Referee: §3.2–3.3 (Modality-aware operators and adaptive readout): the finite-step coupled smoothing omits diagonal cross-modal shortcuts and relies on adaptive readout of per-node trajectories to 'preserve useful smoothing before collapse.' However, the manuscript does not provide an ablation or theoretical bound showing that this specific combination reliably prevents semantic boundary collapse on retrieval embeddings when neighborhood geometries differ across modalities; without such evidence the attribution of observed gains to the structure-driven mechanism remains unverified.

    Authors: We concur that additional evidence would better substantiate the role of the finite-step coupled smoothing and adaptive readout in mitigating semantic collapse. To address this, we will include a new ablation study in the experiments section. This study will compare the full GOMA model against variants: (i) fixed-step smoothing without adaptive readout, (ii) inclusion of diagonal cross-modal shortcuts, and (iii) modality-agnostic propagation operators. We will report retrieval performance as well as metrics indicative of embedding quality, such as average cosine similarity between originally aligned pairs after varying propagation steps, to demonstrate that the adaptive mechanism preserves useful signal before collapse. While a formal theoretical bound on collapse prevention is beyond the scope of this work and not provided here, the empirical ablations combined with the analysis in §3 on differing neighborhood geometries will provide stronger support for the attribution of gains to the proposed design choices. revision: partial
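
The embedding-quality metric proposed in the second response, average cosine similarity between originally aligned pairs at each propagation depth, is only informative next to the similarity of non-paired nodes, since both approach 1 under collapse. A minimal sketch of that diagnostic follows. The margin between paired and unpaired similarity is an added assumption rather than something the rebuttal specifies, and the embeddings are made-up values.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def paired_and_unpaired_cosine(V, T):
    """Mean cosine over aligned pairs (i, i) and over mismatched pairs (i, j)."""
    n = len(V)
    paired = sum(cosine(V[i], T[i]) for i in range(n)) / n
    unpaired = sum(cosine(V[i], T[j])
                   for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    return paired, unpaired

# Made-up (image, text) embeddings at depths 0, 1, 2; depth 2 is collapsed.
depths = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    ([[0.8, 0.2], [0.2, 0.8]], [[0.7, 0.3], [0.3, 0.7]]),
    ([[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]),
]
margins = []
for V, T in depths:
    paired, unpaired = paired_and_unpaired_cosine(V, T)
    margins.append(paired - unpaired)  # shrinks to 0 as boundaries collapse
```

In this toy trajectory the paired similarity stays high at every depth while the margin falls to zero, which is why paired similarity alone cannot show that the adaptive readout stops before collapse.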

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark evaluation, not self-referential derivations

Full rationale

The paper introduces GOMA as a proposed framework for refining frozen multimodal embeddings via graph signal smoothing on MAGs, specifying design choices such as modality-aware operators, finite-step coupled smoothing without diagonal shortcuts, and adaptive readout of trajectories. All performance claims (SOTA or tied-SOTA retrieval on seven benchmarks, improved stability) are presented as outcomes of transductive experiments where the graph provides unlabeled context and self-pair edges are removed. No equations, derivations, or fitted parameters are described in the provided text that would reduce the claimed improvements to quantities defined by construction from the same inputs or prior self-citations. The central premise that MAG structure serves as an effective post-encoder is supported by external benchmark results rather than any tautological reduction, making the chain self-contained.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The framework rests on domain assumptions about graph signal propagation and several proposed operators whose effectiveness is asserted rather than derived from first principles.

free parameters (2)
  • modality-aware propagation operator parameters
    Learned operators that control message flow per modality; their values are not stated as fixed constants from prior literature.
  • smoothing depth selection
    Adaptive readout implies per-node choice of trajectory length, which functions as a fitted or selected parameter.
axioms (2)
  • domain assumption: Different modalities induce distinct neighborhood geometries that can be bridged by learned propagation operators without diagonal shortcuts.
    Invoked to justify breaking modality-specific topological barriers while preserving retrieval utility.
  • domain assumption: Finite-step smoothing can be stopped before semantic boundaries collapse if node-specific trajectories are read out adaptively.
    Central premise for controlling the smoothing regime in the unified design.
invented entities (1)
  • modality-aware propagation operators (no independent evidence)
    purpose: To decouple where messages flow across visual, textual, and cross-modal relations.
    New operators introduced to handle differing neighborhood geometries; no independent falsifiable evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5810 in / 1504 out tokens · 60908 ms · 2026-05-20T19:38:42.616071+00:00 · methodology


