Unifying visual-semantic embeddings with multimodal neural language models

2014 · cs.LG · arXiv 1411.2539

10 Pith papers cite this work. Polarity classification is still indexing.

abstract

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence to its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
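
The vector-space arithmetic mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's code: the 3-dimensional vectors, the names `WORDS` and `IMAGES`, and the helper functions are all hypothetical, chosen by hand so that the dimensions loosely mean "car", "blue", and "red". The sketch assumes images and words already live in one shared embedding space and that nearness is measured by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_query(image_vec, minus_word, plus_word):
    """Build the query vector: image - minus_word + plus_word."""
    return [i - m + p for i, m, p in zip(image_vec, minus_word, plus_word)]

def rank(query, gallery):
    """Return gallery keys sorted from most to least similar to the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)

# Hypothetical hand-made embeddings; dimensions are [car, blue, red].
# A trained model would produce these from an image encoder and a word table.
WORDS = {"blue": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]}
IMAGES = {
    "blue_car": [1.0, 0.9, 0.0],
    "red_car": [1.0, 0.0, 0.9],
    "blue_sky": [0.0, 1.0, 0.1],
}

# image of a blue car - "blue" + "red"
query = analogy_query(IMAGES["blue_car"], WORDS["blue"], WORDS["red"])
ranking = rank(query, IMAGES)
print(ranking[0])  # red_car
```

In this toy setup the query lands closest to the red car image, which mirrors the regularity the abstract reports for linear encoders. With a real model the vectors would be high-dimensional and the nearest neighbours would be retrieved from a large image collection.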

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

USV: Towards Understanding the User-generated Short-form Videos

cs.CV · 2026-05-20 · unverdicted · novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

cs.CV · 2025-05-26 · conditional · novelty 6.0

VisRet improves text-to-image retrieval by generating images from text queries and then retrieving within the image modality, reporting average nDCG@30 gains of 0.125 with CLIP and 0.121 with E5-V across four benchmarks.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

VisualBERT: A Simple and Performant Baseline for Vision and Language

cs.CV · 2019-08-09 · conditional · novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval

cs.IR · 2019-06-26 · unverdicted · novelty 6.0

Soft-attention on audio inputs increases tempo robustness in cross-modal audio-to-sheet-music retrieval on synthesized piano data.

SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

Microsoft COCO Captions: Data Collection and Evaluation Server

cs.CV · 2015-04-01 · accept · novelty 6.0

Microsoft COCO Captions provides 1.5 million human captions across 330,000 images and a public server to evaluate captioning models with BLEU, METEOR, ROUGE, and CIDEr.

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

cs.LG · 2025-11-09 · unverdicted · novelty 5.0

MULTIBENCH++ is a new large-scale benchmark integrating over 30 datasets across 15 modalities and 20 tasks, accompanied by an open-source automated evaluation pipeline that establishes new performance baselines for multimodal fusion.

Root Mean Square Layer Normalization

cs.LG · 2019-10-16 · conditional · novelty 5.0

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.

citing papers explorer

Showing 10 of 10 citing papers.

USV: Towards Understanding the User-generated Short-form Videos
cs.CV · 2026-05-20 · unverdicted · none · ref 36 · internal anchor

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV · 2026-04-27 · unverdicted · none · ref 10

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
cs.CV · 2025-05-26 · conditional · none · ref 5 · internal anchor

Demystifying CLIP Data
cs.CV · 2023-09-28 · accept · none · ref 139 · internal anchor

VisualBERT: A Simple and Performant Baseline for Vision and Language
cs.CV · 2019-08-09 · conditional · none · ref 83 · internal anchor

Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
cs.IR · 2019-06-26 · unverdicted · none · ref 24 · internal anchor

SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
cs.CV · 2026-04-17 · unverdicted · none · ref 19

Microsoft COCO Captions: Data Collection and Evaluation Server
cs.CV · 2015-04-01 · accept · none · ref 24

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
cs.LG · 2025-11-09 · unverdicted · none · ref 27 · internal anchor

Root Mean Square Layer Normalization
cs.LG · 2019-10-16 · conditional · none · ref 14 · internal anchor
