Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Ryan Kiros; Ruslan Salakhutdinov; Richard S. Zemel

arXiv: 1411.2539 · v1 · pith:655ERAI4new · submitted 2014-11-10 · 💻 cs.LG · cs.CL · cs.CV

classification 💻 cs.LG cs.CL cs.CV

keywords multimodal, images, language, space, embedding, models, neural, blue

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences, while the decoder can generate novel descriptions from scratch. Using an LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
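The vector arithmetic mentioned in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the 3-dimensional toy vectors, the file names, and the helper functions `analogy_query` and `rank_by_similarity` are hypothetical and do not come from the paper, whose embeddings are learned by the encoder. The sketch only shows the mechanics of forming the query *image of a blue car* - "blue" + "red" and ranking candidate images by cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_query(image_vec, minus_word, plus_word):
    """Build the query vector: image - minus_word + plus_word."""
    return [i - m + p for i, m, p in zip(image_vec, minus_word, plus_word)]

def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine(query, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; the axes loosely mean (car, blue, red).
# A trained model would produce these vectors; they are hand-written here.
word_vecs = {"blue": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]}
image_vecs = {
    "blue_car.jpg": [1.0, 1.0, 0.0],
    "red_car.jpg": [1.0, 0.0, 1.0],
    "red_apple.jpg": [0.0, 0.0, 1.0],
}

query = analogy_query(image_vecs["blue_car.jpg"], word_vecs["blue"], word_vecs["red"])
ranking = rank_by_similarity(query, image_vecs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these hand-picked vectors the red car image ranks first. In a learned space the match would be approximate rather than exact, which is why the abstract says the result is "near" images of red cars.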

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

USV: Towards Understanding the User-generated Short-form Videos
cs.CV 2026-05 unverdicted novelty 7.0

Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
cs.CV 2026-04 unverdicted novelty 7.0

LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
cs.CV 2026-04 unverdicted novelty 6.0

SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
cs.CV 2025-05 conditional novelty 6.0

VisRet improves text-to-image retrieval by generating images from text queries and then retrieving within the image modality, reporting average nDCG@30 gains of 0.125 with CLIP and 0.121 with E5-V across four benchmarks.

Demystifying CLIP Data
cs.CV 2023-09 accept novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

VisualBERT: A Simple and Performant Baseline for Vision and Language
cs.CV 2019-08 conditional novelty 6.0

VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
cs.IR 2019-06 unverdicted novelty 6.0

Soft-attention on audio inputs increases tempo robustness in cross-modal audio-to-sheet-music retrieval on synthesized piano data.

Microsoft COCO Captions: Data Collection and Evaluation Server
cs.CV 2015-04 accept novelty 6.0

Microsoft COCO Captions provides 1.5 million human captions across 330,000 images and a public server to evaluate captioning models with BLEU, METEOR, ROUGE, and CIDEr.

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
cs.LG 2025-11 unverdicted novelty 5.0

MULTIBENCH++ is a new large-scale benchmark integrating over 30 datasets across 15 modalities and 20 tasks, accompanied by an open-source automated evaluation pipeline that establishes new performance baselines for mu...

Root Mean Square Layer Normalization
cs.LG 2019-10 conditional novelty 5.0

RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.