arxiv: 1411.2539 · v1 · pith:655ERAI4new · submitted 2014-11-10 · 💻 cs.LG · cs.CL · cs.CV

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

classification 💻 cs.LG · cs.CL · cs.CV
keywords multimodal · images · language · space · embedding · models · neural · blue

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences, while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g., *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
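
The vector-arithmetic claim in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the 3-dimensional vectors, the file names, and the helper functions `analogy_query` and `rank_images` are all hypothetical, chosen by hand so that the axes roughly mean "car", "blue", and "red". It only shows how a query built as image - "blue" + "red" could be ranked against other images by cosine similarity in a shared space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hand-made embeddings in a toy joint space.
# Axes are assumed to mean roughly [car-ness, blue-ness, red-ness].
word_vecs = {
    "blue": [0.0, 1.0, 0.0],
    "red": [0.0, 0.0, 1.0],
}
image_vecs = {
    "blue_car.jpg": [1.0, 0.9, 0.0],
    "red_car.jpg": [1.0, 0.0, 0.9],
    "red_apple.jpg": [0.1, 0.0, 1.0],
}

def analogy_query(image_vec, minus_word, plus_word):
    """Build the query vector: image - minus_word + plus_word."""
    minus = word_vecs[minus_word]
    plus = word_vecs[plus_word]
    return [i - m + p for i, m, p in zip(image_vec, minus, plus)]

def rank_images(query, exclude=()):
    """Return image names sorted from most to least similar to the query."""
    scored = [
        (cosine(query, vec), name)
        for name, vec in image_vecs.items()
        if name not in exclude
    ]
    return [name for _, name in sorted(scored, reverse=True)]

query = analogy_query(image_vecs["blue_car.jpg"], "blue", "red")
ranking = rank_images(query, exclude=("blue_car.jpg",))
print(ranking)
```

In this toy setup the red car ranks above the red apple because the query keeps the "car" component of the source image while swapping the colour component. In a learned space the same lookup would run over encoder outputs rather than hand-written vectors.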


Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. USV: Towards Understanding the User-generated Short-form Videos

    cs.CV 2026-05 unverdicted novelty 7.0

    Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.

  2. LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-laye...

  3. SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

    cs.CV 2026-04 unverdicted novelty 6.0

    SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

  4. VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval

    cs.CV 2025-05 conditional novelty 6.0

    VisRet improves text-to-image retrieval by generating images from text queries and then retrieving within the image modality, reporting average nDCG@30 gains of 0.125 with CLIP and 0.121 with E5-V across four benchmarks.

  5. Demystifying CLIP Data

    cs.CV 2023-09 accept novelty 6.0

    MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

  6. VisualBERT: A Simple and Performant Baseline for Vision and Language

    cs.CV 2019-08 conditional novelty 6.0

    VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.

  7. Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval

    cs.IR 2019-06 unverdicted novelty 6.0

    Soft-attention on audio inputs increases tempo robustness in cross-modal audio-to-sheet-music retrieval on synthesized piano data.

  8. Microsoft COCO Captions: Data Collection and Evaluation Server

    cs.CV 2015-04 accept novelty 6.0

    Microsoft COCO Captions provides 1.5 million human captions across 330,000 images and a public server to evaluate captioning models with BLEU, METEOR, ROUGE, and CIDEr.

  9. MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains

    cs.LG 2025-11 unverdicted novelty 5.0

    MULTIBENCH++ is a new large-scale benchmark integrating over 30 datasets across 15 modalities and 20 tasks, accompanied by an open-source automated evaluation pipeline that establishes new performance baselines for mu...

  10. Root Mean Square Layer Normalization

    cs.LG 2019-10 conditional novelty 5.0

    RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.