hub
Unifying visual-semantic embeddings with multimodal neural language models
10 Pith papers cite this work. Polarity classification is still indexing.
abstract
Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using an LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g., *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
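The vector-arithmetic regularity mentioned in the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the four-dimensional toy vectors are hand-made stand-ins rather than embeddings learned by the paper's encoders, and the helper names (normalize, arithmetic_query, rank_by_cosine) are hypothetical. It only shows the mechanics of forming the query image - "blue" + "red" and ranking candidate images by cosine similarity in a shared space.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def arithmetic_query(image_vec, minus_word, plus_word):
    """Form the query  image - word + word  and renormalise it."""
    return normalize(image_vec - minus_word + plus_word)

def rank_by_cosine(query, candidates):
    """Return candidate indices sorted from most to least similar to the query."""
    scores = normalize(candidates) @ normalize(query)
    return np.argsort(-scores)

# Hand-made toy space (NOT learned embeddings); axes: [car, apple, blue, red].
blue_word = np.array([0.0, 0.0, 1.0, 0.0])
red_word = np.array([0.0, 0.0, 0.0, 1.0])

blue_car_img = np.array([1.0, 0.0, 1.0, 0.0])
red_car_img = np.array([1.0, 0.0, 0.0, 1.0])
red_apple_img = np.array([0.0, 1.0, 0.0, 1.0])

# Candidate image embeddings, one per row.
images = np.stack([blue_car_img, red_car_img, red_apple_img])

# "image of a blue car" - "blue" + "red"
query = arithmetic_query(blue_car_img, blue_word, red_word)
ranking = rank_by_cosine(query, images)
```

In this toy setup the top-ranked candidate is the red car (row 1 of images). With real data, the image and word vectors would come from the trained encoders, and the same cosine ranking could be used for image-sentence retrieval in either direction.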
citation-role summary
citation-polarity summary
roles
background 1
polarities
background 1
citing papers explorer
- USV: Towards Understanding the User-generated Short-form Videos
  Introduces the USV dataset of 224K short user-generated videos and benchmarks topic recognition plus video-text retrieval with MMF-Net and VTCL baselines.
- LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models
  LearnPruner prunes vision tokens to 5.5% of the original count while retaining about 95% of VLM performance and delivering 3.2 times faster inference by fixing attention sink in encoders and using unbiased middle-layer attention in LLMs.
- VisRet: Visualization Improves Knowledge-Intensive Text-to-Image Retrieval
  VisRet improves text-to-image retrieval by generating images from text queries and then retrieving within the image modality, reporting average nDCG@30 gains of 0.125 with CLIP and 0.121 with E5-V across four benchmarks.
- Demystifying CLIP Data
  MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
- VisualBERT: A Simple and Performant Baseline for Vision and Language
  VisualBERT is a Transformer model that implicitly aligns text and image regions through self-attention and achieves competitive or superior results on VQA, VCR, NLVR2, and Flickr30K after pre-training on captions.
- Learning Soft-Attention Models for Tempo-invariant Audio-Sheet Music Retrieval
  Soft-attention on audio inputs increases tempo robustness in cross-modal audio-to-sheet-music retrieval on synthesized piano data.
- SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding
  SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.
- Microsoft COCO Captions: Data Collection and Evaluation Server
  Microsoft COCO Captions provides 1.5 million human captions across 330,000 images and a public server to evaluate captioning models with BLEU, METEOR, ROUGE, and CIDEr.
- MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
  MULTIBENCH++ is a new large-scale benchmark integrating over 30 datasets across 15 modalities and 20 tasks, accompanied by an open-source automated evaluation pipeline that establishes new performance baselines for multimodal fusion.
- Root Mean Square Layer Normalization
  RMSNorm delivers re-scaling invariance and comparable accuracy to LayerNorm while cutting computation by skipping mean subtraction, yielding 7-64% runtime reductions across tested models.