VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

David J. Fleet; Fartash Faghri; Jamie Ryan Kiros; Sanja Fidler

This paper has not been read by Pith yet. Machine review is queued; the pith claim, tier, and objections will appear here once it completes.

arXiv 1707.05612 v4 · pith:BWYZ5PAJ · submitted 2017-07-18 · cs.LG cs.CL cs.CV

Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

classification: cs.LG, cs.CL, cs.CV

keywords: retrieval, embeddings, hard, approach, functions, loss, methods, ms-coco

Abstract

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
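
The abstract does not spell out the "simple change" to the loss. One reading, stated here as an assumption and not taken from this page, is that a triplet ranking loss which sums hinge terms over every negative in a mini-batch is changed to use only the hardest negative per positive pair. The sketch below contrasts the two forms in plain Python. The function names, the margin of 0.2, and the convention that `sim[i][j]` holds the similarity of image `i` and caption `j` (matching pairs on the diagonal) are all illustrative choices, not the authors' code.

```python
from typing import List

Matrix = List[List[float]]


def hinge(x: float) -> float:
    """Positive part [x]_+ used by ranking losses."""
    return max(0.0, x)


def sum_of_hinges(sim: Matrix, margin: float = 0.2) -> float:
    """Ranking loss that sums violations over ALL negatives in the batch.

    sim[i][j] is the similarity of image i and caption j (assumed layout);
    sim[i][i] is the score of the matching pair.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            total += hinge(margin + sim[i][j] - pos)  # wrong caption for image i
            total += hinge(margin + sim[j][i] - pos)  # wrong image for caption i
    return total


def max_of_hinges(sim: Matrix, margin: float = 0.2) -> float:
    """Same loss, but keeping only the HARDEST negative in each direction.

    Requires a batch of at least two pairs so a negative exists.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_caption = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        total += hinge(margin + hardest_caption - pos)
        total += hinge(margin + hardest_image - pos)
    return total


# Toy batch of three image-caption pairs (made-up similarity scores).
toy_sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.1, 0.6, 0.5],
]
loss_sum = sum_of_hinges(toy_sim)
loss_max = max_of_hinges(toy_sim)
```

Because each hardest-negative term is one of the terms in the full sum, `loss_max` can never exceed `loss_sum` on the same batch; the difference is that the gradient concentrates on the single most confusing negative and is not spread over many easy ones.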

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Identifying Latent Concepts and Structures for Generalized Category Discovery
cs.CV 2026-07 unverdicted novelty 7.0

CPF-GCD enforces low-rank compositional structure on vision backbone features via spatial primitive fields so that novel categories emerge as new activation patterns over a shared vocabulary of reusable visual primitives.

Cross-modal linkage risk in clinical vision-language models
cs.CV 2026-06 conditional novelty 7.0

Clinical VLMs enable image-to-report retrieval far above chance (15-50x at N=100-10k), persisting beyond disease labels, with targeted DP on projection heads cutting Recall@1 by 61.8% and preserving AUROC.

LC-ICL: Label-Guided Contrastive In-Context Learning for Robust Information Extraction
cs.CL 2026-06 unverdicted novelty 6.0

LC-ICL improves few-shot NER and RE by using label-guided contrastive demonstrations that pair positive samples with error-annotated negative samples.

MAPS: Multi-Anchor Projection Similarity for Joint Vision-Language Geo-Localization
cs.CV 2026-06 unverdicted novelty 6.0

MAPS defines a new projection-based similarity for joint vision-language geo-localization queries and pairs it with a contrastive loss to reach claimed state-of-the-art retrieval performance.

ATCCaps: A Call-Sign-Aware Speech Dataset for Air Traffic Control Recognition
cs.SD 2026-06 unverdicted novelty 6.0

ATCCaps is a call-sign-aware ATC speech dataset containing 202.94 hours of audio, 170385 utterances and 922 unique call signs, constructed via transcript parsing, ADS-B metadata, normalization, filtering and LLM captioning.

Organizational Control Layer: Governance Infrastructure at the Execution Boundary of LLM Agent Systems
cs.MA 2026-06 unverdicted novelty 6.0

OCL is a governance layer for LLM agents that cuts unsafe executions from 88% to near-zero and raises valid success from 12% to 96% in adversarial buyer-seller negotiations across frontier LLMs.

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search
cs.CV 2026-06 unverdicted novelty 6.0

ROGLE automates region-level supervision via Region-to-Sentence Matching and introduces the P-VLG benchmark to improve fine-grained alignment in text-based person search over CLIP-based models.

LatentUMM: Dual Latent Alignment for Unified Multimodal Models
cs.CV 2026-05 unverdicted novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective
cs.LG 2026-05 unverdicted novelty 6.0

GOMA refines frozen multimodal embeddings via modality-aware graph signal smoothing on attributed graphs to improve retrieval while avoiding over-smoothing.

Federated Cross-Modal Retrieval with Missing Modalities via Semantic Routing and Adapter Personalization
cs.CV 2026-04 unverdicted novelty 6.0

RCSR is a personalization-friendly federated framework that improves cross-modal retrieval accuracy and stability under missing modalities via semantic routing and adapters.

PandaGPT: One Model To Instruction-Follow Them All
cs.CL 2023-05 conditional novelty 6.0

A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

Uncertainty-Aware Cross-Modal Remote Sensing Image-Text Retrieval via Evidential Learning
cs.IR 2026-07 conditional novelty 5.0

ELC models image–text matches as Dirichlet distributions, aligns uncertainty with retrieval correctness, and selectively applies RS-aware TTA to high-uncertainty queries for more robust CMRSITR under noise.

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search
cs.CV 2026-06 unverdicted novelty 5.0

ROGLE introduces automated pseudo region-sentence pairs via RSM and multi-granular learning to boost fine-grained alignment in text-based person search, plus the P-VLG benchmark with over 100k annotated regions.

Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
cs.CV 2026-04 unverdicted novelty 5.0

STBIR fuses sketches and text via curriculum robustness, category optimization, and staged alignment to outperform prior methods on a new fine-grained benchmark dataset.

Variational Adapter for Cross-modal Similarity Representation
cs.CV 2026-05 unverdicted novelty 4.0

VACSR reformulates cross-modal similarity learning as variational inference with regularization to mitigate binary annotation compression in image-text tasks.

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval
cs.CV 2026-06 unverdicted novelty 3.0

DREAM reports new state-of-the-art recall@1 scores of 49.4%, 49.7%, and 27.3% on MSRVTT, MSVD, and LSMDC by combining masked and permuted language modeling with cascaded group attention in vision.