From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

Peter Young, Alice Lai, Micah Hodosh, Julia Hockenmaier · 2014 · DOI 10.1162/tacl_a_00166

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

open at publisher browse 9 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

SimCSE: Simple Contrastive Learning of Sentence Embeddings

cs.CL · 2021-04-18 · conditional · novelty 8.0

SimCSE achieves 76.3% unsupervised and 81.6% supervised Spearman's correlation on STS tasks with BERT-base, improving prior best results by 4.2% and 2.2% via simple contrastive learning.

NEST: Narrative Event Structures in Time for Long Video Understanding

cs.CV · 2026-06-18 · unverdicted · novelty 7.0

NEST is a new benchmark dataset for narrative event structures in long videos, with baselines reporting ETD below 8%, EL under 6%, EAE below 11%, and ERE at 35-44% F1.

LAION-5B: An open large-scale dataset for training next generation image-text models

cs.CV · 2022-10-16 · accept · novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

cs.CV · 2026-05-27 · unverdicted · novelty 5.0

VCap pairs reference captions as witnesses with visual signals as adjudicators to deliver hypergeometric-precision rewards for RL in visual captioning, enabling an 8B model to outperform SOTA on benchmarks and improve weak-to-strong generalization.

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

cs.CL · 2026-04-30 · unverdicted · novelty 5.0

A single hub text can unreasonably match many images in CLIP-based similarity, exposing vulnerabilities in cross-modal encoders for caption evaluation and retrieval.

From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 5.0

HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.

A quantitative analysis of semantic information in deep representations of text and images

cs.CL · 2025-05-21 · unverdicted · novelty 5.0

Semantic information in deep representations is distributed across many tokens and concentrated in specific layers, with directed predictability strongest in middle layers for text and varying by modality and language.

ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP

cs.CV · 2026-06-25 · unverdicted · novelty 4.0

ReasonCLIP-58M applies continual pretraining with visually grounded reasoning captions on 58M examples to improve CLIP-style models on commonsense and compositional reasoning tasks.

Multilingual Vision-Language Models, A Survey

cs.CL · 2025-09-26 · accept · novelty 3.0

The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.

citing papers explorer

Showing 1 of 1 citing paper after filters.

LAION-5B: An open large-scale dataset for training next generation image-text models cs.CV · 2022-10-16 · accept · none · ref 91
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer