Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever · 2021

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

A training-free pipeline combining SAM3 for masks and transformed DINOv3 embeddings for prototype matching delivers the first reported baseline for fine-grained fungi segmentation across one-shot to few-hundred-shot regimes.

Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

Ramen enables robust test-time adaptation of vision-language models under mixed-domain shifts by actively selecting domain-consistent and prediction-balanced samples via an embedding-gradient cache.

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

MODIX dynamically rescales positional indices in VLMs using intra-modal covariance-based entropy and inter-modal alignment scores to allocate finer granularity to informative content.

Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence

cs.CV · 2026-04-10 · unverdicted · novelty 5.0

Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.

citing papers explorer

Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
cs.CV · 2026-05-21 · unverdicted · none · ref 14
Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
cs.CV · 2026-04-23 · unverdicted · none · ref 35
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
cs.CV · 2026-04-14 · unverdicted · none · ref 32
Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
cs.CV · 2026-04-10 · unverdicted · none · ref 33
