Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al · 2021

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

browse 9 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

cs.CV · 2025-11-23 · unverdicted · novelty 7.0

TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.

DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection

cs.CV · 2026-04-26 · unverdicted · novelty 6.0

DynProto dynamically builds OOD prototypes from ID-only data via coarse caching and fine clustering of confused samples to improve OOD detection in VLMs, cutting FPR95 by 11.6% on ImageNet benchmarks.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Memorization in Stable Diffusion is driven by the structural duplication of the CLIP <eot> embedding inside <pad> tokens, which causes over-reliance on that vector; simple inference-time masking or token replacement suppresses it without quality loss.

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

cs.CV · 2026-03-09 · unverdicted · novelty 6.0

GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

cs.AI · 2025-12-03 · unverdicted · novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

cs.CV · 2025-05-26 · unverdicted · novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

citing papers explorer

Showing 9 of 9 citing papers.

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers cs.CV · 2026-05-12 · unverdicted · none · ref 40
A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.
A Sanity Check on Composed Image Retrieval cs.CV · 2026-04-14 · unverdicted · none · ref 38
The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.
TRANSPORTER: Transferring Visual Semantics from VLM Manifolds cs.CV · 2025-11-23 · unverdicted · none · ref 70
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection cs.CV · 2026-04-26 · unverdicted · none · ref 37
DynProto dynamically builds OOD prototypes from ID-only data via coarse caching and fine clustering of confused samples to improve OOD detection in VLMs, cutting FPR95 by 11.6% on ImageNet benchmarks.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 51
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings cs.CV · 2026-04-06 · unverdicted · none · ref 23
Memorization in Stable Diffusion is driven by the structural duplication of the CLIP <eot> embedding inside <pad> tokens, which causes over-reliance on that vector; simple inference-time masking or token replacement suppresses it without quality loss.
Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations cs.CV · 2026-03-09 · unverdicted · none · ref 28
GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents cs.AI · 2025-12-03 · unverdicted · none · ref 47
Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction cs.CV · 2025-05-26 · unverdicted · none · ref 58
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

Learn- ing transferable visual models from natural language super- vision

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer