Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al · 2021

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

browse 9 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Dynamic Full-body Motion Agent with Object Interaction via Blending Pre-trained Modular Controllers

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

A two-stage framework augments HOI data with dynamic priors and blends pre-trained dynamic motion and static interaction agents via a composer network to enable long-term dynamic human-object interactions with higher success rates and reduced training time.

A Sanity Check on Composed Image Retrieval

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

The paper creates FISD, a controlled benchmark for composed image retrieval that removes query ambiguity via generative models, and proposes a multi-round agentic evaluation to assess models in interactive settings.

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds

cs.CV · 2025-11-23 · unverdicted · novelty 7.0

TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.

DynProto: Dynamic Prototype Evolution for Out-of-Distribution Detection

cs.CV · 2026-04-26 · unverdicted · novelty 6.0

DynProto dynamically captures coarse OOD patterns confused with each ID class and refines them into prototypes during testing to enable similarity-based OOD detection that outperforms prior methods on ImageNet benchmarks.

SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

Memorization In Stable Diffusion Is Unexpectedly Driven by CLIP Embeddings

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

Memorization in Stable Diffusion is driven by the structural duplication of the CLIP <eot> embedding inside <pad> tokens, which causes over-reliance on that vector; simple inference-time masking or token replacement suppresses it without quality loss.

Boosting MLLM Spatial Reasoning with Geometrically Referenced 3D Scene Representations

cs.CV · 2026-03-09 · unverdicted · novelty 6.0

GR3D turns 3D scene geometry into ID-indexed text references, enabling zero-shot MLLM spatial reasoning gains of 9% on VSI-Bench and 12% on MindCube.

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

cs.AI · 2025-12-03 · unverdicted · novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

cs.CV · 2025-05-26 · unverdicted · novelty 6.0

VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

citing papers explorer

Showing 3 of 3 citing papers after filters.

TRANSPORTER: Transferring Visual Semantics from VLM Manifolds cs.CV · 2025-11-23 · unverdicted · none · ref 70
TRANSPORTER generates videos from VLM logits using optimal transport to interpret model predictions on object attributes, actions, and scenes.
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents cs.AI · 2025-12-03 · unverdicted · none · ref 47
Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction cs.CV · 2025-05-26 · unverdicted · none · ref 58
VLM-3R augments VLMs with implicit 3D tokens from monocular video via geometry encoding and 200K+ 3D reconstructive QA pairs, plus a new 138K-pair temporal benchmark, to support spatial and embodied reasoning.

Learn- ing transferable visual models from natural language super- vision

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer