Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. · 2021

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

cs.CV · 2026-04-26 · unverdicted · novelty 6.0

PhysLayer is a framework that decomposes images into depth layers, simulates physics with depth awareness, and synthesizes videos guided by language for more plausible animations.

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

cs.CV · 2025-04-22 · unverdicted · novelty 6.0

FreeGraftor performs subject-driven text-to-image generation without training by cross-image feature grafting via semantic matching, position-constrained attention fusion, and a noise initialization strategy that preserves reference geometry.

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

GLIA is a framework that adapts Vision Transformers to blind image quality assessment through dual-stream global-local interaction; the authors claim higher accuracy and robustness with fewer parameters.

citing papers explorer

Showing 3 of 3 citing papers.

PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

cs.CV · 2026-04-26 · unverdicted · none · ref 28

FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

cs.CV · 2025-04-22 · unverdicted · none · ref 25

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

cs.CV · 2026-05-18 · unverdicted · none · ref 17
