Learning Transferable Visual Models From Natural Language Supervision

Mixed citation behavior. Most common role is background (69%).

171 Pith papers citing it
Background 69% of classified citations
abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
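The pre-training task in the abstract (predicting which caption goes with which image) and the zero-shot transfer step can be illustrated with a small sketch. It assumes a symmetric cross-entropy over scaled cosine similarities, with pair i (image i, caption i) as the correct match. The function names, the temperature value, and the toy embeddings are hypothetical and chosen for illustration; this is not the released CLIP code.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Assumption: image_embs[i] and text_embs[i] form the matching pair,
    and temperature=0.07 is an illustrative value.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> image: each column should put its mass on the diagonal entry.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class prompt embedding most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(t))) for t in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical data).
    images = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    print("loss with matched pairs:", contrastive_loss(images, captions))
    print("loss with swapped pairs:", contrastive_loss(images, captions[::-1]))
    print("predicted class index:", zero_shot_predict([0.9, 0.1], captions))
```

In this sketch, zero-shot classification reuses the same similarity computation as training: the class names are embedded as text, and the image is assigned to the closest one, so no dataset-specific training is involved.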

citation-role summary

background 36 · method 8 · baseline 4 · other 1

representative citing papers

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Vision Harnessing Agent for Open Ad-hoc Segmentation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

VASA is a vision-guided agent for open ad-hoc segmentation that creates and validates masks through planning, tool use, and error recovery, outperforming baselines on the new PARS benchmark and RefCOCOm.

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • Editing Models with Task Arithmetic · cs.LG · 2022-12-08 · accept · none · ref 84 · internal anchor

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  • LAION-5B: An open large-scale dataset for training next generation image-text models · cs.CV · 2022-10-16 · accept · none · ref 59 · internal anchor

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  • Hierarchical Text-Conditional Image Generation with CLIP Latents · cs.CV · 2022-04-13 · accept · none · ref 40 · internal anchor

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  • GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models · cs.CV · 2021-12-20 · accept · none · ref 20 · internal anchor

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  • Diffusion Models Beat GANs on Image Synthesis · cs.LG · 2021-05-11 · accept · none · ref 49 · internal anchor

    Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.

  • DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset · cs.RO · 2024-03-19 · accept · none · ref 41 · internal anchor

    DROID is a new 76k-trajectory in-the-wild robot manipulation dataset spanning 564 scenes and 84 tasks that improves policy performance and generalization when used for training.

  • OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models · cs.CV · 2023-05-13 · accept · none · ref 8 · internal anchor

    OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.

  • Shap-E: Generating Conditional 3D Implicit Functions · cs.CV · 2023-05-03 · accept · none · ref 48 · internal anchor

    Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.

  • Scaling Laws and Interpretability of Learning from Repeated Data · cs.LG · 2022-05-21 · accept · none · ref 1 · internal anchor

    Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

  • Woosh: A Sound Effects Foundation Model · cs.SD · 2026-04-02 · accept · none · ref 28 · internal anchor

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  • Optical Context Compression Is Just (Bad) Autoencoding · cs.CV · 2025-12-03 · accept · none · ref 12 · internal anchor

    Vision-based optical context compression performs no better than direct autoencoding baselines like mean pooling or hierarchical encoders across compression ratios.