super hub Mixed citations

Learning Transferable Visual Models From Natural Language Supervision

Aditya Ramesh, Alec Radford, Chris Hallacy, Gabriel Goh, Jong Wook Kim, Sandhini Agarwal · 2021 · cs.CV · arXiv 2103.00020

Mixed citation behavior. Most common role is background (69%).

225 Pith papers citing it

Background 69% of classified citations

open full Pith review browse 225 citing papers more from Aditya Ramesh arXiv PDF

abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 36 method 8 baseline 4 other 1

citation-polarity summary

background 34 use method 8 baseline 4 unclear 2 support 1

claims ledger

abstract State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (i

authors

Aditya Ramesh Alec Radford Chris Hallacy Gabriel Goh Jong Wook Kim Sandhini Agarwal

co-cited works

representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

MatMMExtract pipeline creates MatSciFig dataset of 391k annotated materials science figure panels and MaterialScope detection dataset with high accuracy.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Editing Models with Task Arithmetic

cs.LG · 2022-12-08 · accept · novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

Prompt-to-Prompt Image Editing with Cross Attention Control

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cs.CV · 2022-08-02 · unverdicted · novelty 8.0

Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

cs.AI · 2026-06-28 · conditional · novelty 7.0

AOI adds keyframe capture, volume-gated audio transcription, and visual narration to computer-use agents, producing +17 to +48 pp gains over screenshot baselines on DynaCU-Bench with no retraining.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Unleashing Infinite Motion: Scaling Expressive Quadrupedal Motion via Generative Video Priors

cs.RO · 2026-06-26 · conditional · novelty 7.0

Uni-Mo generates 7,488 language-annotated quadruped motions via LLM prompts and video diffusion, lifts them to 3D trajectories, and trains policies achieving 96.7% real-robot success on 392 sampled motions.

Evaluation Pitfalls and Challenges in Multimedia Event Extraction

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

A systematic analysis of evaluation practices in multimedia event extraction reveals that minor methodological choices cause large performance swings and overestimation of cross-modal grounding ability.

Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

cs.AI · 2026-06-22 · unverdicted · novelty 7.0

STREAM decouples text and music conditioning in a diffusion transformer via AdaLN for structure and BEAM for beats, plus new Motorica++ dataset and editability metrics, claiming SOTA music alignment with preserved semantics.

Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

cs.LG · 2026-06-09 · accept · novelty 7.0

A spiked signal-plus-noise model yields separation ratios that partition multimodal problems into four regimes where alignment, prediction, both, or neither succeed.

$A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.

The Regularizing Power of Language-Training Deepfake Detectors

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

PInVerify is a new offline embodied benchmark for active instance verification that supplies multi-view captures and 6-sector navigation topology, with MLLM baselines reaching 85.6% after fine-tuning but showing no reliable benefit from tested next-best-view strategies.

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

Dex2HOI is a dual-stream diffusion model with bidirectional cross-attention and motion fusion that generates long bimanual single- and two-object HOI sequences from text at real-time speeds.

Colosseum V2: Benchmarking Generalization for Vision Language Action Models

cs.RO · 2026-05-26 · unverdicted · novelty 7.0

Introduces Colosseum V2 benchmark for evaluating VLA model generalization in robotic manipulation with 28 tasks, revealing limitations in current methods and sim-real correlations.

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

cs.CV · 2026-05-26 · unverdicted · novelty 7.0

A selector trained once on LLaVA-665K in CLIP space selects 15% of instructions to reach 98.3% of full-data performance and generalizes to an unseen dataset and different VLMs.

Garment Particles: A 2D--3D Symmetric Garment Representation for Generation and Editing

cs.GR · 2026-05-25 · unverdicted · novelty 7.0

Garment Particles is a 5D point cloud representation jointly encoding 2D sewing patterns and 3D geometry, supporting rectified flow generation from high-level inputs and diffusion-based editing of patterns or shapes.

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

MCPO applies contrastive learning to GRPO-style RL by treating cross-domain correct rollouts as positives and incorrect ones as negatives to improve multi-domain reasoning performance in LRMs.

PEDESTRIANQA: A Benchmark for Vision-Language Models on Pedestrian Intention and Trajectory Prediction

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

PedestrianQA is a new benchmark that turns pedestrian behavior prediction into VLM question-answering with rationales, reporting improved intention classification, trajectory accuracy, and explanation quality after fine-tuning on multiple existing video datasets.

GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

GenRecon lifts object-level generative priors to scene-scale reconstruction by chunking scenes and using projection-based conditioning on multi-view features, claiming 16% better results than prior methods.

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

citing papers explorer

Showing 25 of 25 citing papers after filters.

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation cs.SD · 2025-12-30 · unverdicted · none · ref 48 · 2 links · internal anchor
PhyAVBench provides the first systematic benchmark and metric for audio-physics grounding in T2AV, I2AV, and V2A models using controlled prompt pairs and real video ground truth.
AudioMoG: Guiding Audio Generation with Mixture-of-Guidance cs.SD · 2025-09-28 · unverdicted · none · ref 59 · internal anchor
AudioMoG is a mixture-of-guidance sampling technique that combines CFG and AG signals to outperform single-guidance baselines in text-to-audio generation at equivalent speed.
Flexible Multitask Learning with Factorized Diffusion Policy cs.RO · 2025-12-26 · unverdicted · none · ref 31 · internal anchor
A factorized modular diffusion policy improves fitting of multimodal robot actions and enables flexible task adaptation without catastrophic forgetting.
MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications cs.AI · 2025-11-17 · unverdicted · none · ref 28 · internal anchor
MM-Telco creates multimodal benchmarks for telecom and demonstrates that fine-tuned LLMs and VLMs achieve significant performance gains on domain-specific tasks.
Foundation Models for Discovery and Exploration in Chemical Space physics.chem-ph · 2025-10-20 · unverdicted · none · ref 13 · internal anchor
MIST models up to 10x larger than prior work, fine-tuned on over 400 structure-property tasks, match or exceed SOTA on benchmarks and demonstrate zero-shot olfactory perception mapping consistent with hyperbolic geometry.
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning cs.CV · 2025-10-18 · unverdicted · none · ref 45 · internal anchor
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
Artificial Phantasia: Emergent Mental Imagery in Large Language Models cs.AI · 2025-09-27 · unverdicted · none · ref 68 · internal anchor
LLMs achieve higher accuracy than humans on compositional imagery tasks previously argued to require pictorial representations, supporting emergent propositional mental imagery in AI.
VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis cs.CV · 2025-09-20 · unverdicted · none · ref 29 · internal anchor
VC-Inspector introduces a lightweight open-source LMM and a controllable factual-error generation framework that achieves state-of-the-art correlation with human judgments on reference-free video caption evaluation.
Scalable Option Learning in High-Throughput Environments cs.LG · 2025-08-30 · unverdicted · none · ref 52 · internal anchor
SOL is a new hierarchical RL algorithm that reaches 35x higher throughput and outperforms flat agents when trained on 30 billion frames in NetHack while showing positive scaling.
Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters? cs.CV · 2025-07-14 · conditional · none · ref 35 · internal anchor
The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media conditions.
HeartcareGPT: A Unified Multimodal ECG Suite for Dual Signal-Image Modeling and Understanding cs.LG · 2025-06-06 · unverdicted · none · ref 18 · internal anchor
HeartcareGPT proposes Dual Stream Projection Alignment (DSPA) on a structure-aware tokenizer for unified ECG signal-image modeling, supported by Heartcare-400K dataset and Heartcare-Bench.
v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning cs.CL · 2025-05-24 · unverdicted · none · ref 5 · internal anchor
v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.
GraspVLA: a Grasping Foundation Model Pre-trained on Billion-scale Synthetic Action Data cs.RO · 2025-05-06 · unverdicted · none · ref 3 · internal anchor
GraspVLA shows that pretraining a grasping model on a billion synthetic action frames enables zero-shot open-vocabulary performance and sim-to-real transfer.
Beyond Cross-Modal Alignment: Measuring and Leveraging Modality Gap in Vision-Language Models cs.CV · 2025-02-16 · unverdicted · none · ref 27 · internal anchor
Introduces Modality Dominance Score (MDS) to measure modality-specific features in VLMs and applies training-free editing to improve bias mitigation, adversarial generation, and modality control.
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification cs.CL · 2025-12-08 · unverdicted · none · ref 29 · internal anchor
Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
Optical Context Compression Is Just (Bad) Autoencoding cs.CV · 2025-12-03 · accept · none · ref 12 · internal anchor
Vision-based optical context compression performs no better than direct autoencoding baselines like mean pooling or hierarchical encoders across compression ratios.
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning cs.LG · 2025-06-11 · unverdicted · none · ref 43 · internal anchor
BYOL-γ uses self-predictive representations to approximate successor representations, improving zero-shot combinatorial generalization in goal-conditioned behavioral cloning.
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration cs.CV · 2025-05-27 · unverdicted · none · ref 14 · internal anchor
CAAC mitigates hallucinations in LVLMs via Visual-Token Calibration and Adaptive Attention Re-Scaling guided by model confidence, showing gains on CHAIR, AMBER, and POPE especially in long-form generation.
3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography cs.CV · 2025-02-04 · unverdicted · none · ref 3 · internal anchor
A 3D self-supervised foundation model trained on over 360k head CT scans improves downstream disease classification on limited-label internal and external datasets versus scratch-trained and prior models.
Are vision-language models ready to zero-shot replace supervised classification models in agriculture? cs.CV · 2025-12-17 · unverdicted · none · ref 2 · internal anchor
Zero-shot VLMs reach at most 62% accuracy on agricultural classification tasks while supervised models like YOLO11 perform markedly higher, indicating they are not ready to replace task-specific systems.
Physics-Based Benchmarking Metrics for Multimodal Synthetic Images cs.CV · 2025-11-19 · unverdicted · none · ref 1 · internal anchor
PCMDE is a three-stage metric that extracts multimodal features, fuses components with confidence weights, and applies LLM-based physics-guided reasoning to assess synthetic image quality beyond standard scores like BLEU or CLIPScore.
CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images cs.CV · 2025-06-13 · unverdicted · none · ref 13 · internal anchor
A lightweight multi-modal CLIP pipeline predicts exact-match geographical tags on a Kaggle subset of the Geograph crowdsourced image archive by fusing image, location, and title embeddings.
Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving cs.CV · 2025-06-05 · unverdicted · none · ref 18 · internal anchor
Introduces structured NuScenes-S dataset and 0.9B FastDrive VLM claiming 20% higher decision accuracy and over 10x inference speedup versus larger unstructured VLMs.
Developing an AI Course for Synthetic Chemistry Students cs.AI · 2025-11-23 · unverdicted · none · ref 1 · internal anchor
AI4CHEM is a beginner-friendly introductory course that teaches data-driven chemistry methods to synthetic chemistry students using web-based platforms, chemistry-specific examples, and active learning without requiring prior coding skills.
Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity cs.AI · 2025-09-11 · unreviewed · ref 35 · internal anchor

Learning Transferable Visual Models From Natural Language Supervision

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer