Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever · 2021

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

FastOCR dynamically selects a small subset of visual tokens per decoding step using focal-guided pruning and cross-step reuse, retaining 98% accuracy on Qwen2.5-VL while attending to only 5% of tokens and cutting attention latency by 3x.

AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT

cs.CV · 2026-05-02 · conditional · novelty 5.0

AttnRouter routes edits to optimal attention operations per category on MMDiT, raising the CLIP-T+DINO-I composite score 6.4% above baseline while an automatic classifier recovers 98% of the gain.

citing papers explorer

Showing 2 of 2 citing papers.

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

cs.CV · 2026-05-17 · unverdicted · none · ref 28

AttnRouter: Per-Category Attention Routing for Training-Free Image Editing on MMDiT

cs.CV · 2026-05-02 · conditional · none · ref 21
