ICML

Learning transferable visual models from natural language supervision

7 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

SwordBench: Evaluating Orthogonality of Steering Image Representations

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

SwordBench benchmarks steering methods for concept removal in vision models and shows that linear SVMs achieve strong separability and orthogonality but incur collateral damage, while sparse autoencoders often perform better and no method reaches perfect steering even in simple cases.

Disparities In Negation Understanding Across Languages In Vision-Language Models

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

VLMs exhibit affirmation bias that varies by language, with a new multilingual benchmark showing CLIP at or below chance on non-Latin scripts, MultiCLIP most uniform, and SpaceVLM corrections effective unevenly across typologies.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

DreamFusion: Text-to-3D using 2D Diffusion

cs.CV · 2022-09-29 · accept · novelty 7.0

Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

MVDream: Multi-view Diffusion for 3D Generation

cs.CV · 2023-08-31 · conditional · novelty 6.0

MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

citing papers explorer

Showing 7 of 7 citing papers.

SwordBench: Evaluating Orthogonality of Steering Image Representations
cs.CV · 2026-05-10 · unverdicted · none · ref 77

Disparities In Negation Understanding Across Languages In Vision-Language Models
cs.CL · 2026-04-21 · unverdicted · none · ref 3

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
cs.CV · 2026-04-20 · unverdicted · none · ref 92

DreamFusion: Text-to-3D using 2D Diffusion
cs.CV · 2022-09-29 · accept · none · ref 28

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
cs.CV · 2026-05-13 · unverdicted · none · ref 5

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
cs.CV · 2024-12-18 · unverdicted · none · ref 182

MVDream: Multi-view Diffusion for 3D Generation
cs.CV · 2023-08-31 · conditional · none · ref 160
