Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, Tom Duerig · 2021

8 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Test-Time Distillation for Continual Model Adaptation

cs.CV · 2025-06-03 · conditional · novelty 7.0

CoDiRe blends VLM and target model predictions via MSP-based weighting and Optimal Transport rectification to enable stable continual test-time adaptation, outperforming CoTTA by 10.55% on ImageNet-C at 48% of the compute cost.

Vision Foundation Models as Generalist Tokenizers for Image Generation

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

VFMTok builds a generalist image tokenizer on frozen VFMs using adaptive quantization and semantic alignment, delivering gFID 1.36 for autoregressive and 1.25 for continuous generation on ImageNet with 3x faster convergence.

Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

cs.CR · 2026-04-07 · unverdicted · novelty 6.0

Introduces a text-guided backdoor attack using common textual words as triggers and visual perturbations for stealthy, adjustable control on multimodal pretrained models.

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

cs.RO · 2026-01-31 · unverdicted · novelty 6.0

CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation

cs.RO · 2025-06-17 · unverdicted · novelty 6.0

GAF creates 4D dynamic scene models by adding motion to 3D Gaussians, enabling better reconstruction and 7.3% higher success in robotic tasks.

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings

cs.LG · 2025-11-26 · conditional · novelty 5.0

Generative purification with consensus aggregation reduces adversarial illusion attack success rates to near zero on ImageBind while improving alignment on both clean and attacked inputs.

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models

cs.LG · 2025-07-01 · unverdicted · novelty 5.0

JAM aligns frozen vision and language models via joint autoencoders and multimodal Spread Loss, reliably inducing cross-modal alignment across layer depths, objectives, and model scales.

Let ViT Speak: Generative Language-Image Pre-training

cs.CV · 2026-05-01

citing papers explorer

Test-Time Distillation for Continual Model Adaptation
cs.CV · 2025-06-03 · conditional · none · ref 16

Vision Foundation Models as Generalist Tokenizers for Image Generation
cs.CV · 2026-05-18 · unverdicted · none · ref 35

Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models
cs.CR · 2026-04-07 · unverdicted · none · ref 14

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
cs.RO · 2026-01-31 · unverdicted · none · ref 26

GAF: Gaussian Action Field as a 4D Representation for Dynamic World Modeling in Robotic Manipulation
cs.RO · 2025-06-17 · unverdicted · none · ref 28

Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
cs.LG · 2025-11-26 · conditional · none · ref 12

Escaping Plato's Cave: JAM for Aligning Independently Trained Vision and Language Models
cs.LG · 2025-07-01 · unverdicted · none · ref 16

Let ViT Speak: Generative Language-Image Pre-training
cs.CV · 2026-05-01 · unreviewed · ref 31
