hub Canonical reference

CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, Yonghui Wu · 2022 · cs.CV · arXiv 2205.01917

Canonical reference. 78% of citing Pith papers cite this work as background.

41 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 41 citing papers arXiv PDF

abstract

Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with contrastive loss and captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of decoder layers to encode unimodal text representations, and cascades the remaining decoder layers which cross-attend to the image encoder for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs which predicts text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), crossmodal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 7 dataset 1 method 1

citation-polarity summary

background 7 use dataset 1 use method 1

representative citing papers

OZ-TAL: Online Zero-Shot Temporal Action Localization

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

InstrAct: Towards Action-Centric Understanding in Instructional Videos

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.

Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

cs.CV · 2025-12-19 · conditional · novelty 7.0

A new open pipeline and dataset enable training of a vision-language model for whole-slide pathology VQA that outperforms MedGemma on tissue identification, neoplasm detection, and differential diagnosis.

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

cs.CV · 2025-10-26 · unverdicted · novelty 7.0

SARVLM is the first vision-language foundation model for SAR, trained via domain transfer on a 1M image-text dataset and outperforming prior models on 13 benchmarks for retrieval, recognition, detection, and captioning.

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

cs.CV · 2025-08-06 · unverdicted · novelty 7.0

The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.

LRM: Large Reconstruction Model for Single Image to 3D

cs.CV · 2023-11-08 · conditional · novelty 7.0

LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

cs.CV · 2023-01-30 · unverdicted · novelty 7.0

BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.

LAION-5B: An open large-scale dataset for training next generation image-text models

cs.CV · 2022-10-16 · accept · novelty 7.0

LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

PaLI: A Jointly-Scaled Multilingual Language-Image Model

cs.CV · 2022-09-14 · conditional · novelty 7.0

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

cs.CV · 2022-05-23 · accept · novelty 7.0

Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

TextTeacher: What Can Language Teach About Images?

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.

Multimodal LLMs under Pairwise Modalities

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.

STELLAR: Scaling 3D Perception Large Models for Autonomous Driving

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

A new post-hoc alignment technique uses learnable anchors to capture token-level relative similarities between modalities, outperforming global alignment baselines on zero-shot classification, retrieval, and segmentation with scarce paired examples.

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

cs.AI · 2026-05-15 · unverdicted · novelty 6.0

TTE-Flash trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.

Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.

Compared to What? Baselines and Metrics for Counterfactual Prompting

cs.CL · 2026-05-01 · conditional · novelty 6.0

Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

cs.RO · 2026-01-31 · unverdicted · novelty 6.0

CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.

A cross-species neural foundation model for end-to-end speech decoding

cs.CL · 2025-11-21 · unverdicted · novelty 6.0

A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.

Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP

cs.CV · 2025-02-26 · conditional · novelty 6.0

Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.

Capabilities of Gemini Models in Medicine

cs.AI · 2024-04-29 · unverdicted · novelty 6.0

Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.

citing papers explorer

Showing 41 of 41 citing papers.

OZ-TAL: Online Zero-Shot Temporal Action Localization cs.CV · 2026-05-11 · unverdicted · none · ref 41 · internal anchor
Defines OZ-TAL task and presents a training-free VLM-based method that outperforms prior approaches for online and offline zero-shot temporal action localization on THUMOS14 and ActivityNet-1.3.
Bottleneck Tokens for Unified Multimodal Retrieval cs.LG · 2026-04-13 · unverdicted · none · ref 24 · internal anchor
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
InstrAct: Towards Action-Centric Understanding in Instructional Videos cs.CV · 2026-04-09 · unverdicted · none · ref 36 · internal anchor
InstrAction pretrains video foundation models using action-centric data filtering, hard negatives, an Action Perceiver module, DTW-Align, and Masked Action Modeling to reduce static bias and outperform prior models on a new InstrAct Bench for semantic, procedural, and retrieval tasks.
InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding cs.CV · 2026-04-09 · unverdicted · none · ref 64 · internal anchor
InstAP introduces instance-aware pre-training with a new dual-granularity dataset InstVL that improves both fine-grained instance retrieval and global video understanding over standard VLP baselines.
Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling cs.CV · 2025-12-19 · conditional · none · ref 21 · internal anchor
A new open pipeline and dataset enable training of a vision-language model for whole-slide pathology VQA that outperforms MedGemma on tissue identification, neoplasm detection, and differential diagnosis.
SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery cs.CV · 2025-10-26 · unverdicted · none · ref 45 · internal anchor
SARVLM is the first vision-language foundation model for SAR, trained via domain transfer on a 1M image-text dataset and outperforming prior models on 13 benchmarks for retrieval, recognition, detection, and captioning.
Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting cs.CV · 2025-08-06 · unverdicted · none · ref 8 · internal anchor
The paper offers a comprehensive survey and proposes a new taxonomy for continual learning strategies in VLMs and MLLMs to combat catastrophic forgetting beyond traditional methods.
LRM: Large Reconstruction Model for Single Image to 3D cs.CV · 2023-11-08 · conditional · none · ref 33 · internal anchor
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models cs.CV · 2023-01-30 · unverdicted · none · ref 12 · internal anchor
BLIP-2 bootstraps vision-language pre-training from frozen image encoders and LLMs via a lightweight two-stage Querying Transformer, delivering SOTA results with 54x fewer trainable parameters than Flamingo80B on zero-shot VQAv2.
LAION-5B: An open large-scale dataset for training next generation image-text models cs.CV · 2022-10-16 · accept · none · ref 92 · internal anchor
LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
PaLI: A Jointly-Scaled Multilingual Language-Image Model cs.CV · 2022-09-14 · conditional · none · ref 88 · internal anchor
PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding cs.CV · 2022-05-23 · accept · none · ref 80 · internal anchor
Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
TextTeacher: What Can Language Teach About Images? cs.CV · 2026-05-21 · unverdicted · none · ref 72 · internal anchor
TextTeacher uses frozen text embeddings from captions as semantic anchors to guide vision model training, improving ImageNet accuracy by up to 2.7 p.p. and transfer performance by 1.0 p.p. on average.
Multimodal LLMs under Pairwise Modalities cs.CV · 2026-05-20 · unverdicted · none · ref 69 · internal anchor
A two-stage framework enables multimodal LLMs to learn shared latent representations from pairwise modality data and achieve cross-modal generation when incorporating new modalities.
STELLAR: Scaling 3D Perception Large Models for Autonomous Driving cs.CV · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
STELLAR trains up to 500M-parameter multi-modal models on 50M driving scenes and reports empirical scaling trends plus new state-of-the-art results on the Waymo Open Dataset.
Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data cs.CV · 2026-05-16 · unverdicted · none · ref 16 · internal anchor
A new post-hoc alignment technique uses learnable anchors to capture token-level relative similarities between modalities, outperforming global alignment baselines on zero-shot classification, retrieval, and segmentation with scarce paired examples.
TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens cs.AI · 2026-05-15 · unverdicted · none · ref 32 · internal anchor
TTE-Flash trains latent think tokens with CoT generation loss and embedding tokens with contrastive loss to deliver high-performance multimodal representations without generating explicit reasoning at inference time.
What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion cs.CV · 2026-05-08 · unverdicted · none · ref 99 · internal anchor
Prior-Aligned AutoEncoders shape latent manifolds with spatial coherence, local continuity, and global semantics to improve latent diffusion, achieving SOTA gFID 1.03 on ImageNet 256x256 with up to 13x faster convergence.
Joint Semantic Token Selection and Prompt Optimization for Interpretable Prompt Learning cs.CV · 2026-05-06 · unverdicted · none · ref 18 · internal anchor
IPL alternates discrete semantic token selection using approximate submodular optimization with continuous prompt optimization to boost both interpretability and task performance in vision-language model adaptation.
Compared to What? Baselines and Metrics for Counterfactual Prompting cs.CL · 2026-05-01 · conditional · none · ref 91 · internal anchor
Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining cs.RO · 2026-01-31 · unverdicted · none · ref 56 · internal anchor
CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.
A cross-species neural foundation model for end-to-end speech decoding cs.CL · 2025-11-21 · unverdicted · none · ref 19 · internal anchor
A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP cs.CV · 2025-02-26 · conditional · none · ref 7 · internal anchor
Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.
Capabilities of Gemini Models in Medicine cs.AI · 2024-04-29 · unverdicted · none · ref 176 · internal anchor
Med-Gemini sets new records on 10 of 14 medical benchmarks including 91.1% on MedQA-USMLE, beats GPT-4V by 44.5% on multimodal tasks, and surpasses humans on medical text summarization.
Revisiting Feature Prediction for Learning Visual Representations from Video cs.CV · 2024-02-15 · conditional · none · ref 293 · internal anchor
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
CogVLM: Visual Expert for Pretrained Language Models cs.CV · 2023-11-06 · conditional · none · ref 25 · internal anchor
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
Demystifying CLIP Data cs.CV · 2023-09-28 · accept · none · ref 68 · internal anchor
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
Vision Transformers Need Registers cs.CV · 2023-09-28 · unverdicted · none · ref 279 · internal anchor
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Sigmoid Loss for Language Image Pre-Training cs.CV · 2023-03-27 · conditional · none · ref 56 · internal anchor
SigLIP replaces softmax-based contrastive loss with a simple pairwise sigmoid loss for vision-language pre-training, decoupling batch size from normalization and reaching strong zero-shot performance with limited compute.
Aligning Text-to-Image Models using Human Feedback cs.LG · 2023-02-23 · unverdicted · none · ref 23 · internal anchor
A three-stage fine-tuning process uses human ratings to train a reward model and then improves text-to-image alignment by maximizing reward-weighted likelihood.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation cs.CV · 2022-06-22 · unverdicted · none · ref 25 · internal anchor
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
Let ViT Speak: Generative Language-Image Pre-training cs.CV · 2026-05-01 · unverdicted · none · ref 77 · internal anchor
GenLIP pretrains ViTs to generate language tokens from visual tokens via autoregressive language modeling, matching strong baselines on multimodal tasks with less data.
From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media cs.CV · 2026-04-23 · unverdicted · none · ref 56 · internal anchor
VLMs recover reliable population-level trends in climate change visual discourse on social media even when per-image accuracy is only moderate.
DAPL: Integration of Positive and Negative Descriptions in Text-Based Person Search cs.CV · 2024-05-13 · unverdicted · none · ref 6 · internal anchor
DAPL integrates positive and negative descriptions using Dual Image-Attribute Contrastive learning, Sensitive Image-Attribute Matching, and Dynamic Token-wise Similarity loss to improve text-based person search accuracy.
Hallucination of Multimodal Large Language Models: A Survey cs.CV · 2024-04-29 · accept · none · ref 194 · internal anchor
The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks cs.CV · 2023-12-21 · unverdicted · none · ref 170 · internal anchor
InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
InternVideo: General Video Foundation Models via Generative and Discriminative Learning cs.CV · 2022-12-06 · unverdicted · none · ref 7 · internal anchor
InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.
GIT: A Generative Image-to-text Transformer for Vision and Language cs.CV · 2022-05-27 · unverdicted · none · ref 33 · internal anchor
GIT achieves new state-of-the-art results on 12 vision-language benchmarks, including surpassing human performance on TextCaps, via a simplified single-encoder single-decoder transformer scaled on large pre-training data.
Character-Centered Dialogue Generation from Scene-Level Prompts cs.CV · 2025-05-22 · unverdicted · none · ref 66 · internal anchor
A training-free framework generates expressive, character-grounded dialogue and speech from scene prompts using vision-language encoders, LLMs, and a recursive narrative memory bank for cross-scene consistency.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models cs.CV · 2024-02-27 · unverdicted · none · ref 79 · internal anchor
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation cs.AI · 2026-05-20 · unreviewed · ref 40 · internal anchor

CoCa: Contrastive Captioners are Image-Text Foundation Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer