hub Mixed citations

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao · 2023 · cs.CV · arXiv 2303.15389

Mixed citation behavior. Most common role is background (65%).

74 Pith papers citing it

Background 65% of classified citations

abstract

Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.

citation-role summary

background 13 · method 4 · baseline 1 · dataset 1 · other 1
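The 65% background share quoted at the top of the page appears to follow from these counts. A minimal sketch, assuming the share is simply each role's count divided by the total number of classified citations (the page does not state its formula, and `role_shares` is a hypothetical helper name):

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 13, "method": 4, "baseline": 1, "dataset": 1, "other": 1}


def role_shares(counts):
    """Return each role's share of classified citations as a whole percentage.

    Assumption: share = count / total classified; the hub may compute it differently.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
top_role = max(role_counts, key=role_counts.get)
print(top_role, shares[top_role])  # background 65
```

Under that assumption, 13 of the 20 classified citations are background, which matches the 65% figure.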

citation-polarity summary

background 13 · use method 4 · baseline 1 · unclear 1 · use dataset 1

claims ledger

background [116] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025. [117] JD Open Source. Joyai-image: Awakening spatial intelligence in unified multimodal understanding and generation, 2026. URL https://github.com/jd-opensource/JoyAI-Image. [118] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao
method Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token
background ford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 5 [46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4 [47] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv prep
background Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937-13949, 2021. [37] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at sc
background 2 Related Work In this section, we first review existing 3D representation learning methods based on vision-language pretraining, and then summarize commonly used 3D scene datasets for pretraining and the evaluation protocols for vision-language models. 3D Vision-Language Pretraining. 3D vision-language pretraining aligns a 3D encoder with pretrained CLIP models [17,43,47,49] and has become a common paradigm for 3D representation learning. Most previous works adopt point clouds as the input mod
baseline 224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68

representative citing papers

MolSight: Molecular Property Prediction with Images

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.

Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

cs.CV · 2026-04-14 · unverdicted · novelty 8.0

MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.

Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings

cs.IR · 2026-05-13 · unverdicted · novelty 7.0

Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.

Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.

BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering

cs.CL · 2026-04-24 · unverdicted · novelty 7.0

BERAG applies Bayesian ensemble weighting of individual documents via token-by-token posterior updates in retrieval-augmented generation, yielding gains on knowledge-based visual QA tasks.

OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

cs.CV · 2026-03-29 · unverdicted · novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis

cs.CV · 2026-03-23 · unverdicted · novelty 7.0

SteelDefectX is a new multi-form vision-language dataset and benchmark for analyzing steel surface defects using 7,778 images across 25 categories.

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

cs.CV · 2026-03-10 · unverdicted · novelty 7.0

WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.

Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering

cs.RO · 2026-01-30 · unverdicted · novelty 7.0

NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

cs.CV · 2025-11-28 · conditional · novelty 7.0

PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.

An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval

cs.CV · 2025-03-28 · unverdicted · novelty 7.0

Empirical study of a fully synthetic data generation pipeline for text-based person retrieval that tests its use as a replacement or augmentation for real data across scenarios.

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

cs.CV · 2024-10-17 · unverdicted · novelty 7.0

Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

cs.CV · 2024-10-07 · conditional · novelty 7.0

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

cs.CV · 2024-06-24 · unverdicted · novelty 7.0

Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.

MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

cs.CV · 2024-06-13 · conditional · novelty 7.0

MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

VideoChat: Chat-Centric Video Understanding

cs.CV · 2023-05-10 · conditional · novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.

WOW-Seg: A Word-free Open World Segmentation Model

cs.CV · 2026-05-16 · conditional · novelty 6.0

WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.

citing papers explorer

Showing 50 of 74 citing papers.

MolSight: Molecular Property Prediction with Images cs.CV · 2026-05-11 · unverdicted · none · ref 30 · internal anchor
Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.
Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks cs.CV · 2026-04-14 · unverdicted · none · ref 36 · internal anchor
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding cs.LG · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Same Image, Different Meanings: Toward Retrieval of Context-Dependent Meanings cs.IR · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection cs.CV · 2026-04-25 · unverdicted · none · ref 29 · internal anchor
Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.
BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering cs.CL · 2026-04-24 · unverdicted · none · ref 25 · internal anchor
BERAG applies Bayesian ensemble weighting of individual documents via token-by-token posterior updates in retrieval-augmented generation, yielding gains on knowledge-based visual QA tasks.
OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance cs.CV · 2026-04-09 · unverdicted · none · ref 44 · internal anchor
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 43 · internal anchor
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models cs.CV · 2026-03-29 · unverdicted · none · ref 30 · internal anchor
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis cs.CV · 2026-03-23 · unverdicted · none · ref 34 · internal anchor
SteelDefectX is a new multi-form vision-language dataset and benchmark for analyzing steel surface defects using 7,778 images across 25 categories.
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition cs.CV · 2026-03-10 · unverdicted · none · ref 36 · internal anchor
WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.
Mitigating Error Accumulation in Continuous Navigation via Memory-Augmented Kalman Filtering cs.RO · 2026-01-30 · unverdicted · none · ref 15 · internal anchor
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
PowerCLIP: Powerset Alignment for Contrastive Pre-Training cs.CV · 2025-11-28 · conditional · none · ref 53 · internal anchor
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval cs.CV · 2025-03-28 · unverdicted · none · ref 44 · internal anchor
Empirical study of a fully synthetic data generation pipeline for text-based person retrieval that tests its use as a replacement or augmentation for real data across scenarios.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 74 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks cs.CV · 2024-10-07 · conditional · none · ref 31 · internal anchor
VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs cs.CV · 2024-06-24 · unverdicted · none · ref 123 · internal anchor
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding cs.CV · 2024-06-13 · conditional · none · ref 57 · internal anchor
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 35 · internal anchor
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
VideoChat: Chat-Centric Video Understanding cs.CV · 2023-05-10 · conditional · none · ref 39 · internal anchor
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register cs.CV · 2026-05-19 · unverdicted · none · ref 30 · internal anchor
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
WOW-Seg: A Word-free Open World Segmentation Model cs.CV · 2026-05-16 · conditional · none · ref 19 · internal anchor
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 106 · internal anchor
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs? cs.CV · 2026-05-09 · unverdicted · none · ref 37 · internal anchor
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers cs.CL · 2026-05-08 · unverdicted · none · ref 13 · 2 links · internal anchor
GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text performance unchanged.
Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics cs.CV · 2026-04-27 · conditional · none · ref 12 · internal anchor
CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted trade-off in original task performance.
Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts cs.CV · 2026-04-23 · unverdicted · none · ref 22 · internal anchor
Cross-AUC exposes large robustness drops in existing face forgery detectors across datasets, while the SFAM model with semantic alignment and region-specific experts delivers better performance on public benchmarks.
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment cs.CV · 2026-04-23 · unverdicted · none · ref 54 · internal anchor
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
Exploring High-Order Self-Similarity for Video Understanding cs.CV · 2026-04-22 · unverdicted · none · ref 73 · internal anchor
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
Cross-Attentive Multiview Fusion of Vision-Language Embeddings cs.CV · 2026-04-14 · unverdicted · none · ref 34 · internal anchor
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning cs.CV · 2026-04-14 · unverdicted · none · ref 32 · internal anchor
Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models cs.CV · 2026-04-14 · unverdicted · none · ref 62 · internal anchor
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment cs.CV · 2026-04-13 · unverdicted · none · ref 48 · internal anchor
TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering cs.CV · 2026-04-07 · unverdicted · none · ref 29 · internal anchor
WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning cs.CV · 2026-04-03 · unverdicted · none · ref 57 · internal anchor
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
Contrastive Language-Colored Pointmap Pretraining for Unified 3D Scene Understanding cs.CV · 2026-04-02 · unverdicted · none · ref 47 · internal anchor
UniScene3D learns unified 3D scene representations from colored pointmaps using contrastive CLIP pretraining plus cross-view geometric and grounded view alignments, achieving state-of-the-art results on viewpoint grounding, scene retrieval, classification, and 3D VQA.
Vision Transformers Need More Than Registers cs.CV · 2026-02-25 · unverdicted · none · ref 32 · internal anchor
ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models cs.CV · 2026-02-02 · unverdicted · none · ref 23 · internal anchor
ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining cs.RO · 2026-01-31 · unverdicted · none · ref 44 · internal anchor
CLAMP pretrains 3D multi-view encoders with contrastive learning on point clouds and actions, then initializes diffusion policies for more sample-efficient fine-tuning on robotic tasks.
R3G: A Reasoning--Retrieval--Reranking Framework for Vision-Centric Answer Generation cs.CV · 2026-01-25 · unverdicted · none · ref 26 · internal anchor
R3G improves vision-centric visual question answering by generating reasoning plans to guide two-stage image retrieval and reranking, achieving state-of-the-art results on MRAG-Bench across six MLLM backbones.
Calibrated Multimodal Representation Learning with Missing Modalities cs.CV · 2025-11-15 · unverdicted · none · ref 82 · internal anchor
CalMRL mitigates anchor shift in multimodal representation learning by calibrating incomplete alignments through representation-level imputation of missing modalities using priors and a bi-step optimization with closed-form shared latent posteriors.
VFM-VAE: Vision Foundation Models Can Be Good Tokenizers for Latent Diffusion Models cs.CV · 2025-10-21 · unverdicted · none · ref 18 · internal anchor
VFM-VAE uses a frozen VFM directly as LDM tokenizer via a custom decoder, reaching gFID 2.22 in 80 epochs and 1.62 after 640 epochs.
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering cs.CV · 2025-08-31 · unverdicted · none · ref 38 · internal anchor
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning cs.CV · 2025-07-18 · conditional · none · ref 84 · internal anchor
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation cs.CV · 2025-04-21 · unverdicted · none · ref 38 · internal anchor
Introduces FG-BMK benchmark and evaluates twelve LVLMs on fine-grained semantic recognition and feature tasks, identifying influences from training paradigms and perturbation sensitivity.
Perception Encoder: The best visual embeddings are not at the output of the network cs.CV · 2025-04-17 · unverdicted · none · ref 129 · internal anchor
Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 47 · internal anchor
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP cs.CV · 2025-02-26 · conditional · none · ref 83 · internal anchor
Grad-ECLIP produces gradient-based visual and textual explanation heatmaps for CLIP by applying channel and spatial weights to token features instead of relying on sparse self-attention maps.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning cs.CV · 2024-12-18 · unverdicted · none · ref 187 · internal anchor
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 82 · internal anchor
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
