EVA-CLIP: Improved Training Techniques for CLIP at Scale
Mixed citation behavior. Most common role is background (65%).
abstract
Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion seen samples achieves 82.0 zero-shot top-1 accuracy on ImageNet-1K val. A smaller EVA-02-CLIP-L/14+ with only 430 million parameters and 6 billion seen samples achieves 80.4 zero-shot top-1 accuracy on ImageNet-1K val. To facilitate open access and open research, we release the complete suite of EVA-CLIP to the community at https://github.com/baaivision/EVA/tree/master/EVA-CLIP.
claims ledger
- abstract Contrastive language-image pre-training, CLIP for short, has gained increasing attention for its potential in various scenarios. In this paper, we propose EVA-CLIP, a series of models that significantly improve the efficiency and effectiveness of CLIP training. Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs. Notably, our largest 5.0B-parameter EVA-02-CLIP-E/14+ with only 9 billion se
- background [116] Wei Song, Yuran Wang, Zijia Song, Yadong Li, Haoze Sun, Weipeng Chen, Zenan Zhou, Jianhua Xu, Jiaqi Wang, and Kaicheng Yu. Dualtoken: Towards unifying visual understanding and generation with dual visual vocabularies. arXiv preprint arXiv:2503.14324, 2025. [117] JD Open Source. Joyai-image: Awakening spatial intelligence in unified multimodal understanding and generation, 2026. URL https://github.com/jd-opensource/JoyAI-Image. [118] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao
- method Figure 2: Architecture and training paradigm of VideoChat-Embed. It is built on BLIP-2 [18] and StableVicuna [10]. The training contains two-stage alignment and instruction tuning. 3.2.1 Architecture In this paper, we instantiate the VideoChat-Embed based on BLIP-2 [18] and StableVicuna [10] (Figure 2a). Concretely, we incorporate the pretrained ViT-G [39] with Global Multi-Head Relation Aggregator (GMHRA), a temporal modeling module used in InternVideo [46] and UniFormerV2 [20]. For the token
- background ford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. 5 [46] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 2, 3, 4 [47] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv prep
- background Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748-8763. PMLR, 2021. [36] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937-13949, 2021. [37] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at sc
- background 2 Related Work In this section, we first review existing 3D representation learning methods based on vision-language pretraining, and then summarize commonly used 3D scene datasets for pretraining and the evaluation protocols for vision-language models. 3D Vision-Language Pretraining. 3D vision-language pretraining aligns a 3D encoder with pretrained CLIP models [17,43,47,49] and has become a common paradigm for 3D representation learning. Most previous works adopt point clouds as the input mod
- baseline 224 49 MetaCLIP [66] 67.7 59.6 - 52.8 - 46.6 - 72.9 - - - 256 64 OpenCLIP [27] 72.8 64.8 - 59.6 - 39.9 57.9 64.9 84.8 - - SigLIP 2 74.0 66.9 81.4 66.1 66.6 47.2 63.7 75.5 89.3 38.3 49.0 B/16 224 196 CLIP [50] 68.3 61.9 - 55.3 - 33.1 52.4 62.1 81.9 - - OpenCLIP [27] 70.2 62.3 - 56.0 - 42.3 59.4 69.8 86.3 - - MetaCLIP [66] 72.4 65.1 - 60.0 - 48.9 - 77.1 - - - EVA-CLIP [57] 74.7 67.0 - 62.3 - 42.2 58.7 71.2 85.7 - - SigLIP [71] 76.2 69.5 82.8 70.7 69.9 47.2 64.5 77.9 89.6 22.4 29.3 DFN [19] 76.2 68
representative citing papers
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
T-VSS is a lightweight test-time defense that steers attacked visual features in VLMs using sample-specific low-rank subspaces and reliability-weighted entropy minimization to improve robustness.
ImageAuditor is the first MIA for IRAG that achieves over 80% AUROC with four queries by using reward-guided policy optimization for cross-modal retrieval and task-specific prompting for signal extraction.
GeoFlowVLM learns joint distributions of l2-normalized VLM embeddings on the product hypersphere via Riemannian flow matching to expose both aleatoric and epistemic uncertainty through derived entropy and typicality scores.
Image meanings grow more context-dependent with semantic abstraction, requiring narrative grounding for accurate retrieval at higher levels.
Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.
BERAG applies Bayesian ensemble weighting of individual documents via token-by-token posterior updates in retrieval-augmented generation, yielding gains on knowledge-based visual QA tasks.
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
SteelDefectX is a new multi-form vision-language dataset and benchmark for analyzing steel surface defects using 7,778 images across 25 categories.
NeuroKalman mitigates state drift in vision-language UAV navigation by using memory-augmented Kalman filtering where attention retrieves historical anchors to correct predictions without gradient updates.
PowerCLIP improves CLIP-style models by exhaustively aligning powersets of image regions to textual parse trees via efficient non-linear aggregators that approximate the full combinatorial loss.
Empirical study of a fully synthetic data generation pipeline for text-based person retrieval that tests its use as a replacement or augmentation for real data across scenarios.
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.
Cambrian-1 is a vision-centric multimodal LLM family that evaluates over 20 vision encoders, introduces CV-Bench and the Spatial Vision Aggregator, and releases open models, code, and data achieving strong performance on visual grounding tasks.
MuirBench is a new benchmark showing that top multimodal LLMs struggle with robust multi-image understanding, with GPT-4o at 68% and open-source models below 33% accuracy.
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
TPB is an AdaBoost-style ensemble method for text prompts in VLMs that improves few-shot accuracy by targeting hard examples and maintains gains across model transfers.
CriterionSI infers clustering criteria from sequential user drags via LLMs to produce progressively aligned image cluster layouts.
ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.
citing papers explorer
- MolSight: Molecular Property Prediction with Images
Vision encoders on single 2D molecular images with a chemistry-informed curriculum achieve top or near-top results on 10 property prediction tasks at 80x lower FLOPs than multi-modal competitors.
- Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
MSLA is the first physically deployable attack that uses adversarial lighting to break semantic alignment in VLMs such as CLIP, LLaVA, and BLIP, causing classification failures and hallucinations in real scenes.
- T-VSS: Test-Time Visual Subspace Steering for Adversarial Robustness of Vision-Language Models
T-VSS is a lightweight test-time defense that steers attacked visual features in VLMs using sample-specific low-rank subspaces and reliability-weighted entropy minimization to improve robustness.
- Exploring Hierarchical Consistency and Unbiased Objectness for Open-Vocabulary Object Detection
Hierarchical confidence calibration and LoCLIP adaptation improve pseudo-label quality for open-vocabulary object detection, achieving new state-of-the-art results on COCO and LVIS benchmarks.
- OVS-DINO: Open-Vocabulary Segmentation via Structure-Aligned SAM-DINO with Language Guidance
OVS-DINO structurally aligns DINO with SAM to revitalize attenuated boundary features, achieving SOTA gains of 2.1% average and 6.3% on Cityscapes in weakly-supervised open-vocabulary segmentation.
- Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
- When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
- SteelDefectX: A Multi-Form Vision-Language Dataset and Benchmark for Steel Surface Defect Analysis
SteelDefectX is a new multi-form vision-language dataset and benchmark for analyzing steel surface defects using 7,778 images across 25 categories.
- ProMSA: Progressive Multimodal Search Agents for Knowledge-Based Visual Question Answering
ProMSA is a progressive multimodal search agent for KB-VQA that iteratively selects search tools under budgets, trained via rejection-sampling SFT then TN-GSPO RL, reporting gains on E-VQA and InfoSeek over RAG baselines.
- Language-Instructed Vision Embeddings for Controllable and Generalizable Perception
LIVE uses language to generate task-centric vision embeddings at inference, reducing hallucinations by 34 points on MMVP, outperforming larger VLMs on VQA, and generalizing to unseen tasks.
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: From Evaluation to Diagnosis
FG-BMK benchmark shows current LVLMs are inadequate at fine-grained recognition due to bottlenecks in visual representations, semantic grounding, modality alignment, and category knowledge.
- CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection
CL-CLIP uses CLIP image-text cost volumes to create class-specific pathways processed by a multi-expert RoI head, improving continual object detection on VOC and COCO over the F-ViT baseline.
- The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP
LRA-EE early exit with spatio-semantic patch averaging, multi-feature gating, and layer-adaptive thresholds bypasses quantization noise in CLIP, cutting FLOPs 13.4% while raising ImageNet-1K zero-shot Top-1 from 58.72% to 61.16%.
- UniRefiner: Teaching Pre-trained ViTs to Self-Dispose Dross via Contrastive Register
UniRefiner uses contrastive registers and a dual alignment objective to remove three categories of spurious tokens from pre-trained ViTs, yielding up to 9.4% mIoU gains on ADE20K and 22% zero-shot segmentation improvements.
- WOW-Seg: A Word-free Open World Segmentation Model
WOW-Seg proposes a word-free open-world segmentation model using Mask2Token and Cascade Attention Mask modules, reporting 89.7 semantic similarity and 82.4 semantic IoU on LVIS with one-eighth the parameters of prior SOTA plus a new 7,662-class benchmark.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
- Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics
CLIP models understand 360-degree textual semantics via explicit identifiers but show limited comprehension of visual semantics under horizontal circular shifts, which a LoRA fine-tuning approach improves with a noted trade-off in original task performance.
- Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts
Cross-AUC exposes large robustness drops in existing face forgery detectors across datasets, while the SFAM model with semantic alignment and region-specific experts delivers better performance on public benchmarks.
- MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
- Exploring High-Order Self-Similarity for Video Understanding
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
- Cross-Attentive Multiview Fusion of Vision-Language Embeddings
CAMFusion fuses multiview 2D vision-language embeddings via cross-attention and multiview consistency self-supervision to produce better 3D semantic and instance representations, outperforming averaging and reaching SOTA on benchmarks including zero-shot out-of-domain cases.
- Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning
Dual-modality anchors from text descriptions and test-time image statistics filter views and ensemble predictions to improve test-time prompt tuning, achieving SOTA on 15 datasets.
- Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
- TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
TIPSv2 improves dense patch-text alignment in vision-language pretraining through distillation and iBOT++ modifications, yielding models on par with or better than recent baselines on 9 tasks across 20 datasets.
- WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.
- CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
- Vision Transformers Need More Than Registers
ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.
- TuringViT: Making SOTA Vision Transformers Accessible to All
TuringViT claims a new ViT design with linear attention and curated data that matches SOTA performance using 10% of typical pretraining data while supporting dynamic resolutions and improving VLM integration.
- The Professor: Multi-Teacher Unsupervised Prompt Distillation for Vision-Language Models
Multi-teacher confidence-weighted ensembling in unsupervised prompt distillation raises average harmonic mean from 87.52 to 89.28 across four base-to-novel datasets, with largest gains on domain-shifted EuroSAT.
- Multimodal Concept Bottleneck Models
MM-CBM adds dual concept bottleneck layers to CLIP to enable interpretable multimodal vision tasks, reporting up to 51.26% average accuracy gains over prior CBMs across four benchmarks.
- One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling
OSTB estimates a consensus sample-to-class transport plan from multiple frozen VLMs to perform model selection by reliability ranking, target adaptation via transport-conditioned classifiers, and ensembling via reliability-aware integration.
- FineGen: A VLM-based Multi-Agent Framework for Fine-Grained Image-Text Dataset Construction
FineGen uses a VLM multi-agent pipeline to build FineGen-100K, a 147k-sample hierarchical dataset of attribute-specific hard negatives, reporting 96.7% validity and +14.4% downstream accuracy gain on hard samples in FG-OVD.
- SIGMA: Bridging Structural and Distributional Gaps for Vision Foundation Model Adaptation
SIGMA proposes a lightweight PEFT adapter consisting of scale-adaptive fusion and semantic modulation to bridge structural and distributional gaps when adapting vision foundation models to dense tasks.
- What Matters for Grocery Product Retrieval with Open Source Vision Language Models
Systematic zero-shot benchmarking of open-source VLMs on multimodal grocery product retrieval shows data quality outperforms scale, introduces semantic power density as an efficiency metric, and identifies a persistent top-1 precision gap.
- SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.
- ReasonCLIP-58M: Visually Grounded Commonsense Reasoning Supervision for CLIP
ReasonCLIP-58M applies continual pretraining with visually grounded reasoning captions on 58M examples to improve CLIP-style models on commonsense and compositional reasoning tasks.
- Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
Rank-aware selective fusion via attention-based gating and decoupled presence/salience heads with unsupervised domain adaptation outperforms baselines and ranks 2nd on the BlEmoRE challenge for blended emotion recognition.
- Boosting Robust AIGI Detection with LoRA-based Pairwise Training
LoRA-based pairwise training with distortion and size simulations boosts robust AIGI detection under severe distortions, placing third in the NTIRE challenge.
- NTIRE 2026 Challenge on Robust AI-Generated Image Detection in the Wild
The NTIRE 2026 challenge provides a dataset of over 294,000 real and AI-generated images with 36 transformations to benchmark robust detection models.
- RGB-Pointmap Pretraining for Unified 3D Scene Understanding
- WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
- Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
- R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation