Susskind, and Alaaeldin El-Nouby
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 5
roles: baseline 2
polarities: baseline 2
citing papers explorer
- AVA-Bench: Atomic Visual Ability Benchmark for Vision Foundation Models
  AVA-Bench evaluates vision foundation models by disentangling 14 atomic visual abilities with aligned training-test distributions to reveal precise ability fingerprints.
- V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
  V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.
- Seeing is Understanding: Unlocking Causal Attention into Modality-Mutual Attention for Multimodal LLMs
  Modality-mutual attention (MMA) is introduced to replace causal attention in MLLMs, enabling mutual attention between image and text tokens and claiming SOTA results on 12 multimodal benchmarks with no extra parameters.
- Towards Generalizable Deepfake Image Detection with Vision Transformers
  Ensemble of vision transformers reaches 96.77% AUC and 9% EER on DF-Wild deepfake test set, outperforming the prior Effort baseline by 7% AUC and 8% EER.
- SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
  SigLIP 2 models trained with a unified recipe of captioning, self-supervised losses, and curated diverse data outperform prior SigLIP versions on classification, retrieval, localization, dense prediction, and multilingual understanding at scales from 86M to 1B parameters.