arXiv preprint arXiv:2212.07143
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 5
roles
background 1
polarities
background 1
representative citing papers
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
Cross-dataset testing of nearest-neighbor and Mahalanobis anomaly detectors on CLIP, DINOv2, ResNet-50 and EfficientNet embeddings shows same-dataset AUC averaging 0.704 dropping to 0.499 on other datasets, with false-alarm rates around 31,931 per hour at usable operating points.
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
citing papers explorer
- $A^2$: Smaller Self-Supervised ViTs Localize Better than Larger Ones
Smaller self-supervised ViTs localize objects better via attention than larger ViTs, enabling A² to decouple localization from feature extraction for competitive performance on distribution-shifted benchmarks.
- Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Gromov-Wasserstein distance between modalities provides a stronger, inference-only predictor of final VLM performance than conventional encoder metrics, backed by theory linking it to cross-modal learnability and verified across 60+ training runs.
- Benchmark AUC Is Not Deployable Reliability: A Cross-Dataset Audit of Off-the-Shelf Features for Surveillance Video Anomaly Detection
Cross-dataset testing of nearest-neighbor and Mahalanobis anomaly detectors on CLIP, DINOv2, ResNet-50 and EfficientNet embeddings shows same-dataset AUC averaging 0.704 dropping to 0.499 on other datasets, with false-alarm rates around 31,931 per hour at usable operating points.