SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.
hub
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
10 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
An asymmetric multi-level distillation framework lets a student ViT approximate clean-image representations from distorted inputs alone, outperforming prior methods on classification under distortions.
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.
OASIC uses anomaly-based masking and severity estimation to select occlusion-matched models, improving AUC on occluded images by up to 23.7 points.
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.
Ensembling inpainting models with median filtering, histogram matching, pixel averaging, and lightweight U-Net refinement yields more anatomically plausible and accurate inpainted MRI regions than individual baseline models.
citing papers explorer
-
SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.
-
SS3D: End2End Self-Supervised 3D from Web Videos
SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.
-
Distilling Vision Transformers for Distortion-Robust Representation Learning
An asymmetric multi-level distillation framework lets a student ViT approximate clean-image representations from distorted inputs alone, outperforming prior methods on classification under distortions.
-
Exploring High-Order Self-Similarity for Video Understanding
The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
-
Self-supervised pretraining for an iterative image size agnostic vision transformer
A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
-
PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL
PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.
-
OASIC: Occlusion-Agnostic and Severity-Informed Classification
OASIC uses anomaly-based masking and severity estimation to select occlusion-matched models, improving AUC on occluded images by up to 23.7 points.
-
Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
-
Why Do Vision Language Models Struggle To Recognize Human Emotions?
VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.
-
Post-Processing Methods for Improving Accuracy in MRI Inpainting
Ensembling inpainting models with median filtering, histogram matching, pixel averaging, and lightweight U-Net refinement yields more anatomically plausible and accurate inpainted MRI regions than individual baseline models.