In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

He, K · 2022

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters

cs.CV · 2026-05-04 · unverdicted · novelty 7.0

SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.

SS3D: End2End Self-Supervised 3D from Web Videos

cs.CV · 2026-04-24 · unverdicted · novelty 6.0 · 3 refs

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.

Distilling Vision Transformers for Distortion-Robust Representation Learning

cs.CV · 2026-04-24 · unverdicted · novelty 6.0

An asymmetric multi-level distillation framework lets a student ViT approximate clean-image representations from distorted inputs alone, outperforming prior methods on classification under distortions.

Exploring High-Order Self-Similarity for Video Understanding

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

Self-supervised pretraining for an iterative image size agnostic vision transformer

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.

PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.

OASIC: Occlusion-Agnostic and Severity-Informed Classification

cs.CV · 2026-04-05 · conditional · novelty 6.0

OASIC uses anomaly-based masking and severity estimation to select occlusion-matched models, improving AUC on occluded images by up to 23.7 points.

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.

Post-Processing Methods for Improving Accuracy in MRI Inpainting

cs.CV · 2025-10-17 · unverdicted · novelty 4.0

Ensembling inpainting models with median filtering, histogram matching, pixel averaging, and lightweight U-Net refinement yields more anatomically plausible and accurate inpainted MRI regions than individual baseline models.

citing papers explorer

SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters
cs.CV · 2026-05-04 · unverdicted · none · ref 22

SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV · 2026-04-24 · unverdicted · none · ref 19 · 3 links

Distilling Vision Transformers for Distortion-Robust Representation Learning
cs.CV · 2026-04-24 · unverdicted · none · ref 10

Exploring High-Order Self-Similarity for Video Understanding
cs.CV · 2026-04-22 · unverdicted · none · ref 22

Self-supervised pretraining for an iterative image size agnostic vision transformer
cs.CV · 2026-04-22 · unverdicted · none · ref 25

PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL
cs.LG · 2026-04-09 · unverdicted · none · ref 13

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
cs.CV · 2026-05-21 · unverdicted · none · ref 16

Why Do Vision Language Models Struggle To Recognize Human Emotions?
cs.CV · 2026-04-16 · unverdicted · none · ref 23

Post-Processing Methods for Improving Accuracy in MRI Inpainting
cs.CV · 2025-10-17 · unverdicted · none · ref 6
