In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

fields: cs.CV 9 · cs.LG 1

years: 2026 9 · 2025 1

polarities: background 2

representative citing papers

SS3D: End2End Self-Supervised 3D from Web Videos

cs.CV · 2026-04-24 · unverdicted · novelty 6.0 · 3 refs

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.

Why Do Vision Language Models Struggle To Recognize Human Emotions?

cs.CV · 2026-04-16 · unverdicted · novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.

Post-Processing Methods for Improving Accuracy in MRI Inpainting

cs.CV · 2025-10-17 · unverdicted · novelty 4.0

Ensembling inpainting models with median filtering, histogram matching, pixel averaging, and lightweight U-Net refinement yields more anatomically plausible and accurate inpainted MRI regions than individual baseline models.

citing papers explorer

Showing 9 of 9 citing papers after filters.

  • SpectraDINO: Bridging the Spectral Gap in Vision Foundation Models via Lightweight Adapters cs.CV · 2026-05-04 · unverdicted · none · ref 22

    SpectraDINO adapts frozen DINOv2 backbones to multispectral data via per-modality adapters and staged distillation with cosine, contrastive, patch, and neighborhood-structure losses, achieving SOTA on object detection and segmentation benchmarks.

  • SS3D: End2End Self-Supervised 3D from Web Videos cs.CV · 2026-04-24 · unverdicted · none · ref 19 · 3 links

    SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior self-supervised baselines.

  • Distilling Vision Transformers for Distortion-Robust Representation Learning cs.CV · 2026-04-24 · unverdicted · none · ref 10

    An asymmetric multi-level distillation framework lets a student ViT approximate clean-image representations from distorted inputs alone, outperforming prior methods on classification under distortions.

  • Exploring High-Order Self-Similarity for Video Understanding cs.CV · 2026-04-22 · unverdicted · none · ref 22

    The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

  • Self-supervised pretraining for an iterative image size agnostic vision transformer cs.CV · 2026-04-22 · unverdicted · none · ref 25

    A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.

  • PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL cs.LG · 2026-04-09 · unverdicted · none · ref 13

    PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.

  • Accelerating Vision Foundation Models with Drop-in Depthwise Convolution cs.CV · 2026-05-21 · unverdicted · none · ref 16

    Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.

  • Why Do Vision Language Models Struggle To Recognize Human Emotions? cs.CV · 2026-04-16 · unverdicted · none · ref 23

    VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using language summaries of skipped frames is proposed to mitigate this.

  • Post-Processing Methods for Improving Accuracy in MRI Inpainting cs.CV · 2025-10-17 · unverdicted · none · ref 6

    Ensembling inpainting models with median filtering, histogram matching, pixel averaging, and lightweight U-Net refinement yields more anatomically plausible and accurate inpainted MRI regions than individual baseline models.