arXiv preprint arXiv:2104.10972
12 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: dataset (2)
Citation polarities: use dataset (2)
citing papers explorer
- CanViT: Toward Active-Vision Foundation Models
  CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
- Weierstrass Positional Encoding for Vision Transformers
  WePE encodes 2D patch positions in Vision Transformers via Weierstrass elliptic functions on the complex plane to exploit double periodicity and derive relative positions algebraically.
- Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
  Exploiting linear structure in VLM embeddings, a synthetic-data pre-training method yields background-invariant representations that exceed 90% worst-group accuracy on Waterbirds even under 100% spurious correlation with no minority examples in training.
- Parameter-Efficient Adaptation of Pre-Trained Vision Foundation Models for Active and Passive Seismic Data Denoising
  Adapting vision foundation models with LoRA and kurtosis-guided unsupervised test-time adaptation matches or exceeds domain-specific models for seismic denoising across multiple sites and unseen data.
- Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
  Using contrastive examples with vision-language models and a new CLIP-based scoring method called CSP produces more faithful and granular neuron labels than prior activation-only approaches.
- StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods
  StableTTA improves ImageNet-1K accuracy across 71 vision models by stabilizing logit aggregation under coherent-batch inference and enabling efficient single-forward-pass adaptation.
- MePo: Meta Post-Refinement for Rehearsal-Free General Continual Learning
  MePo refines pretrained backbones via meta-learning on constructed pseudo tasks and initializes a meta covariance matrix to enable robust second-order alignment, yielding 12-15% gains on CIFAR-100, ImageNet-R and CUB-200 in rehearsal-free GCL settings.
- LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
  LanguageBind aligns video, infrared, depth, and audio to a frozen language encoder via contrastive learning on the new VIDAL-10M dataset, extending video-language pretraining to N modalities.
- MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
  MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetson Orin GPU.
- HandyLabel: Towards Post-Processing to Real-Time Annotation Using Skeleton Based Hand Gesture Recognition
  HandyLabel enables real-time data annotation by mapping hand gestures to labels, with ResNet50 on skeleton-preprocessed HaGRID data reaching 0.923 F1-score and 88.9% of 46 study participants preferring it to traditional post-processing tools.
- 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview
  The report overviews five maritime computer vision benchmark challenges, their datasets, protocols, quantitative results, and top team approaches from the MaCVi 2026 workshop.
- Page image classification for content-specific data processing