Learning transferable visual models from natural language supervision
4 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CV (4)
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Citing papers
- Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
  A training-free pipeline combining SAM3 for masks and transformed DINOv3 embeddings for prototype matching delivers the first reported baseline for fine-grained fungi segmentation across one-shot to few-hundred-shot regimes.
- Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
  Ramen enables robust test-time adaptation of vision-language models under mixed-domain shifts by actively selecting domain-consistent and prediction-balanced samples via an embedding-gradient cache.
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
  MODIX dynamically rescales positional indices in VLMs using intra-modal covariance-based entropy and inter-modal alignment scores to allocate finer granularity to informative content.
- Tora3: Trajectory-Guided Audio-Video Generation with Physical Coherence
  Tora3 uses shared object trajectories as kinematic priors to jointly guide visual motion and acoustic events in audio-video generation, improving realism and synchronization.