Preprint, arXiv:2112.07658
3 Pith papers cite this work.
Fields: cs.CV (3)
Years: 2026 (3)
Citing papers
- Token-Sparse Medical Multimodal Reasoning via Dual-Stream Reinforcement Learning
ViToS uses dual-stream RL with cross-feedback optimization to prune medical image tokens to 77% length while reporting 108.27% and 104.16% relative performance on two 7B VLMs across seven benchmarks.
- When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics
STS is a two-stage pruning framework that decouples structural diversity via repulsion sampling from semantic filtering via cross-attention to reduce redundancy in visual tokens for VLMs.
- ViT-FREE: Efficient Face Recognition via Early Exiting and Synthetic Adaptation
ViT-FREE enables early exiting from pretrained ViTs for face verification with up to 20% speedup and 1.5 accuracy drop on IJB-C, plus a synthetic-data fine-tuning variant for shallow exits.