A two-stage diversity-plus-entropy token selection framework speeds up visual geometry transformers by over 85% on 500-image scenes while preserving baseline accuracy.
arXiv preprint arXiv:2512.21691 , year=
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
years
2026 3representative citing papers
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
citing papers explorer
-
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
A two-stage diversity-plus-entropy token selection framework speeds up visual geometry transformers by over 85% on 500-image scenes while preserving baseline accuracy.
-
OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL
OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.
- GHOST: Geometry-Hierarchical Online Streaming Token Eviction for Efficient 3D Reconstruction