Sigmoid loss for language image pre-training
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary: background 1
verdicts: UNVERDICTED 2
representative citing papers
SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.
citing papers explorer
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
  SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.