DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.
Learn to explain: Multimodal reasoning via thought chains for science question answering
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.
citing papers explorer
-
A More Word-like Image Tokenization for MLLMs
DiVT clusters patch embeddings into coherent semantic units and adapts token count to image complexity, matching or exceeding baselines with fewer visual tokens on multimodal benchmarks.
-
Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models
CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.