CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.
hub Mixed citations
Tinyllava: A framework of small-scale large multimodal models
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.
A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.
A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.
G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.
MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.
Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.
The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.
citing papers explorer
-
LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models
LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.
-
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
-
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
-
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.