hub Mixed citations

Tinyllava: A framework of small-scale large multimodal models

· 2024 · arXiv 2402.14289

Mixed citation behavior. Most common role is background (60%).

13 Pith papers citing it

Background 60% of classified citations

read on arXiv browse 13 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 3 baseline 2

citation-polarity summary

background 3 baseline 2

representative citing papers

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

cs.LG · 2026-05-30 · unverdicted · novelty 6.0

DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.

Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

Introduces VIP identification task, releases Temporal-VIP dataset, and presents VIP-Net framework that achieves 67.3% accuracy on identifying important persons in videos while providing rationale similarity of 0.63.

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

cs.AI · 2025-11-28 · unverdicted · novelty 5.0

AgroCoT is a new Chain-of-Thought VQA benchmark with 4759 samples to evaluate reasoning capabilities of vision-language models in agriculture.

Online In-Context Distillation for Low-Resource Vision Language Models

cs.CV · 2025-10-20 · unverdicted · novelty 5.0

Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

cs.LG · 2026-04-23 · unverdicted · novelty 2.0

The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.

citing papers explorer

Showing 4 of 4 citing papers after filters.

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models cs.LG · 2026-05-30 · unverdicted · none · ref 58
LASER introduces curvature-weighted SVD from second-order loss approximation and loss-aware rank allocation to compress VLMs, reporting over 2.3x decoding speedup under low-precision settings.
DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation cs.LG · 2026-05-30 · unverdicted · none · ref 37
DREAM-S combines neural architecture search, target-aware supernet training, and attention-entropy-guided distillation to accelerate speculative decoding in VLMs, reporting up to 3.85x speedup over standard methods.
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics cs.LG · 2025-06-02 · unverdicted · none · ref 50
SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.
Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models cs.LG · 2026-04-23 · unverdicted · none · ref 70
The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.

Tinyllava: A framework of small-scale large multimodal models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer