
TinyLLaVA: A Framework of Small-scale Large Multimodal Models

Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang · 2024 · arXiv 2402.14289

Mixed citation behavior. The most common role is background, at 60% of classified citations.

10 Pith papers cite it.


citation-role summary

background 3 · baseline 2

citation-polarity summary

background 3 · baseline 2

representative citing papers

The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding

cs.CV · 2024-12-11 · unverdicted · novelty 7.0

CompART adds a composition loss on decomposed captions to regularize attention sums and improves multi-object grounding plus VQA across four VLM types and six benchmarks.

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

A cascaded knowledge distillation method with intermediate teachers improves efficiency of vision-language models like LLaVA while achieving state-of-the-art results on seven VQA benchmarks.

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

SmolVLA is a small efficient VLA model that achieves performance comparable to 10x larger models while training on one GPU and deploying on consumer hardware via community data and chunked asynchronous action prediction.

Are We on the Right Way for Evaluating Large Vision-Language Models?

cs.CV · 2024-03-29 · conditional · novelty 6.0

Current LVLM benchmarks overestimate capabilities because many questions can be answered without images due to design flaws or data leakage; MMStar is a human-curated set of 1,500 vision-indispensable samples across 6 capabilities and 18 axes with new metrics for leakage and true multi-modal gain.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

cs.CV · 2024-03-14 · unverdicted · novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

cs.AI · 2025-11-28 · unverdicted · novelty 5.0

AgroCoT is a new Chain-of-Thought VQA benchmark with 4,759 samples for evaluating the reasoning capabilities of vision-language models in agriculture.

Online In-Context Distillation for Low-Resource Vision Language Models

cs.CV · 2025-10-20 · unverdicted · novelty 5.0

Online In-Context Distillation lets small VLMs gain up to 33% performance with as little as 4% teacher annotations by distilling knowledge through dynamic in-context demonstrations at inference.

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models

cs.LG · 2026-04-23 · unverdicted · novelty 2.0

The paper compiles hardware-software co-design techniques including mixed-precision quantization, structural pruning, speculative decoding, and transformer accelerators to speed up multimodal foundation models, with examples in medical and code tasks.

citing papers explorer


The ART of Composition: Attention-Regularized Training for Compositional Visual Grounding
cs.CV · 2024-12-11 · unverdicted · none · ref 65

LLaVA-CKD: Bottom-Up Cascaded Knowledge Distillation for Vision-Language Models
cs.CV · 2026-05-11 · unverdicted · none · ref 16

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models
cs.CV · 2026-04-16 · unverdicted · none · ref 56

Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
cs.CV · 2026-04-06 · unverdicted · none · ref 63

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
cs.LG · 2025-06-02 · unverdicted · none · ref 50

Are We on the Right Way for Evaluating Large Vision-Language Models?
cs.CV · 2024-03-29 · conditional · none · ref 53

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV · 2024-03-14 · unverdicted · none · ref 133

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
cs.AI · 2025-11-28 · unverdicted · none · ref 59

Online In-Context Distillation for Low-Resource Vision Language Models
cs.CV · 2025-10-20 · unverdicted · none · ref 25

Focus Session: Hardware and Software Techniques for Accelerating Multimodal Foundation Models
cs.LG · 2026-04-23 · unverdicted · none · ref 70
