hub Mixed citations

Llama-vid: An image is worth 2 tokens in large language models

Yanwei Li, Chengyao Wang, Jiaya Jia · 2023 · arXiv 2311.17043

Mixed citation behavior. Most common role is method (60%).

16 Pith papers citing it

Method 60% of classified citations

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 3 background 2

citation-polarity summary

use method 3 background 2

representative citing papers

AffectVerse: Emotional World Models for Multimodal Affective Computing

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

cs.CV · 2026-03-28 · unverdicted · novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

cs.CV · 2024-07-10 · unverdicted · novelty 7.0

LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.

LVBench: An Extreme Long Video Understanding Benchmark

cs.CV · 2024-06-12 · accept · novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.

MLVU: Benchmarking Multi-task Long Video Understanding

cs.CV · 2024-06-06 · conditional · novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.

ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs

cs.CV · 2025-07-29 · unverdicted · novelty 6.0

ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.

Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks

cs.RO · 2024-12-09 · unverdicted · novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.

LLaVA-Video: Video Instruction Tuning With Synthetic Data

cs.CV · 2024-10-03 · unverdicted · novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

cs.CV · 2024-08-19 · unverdicted · novelty 6.0

LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.

What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

cs.CV · 2024-07-11 · unverdicted · novelty 6.0

Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.

TempCompass: Do Video LLMs Really Understand Videos?

cs.CV · 2024-03-01 · unverdicted · novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation

cs.CV · 2024-02-24 · unverdicted · novelty 6.0

NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

cs.CV · 2024-04-25 · conditional · novelty 5.0

A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

cs.CV · 2024-03-27 · unverdicted · novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

cs.CV · 2026-05-16

citing papers explorer

Showing 16 of 16 citing papers.

AffectVerse: Emotional World Models for Multimodal Affective Computing cs.CV · 2026-05-19 · unverdicted · none · ref 20
AffectVerse improves multimodal emotion recognition by at least 2.57% on nine benchmarks through an Emotion World Module that performs short-horizon latent affective prediction via cross-modal temporal imagination and belief aggregation.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark cs.CV · 2026-03-28 · unverdicted · none · ref 24
SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models cs.CV · 2024-07-10 · unverdicted · none · ref 31
LLaVA-NeXT-Interleave unifies multi-image, video, and 3D capabilities in large multimodal models via a new 1.18M-sample interleaved dataset and benchmark, achieving leading results across those tasks while preserving single-image performance.
LVBench: An Extreme Long Video Understanding Benchmark cs.CV · 2024-06-12 · accept · none · ref 19
LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
MLVU: Benchmarking Multi-task Long Video Understanding cs.CV · 2024-06-06 · conditional · none · ref 27
MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs cs.CV · 2025-07-29 · unverdicted · none · ref 22
ReGATE introduces a teacher-student adaptive token elision method that reduces training tokens to 38% while matching or exceeding baseline accuracy on multimodal benchmarks.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks cs.RO · 2024-12-09 · unverdicted · none · ref 51
Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-world tests.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 193
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
LongVILA: Scaling Long-Context Visual Language Models for Long Videos cs.CV · 2024-08-19 · unverdicted · none · ref 14
LongVILA scales visual-language models from 8 to 2048 video frames with 99.8% needle-in-a-haystack accuracy using long-context extension, supervised fine-tuning, and multi-modal sequence parallelism on up to 256 GPUs.
What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction cs.CV · 2024-07-11 · unverdicted · none · ref 33
Introduces the QEVD benchmark for asynchronous situated interaction in fitness coaching and proposes a streaming baseline to address limitations of existing vision-language models.
TempCompass: Do Video LLMs Really Understand Videos? cs.CV · 2024-03-01 · unverdicted · none · ref 100
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation cs.CV · 2024-02-24 · unverdicted · none · ref 57
NaVid, a video-based VLM trained on 510k navigation and 763k web samples, achieves SOTA VLN performance using only monocular RGB video for next-step action planning in sim and real environments.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 74
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning cs.CV · 2024-04-25 · conditional · none · ref 21
A temporal pooling layer added to LLaVA smooths video feature distributions and lifts performance on dense video captioning and QA to new SOTA levels without extra parameters.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models cs.CV · 2024-03-27 · unverdicted · none · ref 10
Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation cs.CV · 2026-05-16 · unreviewed · ref 6

Llama-vid: An image is worth 2 tokens in large language models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer