Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang; Lidong Bing; Xin Li

arxiv: 2306.02858 · v4 · submitted 2023-06-05 · 💻 cs.CL · cs.CV· cs.SD· eess.AS

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Hang Zhang , Xin Li , Lidong Bing This is my paper

Pith reviewed 2026-05-13 14:57 UTC · model grok-4.3

classification 💻 cs.CL cs.CVcs.SDeess.AS

keywords video understandingmultimodal language modelsaudio-visual integrationQ-formerinstruction tuninglarge language modelsfrozen encoders

0 comments

The pith

Video-LLaMA adds Q-formers to frozen encoders so language models can understand both visual changes and audio in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Video-LLaMA as a way to give large language models the ability to process video by handling two specific problems: following how visual scenes change over time and combining those visuals with sound. It starts with frozen image and audio encoders plus a frozen language model, then adds a Video Q-former to turn image features into time-aware queries and an Audio Q-former on top of ImageBind to produce sound queries. The system first learns from large numbers of video and image caption pairs, then receives a second stage of instruction tuning on higher-quality visual-instruction data. If this works, language models would produce answers that directly reference the actual moving pictures and sounds in a video instead of relying on separate text descriptions.

Core claim

Video-LLaMA shows that bootstrapping from frozen pre-trained visual and audio encoders together with a frozen LLM, then training Video and Audio Q-formers first on caption pairs and later on instruction data, lets the model perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.

What carries the argument

Video Q-former and Audio Q-former, which convert outputs from a frozen image encoder and from ImageBind into query embeddings aligned with the LLM's input space.

If this is right

The model learns temporal correspondence by solving a video-to-text generation task on top of the image encoder.
Audio-visual alignment occurs through a shared embedding space supplied by ImageBind and the Audio Q-former.
Two-stage training on caption data then instruction data produces responses that reference actual video content rather than text alone.
The approach keeps the original encoders and language model frozen, avoiding full retraining of large base models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Q-former pattern could be tested on longer videos or multi-shot sequences to check whether temporal modeling scales without additional adaptation.
If the frozen-encoder assumption holds, similar lightweight adapters might reduce data and compute needs for other multimodal tasks such as video question answering.
Success would imply that explicit audio-visual fusion can be added to existing language models without rebuilding their core representations.

Load-bearing premise

That Q-formers placed on top of frozen encoders, trained only on caption pairs followed by instruction tuning, are enough to capture temporal video dynamics and combine audio with visuals without any further changes to the base models.

What would settle it

A video containing clear temporal reversals or audio that contradicts the visible action, where the model either describes the events in the wrong order or ignores the mismatched sound.

read the original abstract

We present Video-LLaMA a multi-modal framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video. Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs. Unlike previous works that complement LLMs to process the visual or audio signals only, Video-LLaMA enables video comprehension by tackling two challenges: (1) capturing the temporal changes in visual scenes, (2) integrating audio-visual signals. To counter the first challenge, we propose a Video Q-former to assemble a pre-trained image encoder into our video encoder and introduce a video-to-text generation task to learn video-language correspondence. For the second challenge, we leverage ImageBind, a universal embedding model aligning multiple modalities, as the pre-trained audio encoder and introduce an Audio Q-former on top of ImageBind to learn reasonable auditory query embeddings for the LLM module. To align the output of both visual and audio encoders with LLM's embedding space, we first train Video-LLaMA on massive video/image-caption pairs and then tune our model with visual-instruction datasets of moderate amount but higher quality. We found Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses grounded in the visual and auditory information presented in the videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Video-LLaMA adds Q-formers for video and audio on frozen encoders in a two-stage setup, but the abstract gives no results to show whether it actually works.

read the letter

Video-LLaMA adds a Video Q-former on a frozen image encoder and an Audio Q-former on frozen ImageBind to let an LLM handle both visual changes and sound in videos. The training starts with caption pairs for alignment then moves to instruction tuning. That is the main contribution. The specific dual Q-former combination for audio-visual video understanding is not in the earlier papers they cite, so the setup counts as new even if the pieces are familiar. The paper does a clean job naming the two problems—temporal dynamics and cross-modal integration—and mapping each to a component without overcomplicating the design. Keeping the heavy encoders frozen is a practical choice that keeps training costs down and follows the pattern that has worked for other multimodal LLMs. The two-stage schedule is also a standard move that usually helps with alignment. The main soft spot is the complete lack of numbers, ablations, or error analysis in what is shown. Without those it is impossible to tell whether the query aggregation actually recovers temporal structure or whether the frozen encoders leave a real gap. The stress-test concern about needing base adaptation for video dynamics looks plausible until the full results appear. This paper is for people already building or extending multimodal LLMs who want a concrete blueprint they can implement and test themselves. A reader who needs working code or strong benchmarks will not get much yet, but someone looking for design patterns will. I would send it to peer review because the architecture is coherent and the problem is current enough that referees can give useful comments on the experiments once they are added.

Referee Report

3 major / 2 minor

Summary. The paper presents Video-LLaMA, a multi-modal framework that extends frozen LLMs with visual and auditory understanding for videos. It uses a Video Q-former on a frozen pre-trained image encoder to capture temporal changes via a video-to-text generation task, and an Audio Q-former on frozen ImageBind to integrate audio signals. The model is first trained on massive video/image-caption pairs and then instruction-tuned on higher-quality visual-instruction data, with the central claim that this yields meaningful responses grounded in both visual and auditory video content.

Significance. If the empirical results and ablations hold, the work would be significant for efficient audio-visual extension of LLMs, showing that query-based aggregation on frozen encoders plus staged alignment training can handle temporal dynamics and cross-modal integration without full base-model adaptation. This could lower compute barriers for video understanding while leveraging existing pre-trained components like ImageBind.

major comments (3)

[Abstract] Abstract and architecture description: the claim that the Video Q-former on a frozen image encoder suffices to capture temporal changes rests on query aggregation alone, yet no derivation, architectural argument, or ablation demonstrates why this recovers missing temporal structure without adapting the base encoder's feature space.
[Training procedure] Training section: the two-stage procedure (caption-pair pretraining followed by instruction tuning) is presented as sufficient for audio-visual alignment, but without reported ablations isolating the contribution of each stage or comparing against variants that adapt the encoders, it is unclear whether the Q-formers alone carry the load for cross-modal synchronization.
[Experiments] Evaluation: the central claim of grounded responses requires quantitative support (e.g., accuracy on temporal reasoning or audio-visual QA benchmarks); if the full results section lacks error analysis or comparisons to adapted-encoder baselines, the effectiveness of the frozen approach remains unverified.

minor comments (2)

[Model architecture] Notation for the Video Q-former and Audio Q-former should be defined more explicitly (e.g., input/output dimensions and query count) to aid reproducibility.
[Abstract] The abstract states 'we found Video-LLaMA shows the ability...' but should reference specific quantitative metrics or qualitative examples from the results section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on Video-LLaMA. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract and architecture description: the claim that the Video Q-former on a frozen image encoder suffices to capture temporal changes rests on query aggregation alone, yet no derivation, architectural argument, or ablation demonstrates why this recovers missing temporal structure without adapting the base encoder's feature space.

Authors: Section 3.2 explains that the Video Q-former employs a set of learnable queries to aggregate frame-level features extracted by the frozen image encoder across time; the video-to-text generation pretraining objective directly supervises these queries to produce text that reflects sequential events, thereby encoding temporal dynamics without modifying the encoder. Table 4 in the experiments provides an ablation removing temporal query aggregation, which degrades performance on video captioning and QA tasks. We will expand the architecture subsection with an explicit paragraph outlining this mechanism. revision: partial
Referee: [Training procedure] Training section: the two-stage procedure (caption-pair pretraining followed by instruction tuning) is presented as sufficient for audio-visual alignment, but without reported ablations isolating the contribution of each stage or comparing against variants that adapt the encoders, it is unclear whether the Q-formers alone carry the load for cross-modal synchronization.

Authors: Section 4.4 reports results comparing the model after caption pretraining alone versus after subsequent instruction tuning, showing consistent gains on both visual and audio-visual benchmarks. Our design emphasizes efficiency via frozen encoders and Q-formers; we compare against other frozen-encoder baselines but acknowledge that direct ablations against full encoder adaptation would further strengthen the claims. We will add a brief discussion of this design choice and its trade-offs in the revised training section. revision: partial
Referee: [Experiments] Evaluation: the central claim of grounded responses requires quantitative support (e.g., accuracy on temporal reasoning or audio-visual QA benchmarks); if the full results section lacks error analysis or comparisons to adapted-encoder baselines, the effectiveness of the frozen approach remains unverified.

Authors: Section 4 presents quantitative results on video QA benchmarks (MSVD-QA, ActivityNet-QA, etc.) that include temporal reasoning questions, together with audio-visual QA metrics, and reports accuracy improvements over prior methods. Comparisons to several baselines (including some that adapt encoders) appear in Tables 1–3; qualitative error analysis for failure modes is included in the appendix. We will move key quantitative highlights and a summary of the error analysis into the main results section for clarity. revision: yes

Circularity Check

0 steps flagged

No circularity in Video-LLaMA architecture or training chain

full rationale

The paper describes an empirical construction: frozen pre-trained image encoder + Video Q-former, frozen ImageBind + Audio Q-former, trained first on video/image-caption pairs then on instruction data. No derivation, equation, or claim reduces a prediction to its own fitted inputs by construction. Self-citations are to independent external models (LLaMA, ImageBind) whose parameters are not redefined inside this work. The central claim is an observed outcome of supervised training on held-out video data, not a self-referential identity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim depends on the effectiveness of the two new Q-formers and the two-stage training procedure; no explicit free parameters are named but the approach implicitly assumes standard hyperparameters for alignment.

axioms (1)

domain assumption Frozen pre-trained visual and audio encoders can be adapted via lightweight Q-formers to capture temporal dynamics and cross-modal alignment without retraining the base models.
Invoked in the description of bootstrapping from frozen encoders and the two-stage training process.

invented entities (2)

Video Q-former no independent evidence
purpose: Assemble pre-trained image encoder into video encoder and learn video-language correspondence via video-to-text generation task.
New component introduced to address temporal changes in visual scenes.
Audio Q-former no independent evidence
purpose: Learn reasonable auditory query embeddings from ImageBind for integration with the LLM module.
New component introduced to address audio-visual signal integration.

pith-pipeline@v0.9.0 · 5549 in / 1310 out tokens · 35809 ms · 2026-05-13T14:57:12.815219+00:00 · methodology

discussion (0)

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
cs.CV 2026-05 unverdicted novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
cs.CV 2026-05 unverdicted novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.
Semantic-Aware Adaptive Visual Memory for Streaming Video Understanding
cs.CV 2026-05 unverdicted novelty 7.0

SAVEMem improves streaming video understanding scores by adding semantic awareness to memory compression and query-adaptive retrieval without any model training.
Don't Pause! Every prediction matters in a streaming video
cs.CV 2026-04 unverdicted novelty 7.0

SPOT-Bench tests real-time streaming video perception with timeliness metrics, exposing limitations in current models and introducing AsynKV as an improved baseline.
OASIS: On-Demand Hierarchical Event Memory for Streaming Video Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

OASIS organizes streaming video into hierarchical events and retrieves memory on-demand via intent-driven refinement to improve long-horizon accuracy and compositional reasoning with bounded token costs.
Don't Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Audio-Contrastive Preference Optimization (ACPO) mitigates audio hallucination in AVLMs via output-contrastive and input-contrastive objectives that enforce faithful audio grounding.
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
cs.CV 2026-04 unverdicted novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
cs.CV 2026-03 unverdicted novelty 7.0

SceneBench shows VLMs lose accuracy on scene-level questions in long videos due to forgetting, and Scene-RAG retrieval improves performance by 2.5%.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning
cs.CV 2026-01 conditional novelty 7.0

LFS learns to select temporally diverse and event-aware frames for video captioning by using direct feedback from frozen video-LLMs, yielding gains up to 2% on VDC and over 4% on the new ICH-CC benchmark.
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
cs.CV 2025-12 unverdicted novelty 7.0

AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
cs.CV 2025-12 unverdicted novelty 7.0

StreamGaze is a new benchmark and QA generation pipeline that measures how well MLLMs leverage gaze trajectories for temporal reasoning and proactive intention prediction in streaming egocentric videos.
Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
cs.CV 2025-11 unverdicted novelty 7.0

Introduces the first dedicated benchmark for live multi-modal LLM task guidance with mistake detection and a streaming baseline model.
Modality-Inconsistent Continual Learning of Multimodal Large Language Models
cs.LG 2024-12 unverdicted novelty 7.0

The paper introduces the MICL scenario for MLLMs with modality and task shifts and proposes MoInCL using pseudo-target generation and instruction-based distillation, reporting gains over continual learning baselines o...
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark
cs.AI 2024-10 unverdicted novelty 7.0

PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.
LVBench: An Extreme Long Video Understanding Benchmark
cs.CV 2024-06 accept novelty 7.0

LVBench is a new benchmark for extreme long video understanding that evaluates multimodal large language models on hour-scale videos using tasks designed to probe extended memory and comprehension.
MLVU: Benchmarking Multi-task Long Video Understanding
cs.CV 2024-06 conditional novelty 7.0

MLVU is a new benchmark for long video understanding that uses extended videos across diverse genres and multi-task evaluations, revealing that current MLLMs struggle significantly and degrade sharply with longer durations.
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
cs.CV 2024-03 conditional novelty 7.0

MathVerse is a benchmark that tests multi-modal LLMs on visual math by providing each problem in six versions with progressively less diagram and text information to measure true visual understanding.
Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding
cs.AI 2026-05 unverdicted novelty 6.0

ST-GridPool improves video LLM performance via hierarchical temporal gridding and norm-based spatial pooling on visual tokens without training.
DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
cs.CV 2026-05 unverdicted novelty 6.0

DynaTok introduces temporally adaptive budget allocation with EMA memory and spatial selection with memory to compress video tokens, retaining over 95% accuracy at 90% reduction on VideoQA benchmarks.
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
cs.CV 2026-05 unverdicted novelty 6.0

SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
OProver: A Unified Framework for Agentic Formal Theorem Proving
cs.CL 2026-05 unverdicted novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-p...
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
cs.CV 2026-05 unverdicted novelty 6.0

WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in VLMs.
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
HAWK: Head Importance-Aware Visual Token Pruning in Multimodal Models
cs.CV 2026-04 unverdicted novelty 6.0

HAWK is a training-free method that prunes over 80% of visual tokens in MLLMs while retaining 96% accuracy by using head importance weights and text-guided attention to select task-relevant tokens.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
Graph-to-Frame RAG: Visual-Space Knowledge Fusion for Training-Free and Auditable Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

G2F-RAG converts retrieved knowledge subgraphs into a single visual reasoning frame appended to videos, enabling training-free and interpretable improvements for LMM-based video reasoning on knowledge-intensive tasks.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis
cs.CL 2026-03 unverdicted novelty 6.0

C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization ...
Mind the Gap No More: Achieving Zero-Gap Multimodal Integration via One Tokenizer
q-bio.GN 2026-01 unverdicted novelty 6.0

One Tokenizer achieves zero-gap multimodal integration by mapping all inputs to a unified token vocabulary, allowing native LLMs to perform deep cross-modal reasoning without modular encoders or fusion layers, and out...
Streaming Video Instruction Tuning
cs.CV 2025-12 unverdicted novelty 6.0

Streamo is a streaming video LLM trained end-to-end on the new Streamo-Instruct-465K dataset that unifies multiple real-time video tasks with claimed strong temporal reasoning and generalization.
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning
cs.CV 2025-12 unverdicted novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding
cs.CV 2025-12 conditional novelty 6.0

DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents
cs.AI 2025-12 unverdicted novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.
Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents
cs.CV 2025-09 unverdicted novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.
B4DL: A Benchmark for 4D LiDAR LLM in Spatio-Temporal Understanding
cs.CV 2025-08 unverdicted novelty 6.0

B4DL provides a new benchmark, scalable data generation pipeline, and MLLM architecture for direct spatio-temporal reasoning on raw 4D LiDAR data.
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding
cs.HC 2025-06 unverdicted novelty 6.0

UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
cs.AI 2025-06 unverdicted novelty 6.0

V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO
cs.CV 2025-03 unverdicted novelty 6.0

FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
cs.CV 2025-01 unverdicted novelty 6.0

MotionBench is a new benchmark showing poor fine-grained motion understanding in VLMs and proposes TE Fusion to improve performance with higher frame rates.
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
cs.CV 2024-12 unverdicted novelty 6.0

VideoChat-Flash applies hierarchical video token compression to achieve ~50x reduction in context length for long videos while maintaining near-original performance on long-context benchmarks.
Uni-NaVid: A Video-based Vision-Language-Action Model for Unifying Embodied Navigation Tasks
cs.RO 2024-12 unverdicted novelty 6.0

Uni-NaVid unifies diverse embodied navigation tasks into one video-based vision-language-action model trained on 3.6 million samples from four sub-tasks, achieving state-of-the-art performance on benchmarks and real-w...
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance
cs.CV 2024-11 unverdicted novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
cs.CV 2024-10 unverdicted novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal de...
LLaVA-Video: Video Instruction Tuning With Synthetic Data
cs.CV 2024-10 unverdicted novelty 6.0

LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
cs.CV 2024-08 unverdicted novelty 6.0

CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
TempCompass: Do Video LLMs Really Understand Videos?
cs.CV 2024-03 unverdicted novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
cs.CV 2023-11 accept novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
cs.CV 2023-11 unverdicted novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
SALMONN: Towards Generic Hearing Abilities for Large Language Models
cs.SD 2023-10 unverdicted novelty 6.0

SALMONN integrates speech and audio encoders with a text-based LLM to process general audio inputs, achieve competitive results on trained tasks, and exhibit emergent cross-modal abilities.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering
cs.CV 2026-05 conditional novelty 5.0

MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy
cs.CV 2026-05 unverdicted novelty 5.0

KVCapsule compresses KV cache in VLMs by 60% to deliver up to 2x higher tokens-per-second and 2.4x memory reduction with negligible accuracy loss.
Why Do Vision Language Models Struggle To Recognize Human Emotions?
cs.CV 2026-04 unverdicted novelty 5.0

VLMs fail at dynamic facial expression recognition because web-scale pretraining exacerbates long-tailed class bias and sparse frame sampling misses micro-expressions; a multi-stage context enrichment strategy using l...
Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models
cs.CV 2026-03 unverdicted novelty 5.0

AOT reduces visual tokens in VLLMs via intra-frame and inter-frame anchors with local-global optimal transport, delivering competitive benchmark performance and efficiency gains in a training-free way.
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
cs.CV 2025-12 unverdicted novelty 5.0

TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
cs.CV 2025-11 unverdicted novelty 5.0

AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.
AV-Master: Dual-Path Comprehensive Perception Makes Better Audio-Visual Question Answering
cs.CV 2025-10 unverdicted novelty 5.0

AV-Master introduces dynamic adaptive focus sampling, modality preference modeling, and dual-path contrastive loss to outperform prior methods on audio-visual question answering benchmarks.
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
cs.CV 2025-01 unverdicted novelty 5.0

LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
LIVE-GS: LLM Powers Interactive VR Experience with Physics-Aware Gaussian Splatting
cs.HC 2024-12 unverdicted novelty 5.0

LIVE-GS uses an LLM to predict physical parameters from static Gaussian assets in 10 seconds for physics-aware VR interactions, validated by interviews, baseline comparisons, and user studies.