hub Canonical reference

Valley: Video assistant with large language model enhanced ability

Ruipu Luo, Ziwang Zhao, Min Yang, Junwei Dong, Minghui Qiu, Pengcheng Lu, Tao Wang, Zhongyu Wei · 2023 · arXiv 2306.07207

Canonical reference. 80% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 baseline 1

citation-polarity summary

background 4 baseline 1

representative citing papers

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

cs.CL · 2023-07-30 · unverdicted · novelty 7.0

SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding

cs.CV · 2025-12-07 · conditional · novelty 6.0

DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.

UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding

cs.HC · 2025-06-23 · unverdicted · novelty 6.0

UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.

FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO

cs.CV · 2025-03-12 · unverdicted · novelty 6.0

FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

cs.CV · 2024-11-04 · unverdicted · novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

cs.CV · 2024-10-22 · unverdicted · novelty 6.0

LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.

TempCompass: Do Video LLMs Really Understand Videos?

cs.CV · 2024-03-01 · unverdicted · novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

cs.CV · 2023-11-28 · accept · novelty 6.0

MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

cs.CV · 2023-11-16 · unverdicted · novelty 6.0

Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

cs.CV · 2024-12-04 · unverdicted · novelty 5.0

TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

cs.CV · 2024-07-03 · conditional · novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

ClimateVID -- Social Media Videos Analysis and Challenges Involved

cs.CV · 2026-04-30 · unverdicted · novelty 4.0

Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.

Empowering Video Translation using Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

citing papers explorer

Showing 16 of 16 citing papers.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration cs.CV · 2026-04-06 · unverdicted · none · ref 25
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension cs.CL · 2023-07-30 · unverdicted · none · ref 17
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
Dynamic Model Merging Made Slim cs.LG · 2026-05-17 · unverdicted · none · ref 59
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding cs.CV · 2026-04-15 · unverdicted · none · ref 44
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.
Detector-Empowered Video Large Language Model for Efficient Spatio-Temporal Grounding cs.CV · 2025-12-07 · conditional · none · ref 47
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
UniMind: Unleashing the Power of LLMs for Unified Multi-Task Brain Decoding cs.HC · 2025-06-23 · unverdicted · none · ref 47
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
FaVChat: Hierarchical Prompt-Query Guided Facial Video Understanding with Data-Efficient GRPO cs.CV · 2025-03-12 · unverdicted · none · ref 11
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance cs.CV · 2024-11-04 · unverdicted · none · ref 12
PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding cs.CV · 2024-10-22 · unverdicted · none · ref 18
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
TempCompass: Do Video LLMs Really Understand Videos? cs.CV · 2024-03-01 · unverdicted · none · ref 108
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark cs.CV · 2023-11-28 · accept · none · ref 50
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection cs.CV · 2023-11-16 · unverdicted · none · ref 64
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos cs.CV · 2024-12-04 · unverdicted · none · ref 28
TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output cs.CV · 2024-07-03 · conditional · none · ref 102
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
ClimateVID -- Social Media Videos Analysis and Challenges Involved cs.CV · 2026-04-30 · unverdicted · none · ref 45
Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.
Empowering Video Translation using Multimodal Large Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 7
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Valley: Video assistant with large language model enhanced ability

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer