VideoTree: Adaptive tree-based video representation for LLM reasoning on long videos

2025 · arXiv 2405.19209

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · baseline: 1

citation-polarity summary

background: 2 · baseline: 1

representative citing papers

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.

Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models

cs.CV · 2026-05-03 · unverdicted · novelty 6.0

VAP is a training-free active-perception method that improves zero-shot long-form video QA performance and frame efficiency up to 5.6x in VLMs by selecting keyframes that differ from priors generated by a text-conditioned video model.

One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9 percent to 46.2 percent after tuning on 2.5 percent of standard data.

Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

ProVCA progressively condenses long videos via segment localization, snippet selection, and keyframe refinement to achieve SOTA zero-shot accuracies on EgoSchema, NExT-QA, and IntentQA with fewer frames.

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding

cs.CV · 2026-01-21 · unverdicted · novelty 6.0

HERMES organizes the KV cache into a hierarchical memory to enable real-time streaming video understanding in MLLMs, achieving 10x faster TTFT and up to 11.4% accuracy gains on streaming benchmarks with 68% fewer tokens.

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

cs.RO · 2025-04-13 · unverdicted · novelty 4.0

AirVista-II integrates agent-based task identification and scheduling, multimodal perception, and scenario-tailored keyframe extraction to deliver high-quality zero-shot semantic understanding for embodied UAVs in dynamic environments.

citing papers explorer

Showing 6 of 6 citing papers.

SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration · cs.CV · 2026-04-06 · unverdicted · none · ref 64
Video Active Perception: Effective Inference-Time Long-Form Video Understanding with Vision-Language Models · cs.CV · 2026-05-03 · unverdicted · none · ref 47
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding · cs.CV · 2026-04-15 · unverdicted · none · ref 63
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding · cs.CV · 2026-04-03 · unverdicted · none · ref 3
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding · cs.CV · 2026-01-21 · unverdicted · none · ref 45
AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding · cs.RO · 2025-04-13 · unverdicted · none · ref 13
