ActionArt: Advancing multimodal large models for fine-grained human-centric video understanding
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
  SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
- HumanMoveVQA: Can Video MLLMs reason about human movement in videos?
  HumanMoveVQA is a benchmark using 3D-lifted video tracks to evaluate video MLLMs on seven categories of global human motion reasoning, showing gaps in proprietary models but gains from fine-tuning.
- Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
  Introduces the VIP identification task, releases the Temporal-VIP dataset, and presents the VIP-Net framework, which achieves 67.3% accuracy on identifying important persons in videos while providing a rationale similarity of 0.63.