SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
Visual instruction tuning.Advances in neural information processing systems, 36
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
verdicts
UNVERDICTED 2roles
background 1polarities
background 1representative citing papers
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.
citing papers explorer
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding
SPEAR-1 combines a 3D-enriched VLM with embodied control to match or exceed existing robotic foundation models using 20 times fewer robot demonstrations.