TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
hub
arXiv preprint arXiv:2501.15111 , year=
11 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization in multimodal analysis.
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
A rank-aware selective fusion framework with attention gating and decoupled heads outperforms baselines and ranks 2nd on the BlEmoRE challenge for blended emotion recognition.
citing papers explorer
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
GRASP is a large-scale dataset and benchmark for social reasoning grounded in gaze and gesture events in multi-person videos, with Social Grounding Reward (SGR) proposed to improve model performance on GRASP-Bench.
-
Watching Movies Like a Human: Egocentric Emotion Understanding for Embodied Companions
Creates the first egocentric screen-view movie emotion benchmark and demonstrates that cinematic models drop sharply in Macro-F1 on realistic robot-like viewing conditions while domain-specific training improves robustness.
-
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
-
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
SWIM aligns cross-attention maps from object nouns to ground-truth masks during training on the new NL-Refer dataset to enable text-only fine-grained video object understanding in MLLMs.
-
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
-
C2F-Thinker: Coarse-to-Fine Reasoning with Hint-Guided Reinforcement Learning for Multimodal Sentiment Analysis
C2F-Thinker combines structured coarse-to-fine chain-of-thought reasoning with hint-guided GRPO reinforcement learning to achieve competitive fine-grained sentiment regression and superior cross-domain generalization in multimodal analysis.
-
Learning Invariant Modality Representation for Robust Multimodal Learning from a Causal Inference Perspective
CmIR uses causal inference to separate invariant causal representations from spurious ones in multimodal data, improving generalization under distribution shifts and noise via invariance, mutual information, and reconstruction constraints.
-
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
LLaVA-Octopus introduces instruction-driven adaptive fusion of multiple visual projectors in a multimodal LLM to improve video understanding performance.
-
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
A rank-aware selective fusion framework with attention gating and decoupled heads outperforms baselines and ranks 2nd on the BlEmoRE challenge for blended emotion recognition.
- Training-Free Multimodal Large Language Model Orchestration