TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Jointavbench: A benchmark for joint audio-visual reasoning evaluation
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5verdicts
UNVERDICTED 5roles
dataset 2representative citing papers
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
citing papers explorer
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
-
Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
-
MMTB: Evaluating Terminal Agents on Multimedia-File Tasks
MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.
-
MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.