EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.
Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception
11 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (11)
verdicts: UNVERDICTED (11)
representative citing papers
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
FCaps supplies 19M fine-grained speech style captions on 47k hours of audio via direct grounding, enabling the CLSP model to produce multi-granular representations that improve retrieval, zero-shot classification, and style scoring aligned with human judgments.
Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
The CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.
CogniRoute adds a cognitive schema and route-aware RL to an omni-modal MoE, reaching 59.38% accuracy on a new 118K-example social video QA benchmark and beating prior baselines by 15-27 points.
TCA-Captioner introduces an Observer-Checker-Corrector refinement loop and TCA-Bench to address modality detachment and temporal incoherence in audiovisual video captioning.
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.
VISA ranks 2nd in the Interspeech 2026 ARC Agent Track by adding multi-modal feature extraction, consistency-checked model voting, and rubric-aligned routing to large audio language models, reaching 66.23% Rubrics score and 77.40% accuracy.
A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.