Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception

Ma, Z · 2025 · arXiv 2510.12720

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 2 · background 1

citation-polarity summary

baseline 2 · background 1

representative citing papers

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · novelty 8.0

EgoSound is a new benchmark with 7315 QA pairs across seven tasks to evaluate egocentric sound understanding in multimodal large language models.

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

eess.AS · 2026-01-06 · unverdicted · novelty 7.0

FCaps supplies 19M fine-grained speech style captions on 47k hours of audio via direct grounding, enabling the CLSP model to produce multi-granular representations that improve retrieval, zero-shot classification, and style scoring aligned with human judgments.

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

eess.AS · 2026-05-12 · unverdicted · novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones such as Gemini-3.1-Pro and improving fine-grained control in video generation models.

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

cs.CV · 2026-07-02 · unverdicted · novelty 4.0

TCA-Captioner introduces an Observer-Checker-Corrector refinement loop and TCA-Bench to address modality detachment and temporal incoherence in audiovisual video captioning.

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

cs.AI · 2026-04-09 · unverdicted · novelty 4.0

PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.

VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track

eess.AS · 2026-06-05 · unverdicted · novelty 3.0

VISA ranks 2nd in the Interspeech 2026 ARC Agent Track by adding multi-modal feature extraction, consistency-checked model voting, and rubric-aligned routing to large audio language models, reaching 66.23% Rubrics score and 77.40% accuracy.

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · novelty 2.0

A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.

citing papers explorer

Showing 9 of 9 citing papers.

EgoSound: Benchmarking Sound Understanding in Egocentric Videos

cs.CV · 2026-02-15 · unverdicted · none · ref 21

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

cs.CV · 2026-04-13 · unverdicted · none · ref 17

Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training

eess.AS · 2026-01-06 · unverdicted · none · ref 5

Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

eess.AS · 2026-05-12 · unverdicted · none · ref 27

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · none · ref 40

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

cs.CV · 2026-07-02 · unverdicted · none · ref 45

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

cs.AI · 2026-04-09 · unverdicted · none · ref 14

VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track

eess.AS · 2026-06-05 · unverdicted · none · ref 10

A Survey of Audio Reasoning in Multimodal Foundation Models

eess.AS · 2026-05-20 · unverdicted · none · ref 104
