TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

Feifan Song; Kun Gai; Kun Ouyang; Lei Li; Lingpeng Kong; Linli Yao; Pengfei Wan; Qi Liu; Xinlong Chen; Xu Sun

arxiv: 2602.08711 · v3 · pith:2DMLWPZOnew · submitted 2026-02-09 · 💻 cs.CV

Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen, Feifan Song, Ziyue Wang, Kun Ouyang, Yuanxin Liu, Lingpeng Kong, Qi Liu, Pengfei Wan, Kun Gai, Yuanxing Zhang, Xu Sun

classification 💻 cs.CV

keywords audio-visual, dense, scene, captions, construct, descriptions, structural, time-aware

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code are available at https://github.com/yaolinli/TimeChat-Captioner.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
cs.CV 2026-04 unverdicted novelty 7.0

OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning
cs.AI 2026-06 unverdicted novelty 6.0

CineCap combines structured reasoning and RL rewards to outperform baselines on cinematographic video captioning using a new 472-pair benchmark.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
cs.CV 2026-05 unverdicted novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.