pith. sign in

arxiv: 2602.08711 · v3 · pith:2DMLWPZOnew · submitted 2026-02-09 · 💻 cs.CV

TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions

classification 💻 cs.CV
keywords audio-visualdensescenecaptionsconstructdescriptionsstructuraltime-aware
0
0 comments X
read the original abstract

This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code are available at https://github.com/yaolinli/TimeChat-Captioner.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniScript is a new 8B omni-modal model that turns long cinematic videos into scene-by-scene scripts and matches top proprietary models on temporal localization and semantic accuracy.

  2. CineCap: Structured Reasoning with Spatio-Temporal Anchors for Cinematographic Video Captioning

    cs.AI 2026-06 unverdicted novelty 6.0

    CineCap combines structured reasoning and RL rewards to outperform baselines on cinematographic video captioning using a new 472-pair benchmark.

  3. DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

    cs.CV 2026-05 unverdicted novelty 6.0

    DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.