Beyond Transcripts: A Renewed Perspective on Audio Chaptering

· 2026 · cs.SD · arXiv 2602.08979

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

Audio chaptering, the task of segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and text-based, leaving key questions unresolved about leveraging audio information, handling ASR errors, and evaluating without transcripts. We address these gaps through three contributions: (1) a systematic comparison of text-based models with acoustic features, a novel audio-only architecture (AudioSeg) operating on learned audio representations, and multimodal LLMs (MLLMs); (2) an empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space protocols with transcript-invariant time-space protocols. Our experiments on YTSeg reveal that AudioSeg substantially outperforms text-based approaches, that pauses provide the largest acoustic gains, and that MLLMs remain limited by context length and weak instruction following, though they are promising on shorter audio.
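
To make the idea of a transcript-invariant, time-space protocol concrete, the following is a minimal sketch of one plausible metric of that kind: boundary precision, recall, and F1 computed directly on timestamps, with a tolerance window in seconds. The abstract does not specify the paper's actual protocols, so the function name `boundary_f1`, the 5-second default tolerance, and the one-to-one matching rule are all illustrative assumptions, not the authors' method.

```python
from typing import List, Tuple


def boundary_f1(
    pred: List[float], ref: List[float], tol: float = 5.0
) -> Tuple[float, float, float]:
    """Score predicted chapter boundaries against reference boundaries.

    Hypothetical illustration only: boundaries are timestamps in seconds,
    and a prediction counts as a hit if it lies within `tol` seconds of a
    reference boundary that has not already been matched.
    """
    pred = sorted(pred)
    ref = sorted(ref)
    i = j = matched = 0
    # Two-pointer sweep over both sorted lists; each boundary is used once.
    while i < len(pred) and j < len(ref):
        if abs(pred[i] - ref[j]) <= tol:
            matched += 1
            i += 1
            j += 1
        elif pred[i] < ref[j]:
            i += 1  # prediction too early to match any remaining reference
        else:
            j += 1  # reference too early to match any remaining prediction
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up timestamps, purely for demonstration.
    print(boundary_f1([10.0, 62.0, 200.0], [12.0, 60.0, 120.0, 198.0]))
```

Because the score depends only on timestamps, it is unaffected by which ASR system (if any) produced a transcript, which is the property the abstract attributes to time-space protocols. A text-space protocol would instead compare boundaries at sentence or token positions in a transcript, so its scores shift when the transcript changes.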

representative citing papers

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026

cs.CL · 2026-06-03 · unverdicted · novelty 4.0

KIT's IWSLT submission uses segment concatenation, LLM label generation and cross-lingual translation to create >1M long-form training instances and shows that likelihood re-ranking harms semantic tasks unless combined with Minimum Bayes Risk decoding.

