Bagpiper: Solving Open-Ended Audio Tasks via Rich Captions

Bo-Hao Su; Chao-Han Huck Yang; Chien-yu Huang; Chin-Jou Li; Haoran Wang; Jiatong Shi; Jinchuan Tian; Jin Sakuma; Keita Goto; Masao Someki

arxiv: 2602.05220 · v3 · pith:7PJHPW5Fnew · submitted 2026-02-05 · 💻 cs.CL · cs.SD

Jinchuan Tian, Haoran Wang, Bo-Hao Su, Chien-yu Huang, Qingzheng Wang, Jiatong Shi, William Chen, Xun Gong, Siddhant Arora, Chin-Jou Li, Masao Someki, Takashi Maekaku, Keita Goto, Yusuke Shinohara, Jin Sakuma, Chao-Han Huck Yang, Shinji Watanabe

keywords: audio, bagpiper, cognitive, model, tasks, captions, concepts, foundation

Current audio foundation models typically rely on rigid, task-specific supervision, addressing isolated factors of audio rather than the whole. In contrast, human intelligence processes audio holistically, seamlessly bridging physical signals with abstract cognitive concepts to execute complex tasks. Grounded in this philosophy, we introduce Bagpiper, an 8B audio foundation model that interprets physical audio via rich captions, i.e., comprehensive natural language descriptions that encapsulate the critical cognitive concepts inherent in the signal (e.g., transcription, audio events). By pre-training on a massive corpus of 600B tokens, the model establishes a robust bidirectional mapping between raw audio and this high-level conceptual space. During fine-tuning, Bagpiper adopts a caption-then-process workflow, simulating an intermediate cognitive reasoning step to solve diverse tasks without task-specific priors. Experimentally, Bagpiper outperforms Qwen-2.5-Omni on MMAU and AIRBench for audio understanding and surpasses CosyVoice3 and TangoFlux in generation quality, and it can synthesize arbitrary compositions of speech, music, and sound effects. To the best of our knowledge, Bagpiper is among the first works to achieve unified understanding and generation for general audio. Model, data, and code are available at the Bagpiper home page.
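
The caption-then-process workflow described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the `RichCaption` class, the `caption_then_process` function, and the two toy stand-ins are hypothetical names introduced here, and the abstract describes a single model performing both steps rather than two separate callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RichCaption:
    """Hypothetical container for concepts a rich caption might carry.

    The abstract names transcription and audio events as examples;
    the field layout here is an assumption for illustration only.
    """
    transcription: str = ""
    audio_events: List[str] = field(default_factory=list)

    def to_text(self) -> str:
        # Render the caption as plain natural language.
        events = ", ".join(self.audio_events) or "none"
        return f'Transcription: "{self.transcription}". Audio events: {events}.'


def caption_then_process(
    audio: bytes,
    instruction: str,
    captioner: Callable[[bytes], RichCaption],
    solver: Callable[[str, str], str],
) -> Dict[str, str]:
    """Two-step flow: describe the audio first, then solve the task from the description."""
    # Step 1: map the raw audio to a rich caption (the intermediate step).
    caption_text = captioner(audio).to_text()
    # Step 2: answer the open-ended instruction using only the caption text.
    answer = solver(caption_text, instruction)
    return {"caption": caption_text, "answer": answer}


# Toy stand-ins (assumptions): a real system would call a model in both places.
def toy_captioner(audio: bytes) -> RichCaption:
    return RichCaption(transcription="hello there", audio_events=["door closing"])


def toy_solver(caption_text: str, instruction: str) -> str:
    return f"[{instruction}] based on: {caption_text}"


result = caption_then_process(
    b"\x00\x01", "Summarize the clip", toy_captioner, toy_solver
)
print(result["caption"])
print(result["answer"])
```

The point of the split is that the task-solving step sees only the natural-language caption, so no task-specific interface is needed between the two steps.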

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis
cs.CL 2026-06 unverdicted novelty 7.0

Bagpiper-TTS uses natural language prompts and intent reasoning to derive rich captions that guide a single model for universal speech synthesis across classical TTS, multi-talker, singing, and role-play tasks.

Bagpiper-Edit: Zero-Shot Open-Ended Audio Editing via Rich-Caption
cs.SD 2026-06 unverdicted novelty 6.0

Bagpiper-Edit performs zero-shot open-ended audio editing by translating natural-language instructions into edited rich captions that guide generation anchored to the original audio.

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text
cs.SD 2026-05 unverdicted novelty 6.0

Dasheng AudioGen uses multi-view captions and a unified semantic-acoustic representation to enable end-to-end generation of mixed audio scenes from text descriptions.

ESPnet3: Infrastructure for Scalable Speech and Audio Research in the Foundation Model Era
eess.AS 2026-06 unverdicted novelty 4.0

ESPnet3 introduces a new modular architecture with DataOrganizer and sharding to cut training time and simplify model integration for speech research.