WavFlow: Audio Generation in Waveform Space

Belinda Zeng; Fanny Yang; Feiyan Zhou; Luyuan Wang; Shoufa Chen; Xiaohui Zhang; Yuren Cong; Zhe Wang; Zhiheng Liu

arxiv: 2605.18749 · v1 · pith:WH76AKF4new · submitted 2026-05-18 · 💻 cs.SD · cs.CV

Pith reviewed 2026-05-20 07:29 UTC · model grok-4.3

keywords audio generation · waveform space · flow matching · video-to-audio · text-to-audio · multimodal generation · latent-free synthesis

The pith

WavFlow generates high-fidelity audio directly in raw waveform space without latent compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that audio can be synthesized at high quality by working straight on the raw waveform instead of first compressing it into a latent space. It addresses the difficulties of high-dimensional low-energy signals by reshaping audio into two-dimensional grids through waveform patchify and applying amplitude lifting to balance scales for flow matching. A large curated set of five million video-text-audio triplets then supplies the data for learning semantic and temporal details from scratch. Results on VGGSound and AudioCaps benchmarks reach or surpass those of established latent-based systems, indicating that compression steps are not essential for competitive multimodal generation.

Core claim

WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62) by generating audio directly in waveform space. It reshapes raw audio into 2D token grids via waveform patchify and applies amplitude lifting to enable stable direct x-prediction optimization in flow matching. Training on five million high-quality video-text-audio triplets allows the model to capture fine-grained acoustic patterns without intermediate representations, demonstrating that compression is not a prerequisite for high-quality synthesis.
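
To make "direct x-prediction in flow matching" concrete, here is a minimal sketch in plain Python. It assumes a straight-line path from noise (t = 0) to data (t = 1), which is a common flow-matching convention but is not confirmed by this page as the paper's exact setup. All function names are hypothetical, and the `oracle` predictor stands in for a trained network.

```python
import random

def interpolate(noise, data, t):
    """Point on the assumed straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * e + t * x for e, x in zip(noise, data)]

def velocity_from_x_prediction(x_pred, x_t, t, eps=1e-5):
    """Convert a clean-signal prediction into the velocity used by the sampler."""
    denom = max(1.0 - t, eps)  # guard against division by zero near t = 1
    return [(xp - xt) / denom for xp, xt in zip(x_pred, x_t)]

def x_prediction_loss(x_pred, data):
    """Mean squared error between the predicted and the true clean signal."""
    return sum((xp - x) ** 2 for xp, x in zip(x_pred, data)) / len(data)

def euler_sample(predict_x, noise, num_steps):
    """Integrate from t=0 to t=1 with fixed-step Euler updates."""
    x_t = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_from_x_prediction(predict_x(x_t, t), x_t, t)
        x_t = [xi + dt * vi for xi, vi in zip(x_t, v)]
    return x_t

rng = random.Random(0)
data = [0.02, -0.01, 0.03, 0.0]            # toy low-amplitude "waveform"
noise = [rng.gauss(0.0, 1.0) for _ in data]
oracle = lambda x_t, t: data               # hypothetical perfect predictor
sample = euler_sample(oracle, noise, num_steps=10)
```

With a perfect predictor the sampler lands on the data, which is only a sanity check of the conversion between x-prediction and velocity. It says nothing about how well a real network trains on raw waveforms.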

What carries the argument

Waveform patchify that reshapes raw audio into 2D token grids together with amplitude lifting to align signal scales, enabling stable direct x-prediction in flow matching.
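
The two preprocessing steps can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the page only says that the waveform is reshaped into a 2D token grid and that amplitudes are lifted to align scales. Zero-padding the tail and using a constant gain for the lifting are assumptions made here, and `patch_size` and `scale` are hypothetical parameters.

```python
def patchify(waveform, patch_size):
    """Reshape a 1D waveform into a (num_tokens, patch_size) grid.

    The tail is zero-padded when the length is not a multiple of patch_size.
    """
    padded = list(waveform)
    remainder = len(padded) % patch_size
    if remainder:
        padded += [0.0] * (patch_size - remainder)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

def unpatchify(grid, length):
    """Flatten the grid back to 1D and drop the padding."""
    flat = [sample for row in grid for sample in row]
    return flat[:length]

def lift_amplitude(grid, scale):
    """Assumed form of amplitude lifting: multiply every sample by a constant gain."""
    return [[sample * scale for sample in row] for row in grid]

def lower_amplitude(grid, scale):
    """Inverse of lift_amplitude, applied after generation."""
    return [[sample / scale for sample in row] for row in grid]

wave = [0.01 * i for i in range(10)]       # toy 10-sample waveform
grid = patchify(wave, patch_size=4)        # 3 tokens of 4 samples each
lifted = lift_amplitude(grid, scale=4.0)   # hypothetical gain
restored = unpatchify(lower_amplitude(lifted, scale=4.0), len(wave))
```

Both steps are invertible, so nothing is compressed away: the model still has to generate every sample.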

If this is right

High-quality audio synthesis can proceed without information loss from latent compression.
Multimodal generation pipelines become simpler by removing the need for separate compression and decompression stages.
Direct waveform modeling supports learning of semantic alignment and temporal synchronization from raw signals.
Scalability improves because the framework avoids the added complexity of intermediate representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same patchify and lifting steps could be tested on other high-dimensional signals such as raw video or sensor data.
Removing latent stages might lower total memory and compute costs when deploying generation models at scale.
The method opens a route to explore whether flow matching or other frameworks can operate directly on waveforms in music or speech domains.

Load-bearing premise

Reshaping raw audio into 2D token grids via waveform patchify combined with amplitude lifting will enable stable direct x-prediction optimization in flow matching despite the high dimensionality and low energy of waveform signals.
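
A short worked calculation shows why low signal energy is a problem and how a gain could help. It assumes unit-variance Gaussian noise, a straight-line path from noise to data, and a constant gain g as the form of amplitude lifting. None of these is confirmed by this page as the paper's exact choice, and the symbols below are introduced only for illustration.

```latex
% Assumed path: \epsilon \sim \mathcal{N}(0, I) is noise, x is the clean waveform, t \in [0, 1].
x_t = (1 - t)\,\epsilon + t\,x

% Per-sample signal-to-noise ratio when the waveform has standard deviation \sigma_x:
\mathrm{SNR}(t) = \frac{t^2 \sigma_x^2}{(1 - t)^2}

% After a hypothetical constant gain g is applied to the data (x \mapsto g\,x):
\mathrm{SNR}_g(t) = \frac{t^2 g^2 \sigma_x^2}{(1 - t)^2} = g^2\,\mathrm{SNR}(t)
```

Under these assumptions, a waveform with \sigma_x much smaller than 1 leaves x_t dominated by noise until t is close to 1, so most training timesteps carry little usable signal. Raising the gain shifts the point where SNR(t) = 1 to earlier t.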

What would settle it

A model trained with the same waveform patchify and amplitude lifting that produces substantially worse FD, IS, or DeSync scores than latent-based methods on VGGSound or AudioCaps would show the direct approach does not support competitive synthesis.
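
For reference, the standard definitions behind FD and IS are given below. Reading the subscripts (PaSST, PANNs) as the pretrained audio model used to extract embeddings or class posteriors is an interpretation, since this page does not spell the metrics out. DeSync is not defined on this page and is left out.

```latex
% Fréchet distance between Gaussians fitted to reference (r) and generated (g) embeddings; lower is better.
\mathrm{FD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)

% Inception score from classifier posteriors p(y \mid x) over generated samples x; higher is better.
\mathrm{IS} = \exp\!\left( \mathbb{E}_{x}\!\left[ D_{\mathrm{KL}}\big( p(y \mid x) \,\Vert\, p(y) \big) \right] \right)
```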

Original abstract

Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

WavFlow gets competitive benchmark scores by patching raw waveforms and lifting amplitudes for direct flow matching, but the evidence that those steps actually stabilize training is missing.

The key point is that this paper shows you can skip latent compression for audio generation and still hit decent numbers on VGGSound and AudioCaps by turning waveforms into 2D patches, lifting amplitudes, and running flow matching with x-prediction on 5 million curated triplets. That direct approach is the main novelty, and it does line up with the reported FD and IS scores that match or beat some established latent methods.

Referee Report

2 major / 2 minor

Summary. The paper introduces WavFlow, a framework for high-fidelity audio generation directly in raw waveform space rather than latent representations. It reshapes audio into 2D token grids via waveform patchify and applies amplitude lifting to enable stable direct x-prediction optimization within a flow-matching objective. The model is trained from scratch on a curated set of 5 million video-text-audio triplets and evaluated on the VGGSound video-to-audio benchmark (reporting FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the AudioCaps text-to-audio benchmark (FD_PANNs: 10.63, IS_PANNs: 12.62), claiming competitive or superior performance relative to established latent-based methods and demonstrating that intermediate compression is not required.

Significance. If the results hold, the work would be significant for challenging the prevailing latent-compression paradigm in audio and multimodal generation. It provides concrete evidence that direct waveform modeling can achieve competitive benchmark scores on video-to-audio and text-to-audio tasks while using a large-scale curated dataset of 5M triplets. The approach offers a simpler pipeline that avoids potential information loss from autoencoders and could improve scalability, with the reported metrics allowing direct comparison to prior latent-based baselines.

major comments (2)

[Abstract and framework description] The central claim that waveform patchify plus amplitude lifting suffices to make direct x-prediction flow matching tractable on high-dimensional, low-energy waveforms is load-bearing for the contribution, yet the manuscript supplies no ablation studies (with vs. without lifting or patchify), training loss curves, gradient norm statistics, or convergence diagnostics to substantiate that these steps resolve the stated optimization difficulties.
[Experimental results] The reported benchmark scores (e.g., FD_PaSST 59.98 on VGGSound) are presented without error bars, standard deviations across seeds, or multiple-run statistics, which weakens the ability to assess whether the competitive performance against latent baselines is statistically robust.
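
The multiple-run statistics requested in the second comment could be reported with something as simple as the sketch below. The function name is hypothetical and the input values are arbitrary placeholders, not scores from the paper. The interval uses a normal approximation, which is crude for a handful of seeds. A Student-t interval would be more appropriate there.

```python
import math
import statistics

def summarize_runs(scores, z=1.96):
    """Return mean, sample standard deviation, and an approximate 95% half-width."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)                 # sample std (n - 1 denominator)
    half_width = z * std / math.sqrt(len(scores))  # normal approximation
    return mean, std, half_width

placeholder_scores = [1.0, 2.0, 3.0, 4.0]          # arbitrary numbers, not results
mean, std, half_width = summarize_runs(placeholder_scores)
print(f"{mean:.2f} ± {half_width:.2f} (std {std:.2f}, n={len(placeholder_scores)})")
```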

minor comments (2)

[Data curation] The automated data pipeline for curating the 5M triplets would benefit from explicit details on filtering thresholds and quality metrics to support reproducibility.
[Method] Notation for the amplitude lifting scale and its interaction with the flow-matching velocity field should be clarified with an explicit equation in the method section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below, indicating the revisions we will make to the manuscript.

Referee: [Abstract and framework description] The central claim that waveform patchify plus amplitude lifting suffices to make direct x-prediction flow matching tractable on high-dimensional, low-energy waveforms is load-bearing for the contribution, yet the manuscript supplies no ablation studies (with vs. without lifting or patchify), training loss curves, gradient norm statistics, or convergence diagnostics to substantiate that these steps resolve the stated optimization difficulties.

Authors: We agree that the manuscript would benefit from explicit empirical support for these design choices. In the revised version we will add a dedicated ablation study (in the main text or a new appendix) that trains variants without amplitude lifting and without waveform patchify, reporting both final metrics and training dynamics. We will also include training loss curves, gradient norm statistics over the course of optimization, and convergence diagnostics to demonstrate the stability gains these components provide.
revision: yes

Referee: [Experimental results] The reported benchmark scores (e.g., FD_PaSST 59.98 on VGGSound) are presented without error bars, standard deviations across seeds, or multiple-run statistics, which weakens the ability to assess whether the competitive performance against latent baselines is statistically robust.

Authors: We acknowledge that reporting variability strengthens claims of robustness. Because of the high computational cost of training on the 5M triplet dataset, our primary results reflect a single training run. In the revision we will explicitly state this limitation, compare our single-run numbers to the single-run or unreported-variance numbers typical of prior latent-based baselines, and, if additional compute becomes available, report a small number of additional seeds. We will also add a brief discussion of statistical considerations in the experimental section.
revision: partial

Circularity Check

0 steps flagged

No circularity; external benchmarks validate independent claims

The paper asserts that waveform patchify plus amplitude lifting enables stable direct x-prediction flow matching on raw high-dimensional audio, then reports competitive scores on VGGSound (FD_PaSST 59.98, IS_PANNs 17.40, DeSync 0.44) and AudioCaps (FD_PANNs 10.63, IS_PANNs 12.62) using standard external metrics. These results do not reduce by construction to any fitted parameters, self-defined quantities, or self-citations within the paper; the benchmarks and metrics are independent of the internal preprocessing choices. No equations, uniqueness theorems, or ansatzes are shown to be justified only by prior self-work or by renaming the input. The derivation chain therefore remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard flow-matching assumptions plus two paper-specific modeling choices whose justification is not independently verified in the provided abstract.

free parameters (1)

amplitude lifting scale
Introduced to align signal scales for stable optimization; value not stated in abstract.

axioms (1)

domain assumption: Flow matching remains stable and effective when applied directly to high-dimensional, low-energy waveform data after 2D patching and amplitude lifting.
Invoked to justify direct x-prediction without latent compression.

pith-pipeline@v0.9.0 · 5776 in / 1359 out tokens · 39715 ms · 2026-05-20T07:29:50.376697+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching"

IndisputableMonolith/Foundation/AbsoluteFloorClosure · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "To overcome the inherent difficulties of modeling high-dimensional and low-energy signals"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

