SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Abhinaba Roy; Ambuj Mehrish; Dorien Herremans; Jan Melechovsky

arxiv: 2508.03448 · v3 · submitted 2025-08-05 · 💻 cs.SD · cs.AI· cs.MM· eess.AS

SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Jan Melechovsky , Ambuj Mehrish , Abhinaba Roy , Dorien Herremans This is my paper

Pith reviewed 2026-05-19 00:22 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.MMeess.AS

keywords music restorationaudio masteringgenerative modelsflow matchingtext-conditioned generationaudio degradation simulationunified audio enhancement

0 comments

The pith

SonicMaster is a single generative model that restores and masters music recordings using text instructions to fix multiple audio problems at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that one model can handle restoration and mastering for a wide range of common audio defects by learning from simulated examples and following natural language directions. It builds a dataset by applying nineteen degradation functions across equalization, dynamics, reverb, amplitude, and stereo categories to clean tracks, then trains a flow-matching system to reverse those degradations under text guidance or in automatic mode. A sympathetic reader would care because this replaces a collection of separate tools and manual tweaks with a single accessible system for non-professional recordings. Objective metrics improve across artifact types and listeners favor the outputs over baselines.

Core claim

SonicMaster is the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. It is conditioned on natural language instructions to apply targeted enhancements or can operate automatically. The model learns the required audio transformation through a flow-matching generative training paradigm on the SonicMaster dataset of paired degraded and high-quality tracks created with nineteen degradation functions in five enhancement groups.

What carries the argument

A flow-matching generative model conditioned on text prompts that maps degraded audio inputs to their cleaned and mastered versions.

Load-bearing premise

The nineteen hand-chosen degradation functions applied to high-quality tracks produce training examples that accurately represent real acoustic and processing artifacts found in non-professional recordings.

What would settle it

A blind listening test or objective evaluation performed on a collection of actual non-professional recordings that shows no quality gain or listener preference for SonicMaster outputs compared with unprocessed or baseline-processed versions.

Figures

Figures reproduced from arXiv: 2508.03448 by Abhinaba Roy, Ambuj Mehrish, Dorien Herremans, Jan Melechovsky.

**Figure 2.** Figure 2: Overall architecture of SonicMaster. Rectified Flow Training: SonicMaster employs rectified flow (Liu et al., 2022; Esser et al., 2024), to predict flow velocity from degraded to clean audio in latent space, unlike other models that map noise to output distributions (Fei et al., 2024; Hung et al., 2024). We assign timestep t = 1 to the latent representation of the degraded audio x1, and t = 0 to the latent… view at source ↗

**Figure 3.** Figure 3: Original vs. degraded (via convolution with a phone microphone transfer function) and [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of spectrograms: (a) ground truth, (b) degraded with reverb, and (c) the output [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Effect of reverberation (example from the main text in larger size): top panel shows [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

**Figure 6.** Figure 6: Effect of clarity degradation and restoration on spectrograms. The treble frequencies are [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: Effect of clipping degradation and related restoration. Drum hits clip in the degraded [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Another example of the effect of clipping and its restoration. The degraded input shows [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

read the original abstract

Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster's enhanced outputs over other baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SonicMaster puts text control on a flow-matching model for combined music restoration and mastering, but the synthetic training data is the part that needs real-world checking.

read the letter

The main takeaway is that this paper builds a single generative system that takes degraded music and applies targeted fixes or general cleanup based on text prompts. It is the first to wrap restoration and mastering together under language control instead of separate tools or fixed pipelines. They train it with flow-matching on a new paired dataset made by running nineteen degradation functions across five groups on clean tracks. That combination of all-in-one scope plus controllable prompts is the actual new piece here, and the dataset construction is a practical step that others can build on. The reported gains in objective metrics and listener preference on their synthetic test set show the model can learn the mapping at least in the lab setting. Credit for shipping a working prototype that handles multiple artifact types without needing separate models for each. The soft spot sits in the training data itself. Those nineteen functions are applied independently to high-quality tracks, which produces clean paired examples but leaves out the coupled, nonlinear problems that show up in actual non-professional recordings, such as mic distortion interacting with room modes and lossy encoding. If the learned distribution stays tied to the additive simulations, the improvements seen on held-out synthetic data may not translate to real user material. The abstract gives no numbers, baseline comparisons, or statistical tests, so the strength of the claims rests on whatever details appear in the full experiments. This paper is for people working on generative audio or controllable music production tools. A reader who wants to see how flow-matching can be steered by text for audio cleanup will find usable ideas in the dataset and conditioning setup. It is coherent enough on its own terms to deserve peer review, though any referee should ask for validation on authentic degraded recordings rather than only simulated ones. I would send it forward with that specific request for additional evidence.

Referee Report

2 major / 1 minor

Summary. The paper introduces SonicMaster, the first unified generative model for music restoration and mastering that uses text-based control. It constructs a large synthetic paired dataset by applying nineteen degradation functions across five groups (equalization, dynamics, reverb, amplitude, stereo) to high-quality tracks, then trains a flow-matching model to map degraded inputs to cleaned/mastered outputs, either via text prompts or in automatic mode. Objective metrics are said to improve across artifact categories and subjective listening tests show listener preference over baselines.

Significance. If the approach generalizes beyond synthetic degradations, the work could meaningfully advance controllable all-in-one audio processing by combining restoration and mastering in a single text-guided generative framework. The flow-matching training paradigm and construction of a large paired synthetic dataset are clear technical strengths that enable the unified model.

major comments (2)

[SonicMaster Dataset] SonicMaster Dataset construction: The training data are generated by applying the nineteen degradation functions independently to clean tracks. Real non-professional artifacts frequently arise from coupled nonlinear processes (e.g., microphone preamp distortion interacting with room modes and lossy encoding) absent from this additive simulation. This assumption is load-bearing for the central claim that the model addresses authentic audio artifacts in non-professional recordings.
[Evaluation] Evaluation section: The abstract states that objective metrics improve across artifact categories and listeners prefer the outputs, yet the manuscript provides no quantitative values, baseline details, or statistical tests. Without these specifics it is impossible to judge the magnitude or reliability of the reported gains on either synthetic or real data.

minor comments (1)

[Abstract] The abstract would benefit from one or two concrete metric values or effect sizes to allow readers to assess the scale of improvement immediately.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we plan to make.

read point-by-point responses

Referee: [SonicMaster Dataset] SonicMaster Dataset construction: The training data are generated by applying the nineteen degradation functions independently to clean tracks. Real non-professional artifacts frequently arise from coupled nonlinear processes (e.g., microphone preamp distortion interacting with room modes and lossy encoding) absent from this additive simulation. This assumption is load-bearing for the central claim that the model addresses authentic audio artifacts in non-professional recordings.

Authors: We agree that applying the nineteen degradation functions independently does not capture the coupled nonlinear interactions typical of real non-professional recordings. This is a genuine limitation of the current synthetic dataset. In the revised manuscript we will add an explicit discussion of this assumption in the dataset construction section, qualify the central claim accordingly, and include qualitative results on real-world examples. We will also outline future directions for incorporating more complex coupled degradations. revision: partial
Referee: [Evaluation] Evaluation section: The abstract states that objective metrics improve across artifact categories and listeners prefer the outputs, yet the manuscript provides no quantitative values, baseline details, or statistical tests. Without these specifics it is impossible to judge the magnitude or reliability of the reported gains on either synthetic or real data.

Authors: The referee is correct that the current manuscript version lacks specific quantitative values, baseline details, and statistical test results. We will revise the Evaluation section to include comprehensive tables reporting exact objective metric improvements (e.g., SI-SDR, PESQ, and other measures), full baseline descriptions, and statistical significance tests. Key numerical results and a summary table will also be added to the abstract and results overview to enable assessment of gains on both synthetic and real data. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper constructs a synthetic training dataset by applying nineteen hand-specified degradation functions across five groups to clean tracks, then trains a standard flow-matching generative model conditioned on text prompts to map degraded inputs to restored outputs. Objective metrics and subjective tests evaluate the resulting generations without any load-bearing equations, predictions, or uniqueness claims reducing by construction to fitted parameters or self-defined targets. The central claim of unified text-controlled restoration rests on empirical performance of this pipeline rather than tautological re-derivation of inputs, rendering the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of synthetic degradation simulation and the suitability of flow-matching for audio-to-audio transformation; no new physical entities are introduced.

free parameters (1)

Degradation function parameters
Specific settings for the nineteen functions across equalization, dynamics, reverb, amplitude, and stereo groups were selected to generate training pairs.

axioms (1)

domain assumption Simulated degradations are representative of real non-professional recording artifacts
All training data is generated by applying the chosen functions to clean tracks rather than collecting genuine low-quality recordings.

pith-pipeline@v0.9.0 · 5743 in / 1273 out tokens · 104807 ms · 2026-05-19T00:22:01.224939+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We construct the SonicMaster dataset... by simulating common degradation types with nineteen degradation functions belonging to five enhancements groups

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 3 internal anchors

[1]

Improvnet–generating controllable musical improvisations with itera- tive corruption refinement.arXiv preprint arXiv:2502.04522,

Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R Enus, Louis B Bradshaw, Dorien Herre- mans, and Simon Colton. Improvnet–generating controllable musical improvisations with itera- tive corruption refinement.arXiv preprint arXiv:2502.04522,

work page arXiv
[2]

Text2fx: Harnessing clap embeddings for text-guided audio effects

Annie Chu, Patrick O’Reilly, Julia Barnett, and Bryan Pardo. Text2fx: Harnessing clap embeddings for text-guided audio effects. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025
[3]

Clap learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2023
[4]

Flux that plays music

Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Flux that plays music. arXiv:2409.00587,

work page arXiv
[5]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Editing music with melody and text: Using controlnet for diffusion transformer

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, and Chao Zhang. Editing music with melody and text: Using controlnet for diffusion transformer. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5,

work page 2025
[7]

doi: 10.1109/ICASSP49660.2025.10890309. David M. Howard and James A. S. Angus. Open acoustic impulse response (open air) library. https://www.openair.hosted.york.ac.uk/, n.d. Accessed: 2025-07-06. Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. Tangoflux: Super...

work page doi:10.1109/icassp49660.2025.10890309 2025
[8]

Distortion audio effects: Learning how to recover the clean signal

10 Johannes Imort, Giorgio Fabbro, Marco A Mart ´ınez Ram´ırez, Stefan Uhlich, Yuichiro Koyama, and Yuki Mitsufuji. Distortion audio effects: Learning how to recover the clean signal. arXiv:2202.01664,

work page arXiv
[9]

Music enhancement via image translation and vocoding

Nikhil Kandpal, Oriol Nieto, and Zeyu Jin. Music enhancement via image translation and vocoding. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pp. 3124–3128. IEEE,

work page 2022
[10]

Searching for music mixing graphs: A pruning approach.arXiv preprint arXiv:2406.01049,

Sungho Lee, Marco A Mart´ınez-Ram´ırez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, and Yuki Mitsufuji. Searching for music mixing graphs: A pruning approach.arXiv preprint arXiv:2406.01049,

work page arXiv
[11]

Masksr: Masked language model for full-band speech restora- tion.arXiv:2406.02092,

Xu Li, Qirui Wang, and Xiaoyu Liu. Masksr: Masked language model for full-band speech restora- tion.arXiv:2406.02092,

work page arXiv
[12]

V oicefixer: Toward general speech restoration with neural vocoder.arXiv:2109.13731,

Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. V oicefixer: Toward general speech restoration with neural vocoder.arXiv:2109.13731,

work page arXiv
[13]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv:2209.03003,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Zero-shot unsupervised and text-based audio editing using ddpm inversion.arXiv preprint arXiv:2402.10009,

Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using ddpm inversion.arXiv preprint arXiv:2402.10009,

work page arXiv
[15]

Automatic music mixing with deep learning and out-of-domain data.arXiv preprint arXiv:2208.11428,

Marco A Mart´ınez-Ram´ırez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, and Yuki Mitsufuji. Automatic music mixing with deep learning and out-of-domain data.arXiv preprint arXiv:2208.11428,

work page arXiv
[16]

Mustango: Toward controllable text-to-music generation

Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Sou- janya Poria. Mustango: Toward controllable text-to-music generation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Papers), pp. 8286–8309,

work page 2024
[17]

Automatic equalization for individual instrument tracks using convolutional neural networks.arXiv:2407.16691,

Florian Mockenhaupt, Joscha Simon Rieber, and Shahan Nercessian. Automatic equalization for individual instrument tracks using convolutional neural networks.arXiv:2407.16691,

work page arXiv
[18]

Diffusion-based audio inpainting.arXiv:2305.15266,

Eloi Moliner and Vesa V¨alim¨aki. Diffusion-based audio inpainting.arXiv:2305.15266,

work page arXiv
[19]

Eloi Moliner, Filip Elvander, and Vesa V ¨alim¨aki

doi: 10.1109/TASLP.2022.3190726. Eloi Moliner, Filip Elvander, and Vesa V ¨alim¨aki. Blind audio bandwidth extension: A diffusion- based zero-shot approach.IEEE/ACM Transactions on Audio, Speech, and Language Processing,

work page doi:10.1109/taslp.2022.3190726 2022
[20]

Exploring trends in audio mixes and masters: Insights from a dataset analysis.arXiv:2412.03373,

11 Angeliki Mourgela, Elio Quinton, Spyridon Bissas, Joshua D Reiss, and David Ronan. Exploring trends in audio mixes and masters: Insights from a dataset analysis.arXiv:2412.03373,

work page arXiv
[21]

Automatic multitrack mixing with a differentiable mixing console of neural audio effects

Christian J Steinmetz, Jordi Pons, Santiago Pascual, and Joan Serr `a. Automatic multitrack mixing with a differentiable mixing console of neural audio effects. InICASSP 2021-2021 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 71–75. IEEE,

work page 2021
[22]

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound.arXiv:2502.05139,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Flowsep: Language-queried sound separation with rectified flow matching

Yi Yuan, Xubo Liu, Haohe Liu, Mark D Plumbley, and Wenwu Wang. Flowsep: Language-queried sound separation with rectified flow matching. InICASSP 2025-2025 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025
[24]

Music source restoration

Yongyi Zang, Zheqi Dai, Mark D Plumbley, and Qiuqiang Kong. Music source restoration. arXiv:2505.21827,

work page arXiv
[25]

Please, reduce the strong echo in this song

(a) Original (clean). (b) Degraded input. (c) Restored output. Figure 3: Original vs. degraded (via convolution with a phone microphone transfer function) and SonicMaster-restored spectrograms; restoration suppresses the microphone’s coloration. A.4 SPECTROGRAM EXAMPLES We visualize time–frequency structure in spectrograms to provide qualitative evidence ...

work page arXiv
[26]

Tables 10, 11, and 12 outline the results and high- light the trade-off across degradation categories

solver. Tables 10, 11, and 12 outline the results and high- light the trade-off across degradation categories. Euler-1 matches the baseline overall but is weaker on Boom, Microphone, Clip, all Reverb subtasks, and shows higher KL. Euler-100 boosts Reverb 15 (a) Original (clean). (b) Degraded input. (c) Restored output. Figure 5: Effect of reverberation (e...

work page arXiv
[27]

remove excess reverb and make it sound cleaner,

and Punch yet lowers every EQ score versus the 1-/10-step runs. Runge–Kutta-10 equals Euler-10 on most metrics and tops Clip, but its inference is significantly slower. A.6 PROMPTS FOR EACH DEGRADATION TYPE Prompt instructions for each degradation type are grouped by audio attribute in Table 13; for ex- ample, entries for Xband, microphone coloration, cla...

work page arXiv

[1] [1]

Improvnet–generating controllable musical improvisations with itera- tive corruption refinement.arXiv preprint arXiv:2502.04522,

Keshav Bhandari, Sungkyun Chang, Tongyu Lu, Fareza R Enus, Louis B Bradshaw, Dorien Herre- mans, and Simon Colton. Improvnet–generating controllable musical improvisations with itera- tive corruption refinement.arXiv preprint arXiv:2502.04522,

work page arXiv

[2] [2]

Text2fx: Harnessing clap embeddings for text-guided audio effects

Annie Chu, Patrick O’Reilly, Julia Barnett, and Bryan Pardo. Text2fx: Harnessing clap embeddings for text-guided audio effects. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025

[3] [3]

Clap learning audio concepts from natural language supervision

Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2023

[4] [4]

Flux that plays music

Zhengcong Fei, Mingyuan Fan, Changqian Yu, and Junshi Huang. Flux that plays music. arXiv:2409.00587,

work page arXiv

[5] [5]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv:2207.12598,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Editing music with melody and text: Using controlnet for diffusion transformer

Siyuan Hou, Shansong Liu, Ruibin Yuan, Wei Xue, Ying Shan, Mangsuo Zhao, and Chao Zhang. Editing music with melody and text: Using controlnet for diffusion transformer. InICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5,

work page 2025

[7] [7]

doi: 10.1109/ICASSP49660.2025.10890309. David M. Howard and James A. S. Angus. Open acoustic impulse response (open air) library. https://www.openair.hosted.york.ac.uk/, n.d. Accessed: 2025-07-06. Chia-Yu Hung, Navonil Majumder, Zhifeng Kong, Ambuj Mehrish, Amir Ali Bagherzadeh, Chuan Li, Rafael Valle, Bryan Catanzaro, and Soujanya Poria. Tangoflux: Super...

work page doi:10.1109/icassp49660.2025.10890309 2025

[8] [8]

Distortion audio effects: Learning how to recover the clean signal

10 Johannes Imort, Giorgio Fabbro, Marco A Mart ´ınez Ram´ırez, Stefan Uhlich, Yuichiro Koyama, and Yuki Mitsufuji. Distortion audio effects: Learning how to recover the clean signal. arXiv:2202.01664,

work page arXiv

[9] [9]

Music enhancement via image translation and vocoding

Nikhil Kandpal, Oriol Nieto, and Zeyu Jin. Music enhancement via image translation and vocoding. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Process- ing (ICASSP), pp. 3124–3128. IEEE,

work page 2022

[10] [10]

Searching for music mixing graphs: A pruning approach.arXiv preprint arXiv:2406.01049,

Sungho Lee, Marco A Mart´ınez-Ram´ırez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, and Yuki Mitsufuji. Searching for music mixing graphs: A pruning approach.arXiv preprint arXiv:2406.01049,

work page arXiv

[11] [11]

Masksr: Masked language model for full-band speech restora- tion.arXiv:2406.02092,

Xu Li, Qirui Wang, and Xiaoyu Liu. Masksr: Masked language model for full-band speech restora- tion.arXiv:2406.02092,

work page arXiv

[12] [12]

V oicefixer: Toward general speech restoration with neural vocoder.arXiv:2109.13731,

Haohe Liu, Qiuqiang Kong, Qiao Tian, Yan Zhao, DeLiang Wang, Chuanzeng Huang, and Yuxuan Wang. V oicefixer: Toward general speech restoration with neural vocoder.arXiv:2109.13731,

work page arXiv

[13] [13]

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow.arXiv:2209.03003,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Zero-shot unsupervised and text-based audio editing using ddpm inversion.arXiv preprint arXiv:2402.10009,

Hila Manor and Tomer Michaeli. Zero-shot unsupervised and text-based audio editing using ddpm inversion.arXiv preprint arXiv:2402.10009,

work page arXiv

[15] [15]

Automatic music mixing with deep learning and out-of-domain data.arXiv preprint arXiv:2208.11428,

Marco A Mart´ınez-Ram´ırez, Wei-Hsiang Liao, Giorgio Fabbro, Stefan Uhlich, Chihiro Nagashima, and Yuki Mitsufuji. Automatic music mixing with deep learning and out-of-domain data.arXiv preprint arXiv:2208.11428,

work page arXiv

[16] [16]

Mustango: Toward controllable text-to-music generation

Jan Melechovsky, Zixun Guo, Deepanway Ghosal, Navonil Majumder, Dorien Herremans, and Sou- janya Poria. Mustango: Toward controllable text-to-music generation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies (Volume 1: Long Papers), pp. 8286–8309,

work page 2024

[17] [17]

Automatic equalization for individual instrument tracks using convolutional neural networks.arXiv:2407.16691,

Florian Mockenhaupt, Joscha Simon Rieber, and Shahan Nercessian. Automatic equalization for individual instrument tracks using convolutional neural networks.arXiv:2407.16691,

work page arXiv

[18] [18]

Diffusion-based audio inpainting.arXiv:2305.15266,

Eloi Moliner and Vesa V¨alim¨aki. Diffusion-based audio inpainting.arXiv:2305.15266,

work page arXiv

[19] [19]

Eloi Moliner, Filip Elvander, and Vesa V ¨alim¨aki

doi: 10.1109/TASLP.2022.3190726. Eloi Moliner, Filip Elvander, and Vesa V ¨alim¨aki. Blind audio bandwidth extension: A diffusion- based zero-shot approach.IEEE/ACM Transactions on Audio, Speech, and Language Processing,

work page doi:10.1109/taslp.2022.3190726 2022

[20] [20]

Exploring trends in audio mixes and masters: Insights from a dataset analysis.arXiv:2412.03373,

11 Angeliki Mourgela, Elio Quinton, Spyridon Bissas, Joshua D Reiss, and David Ronan. Exploring trends in audio mixes and masters: Insights from a dataset analysis.arXiv:2412.03373,

work page arXiv

[21] [21]

Automatic multitrack mixing with a differentiable mixing console of neural audio effects

Christian J Steinmetz, Jordi Pons, Santiago Pascual, and Joan Serr `a. Automatic multitrack mixing with a differentiable mixing console of neural audio effects. InICASSP 2021-2021 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 71–75. IEEE,

work page 2021

[22] [22]

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra, Yi-Chiao Wu, Baishan Guo, John Hoffman, Brian Ellis, Apoorv Vyas, Bowen Shi, Sanyuan Chen, Matt Le, Nick Zacharov, et al. Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound.arXiv:2502.05139,

work page internal anchor Pith review Pith/arXiv arXiv

[23] [23]

Flowsep: Language-queried sound separation with rectified flow matching

Yi Yuan, Xubo Liu, Haohe Liu, Mark D Plumbley, and Wenwu Wang. Flowsep: Language-queried sound separation with rectified flow matching. InICASSP 2025-2025 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

work page 2025

[24] [24]

Music source restoration

Yongyi Zang, Zheqi Dai, Mark D Plumbley, and Qiuqiang Kong. Music source restoration. arXiv:2505.21827,

work page arXiv

[25] [25]

Please, reduce the strong echo in this song

(a) Original (clean). (b) Degraded input. (c) Restored output. Figure 3: Original vs. degraded (via convolution with a phone microphone transfer function) and SonicMaster-restored spectrograms; restoration suppresses the microphone’s coloration. A.4 SPECTROGRAM EXAMPLES We visualize time–frequency structure in spectrograms to provide qualitative evidence ...

work page arXiv

[26] [26]

Tables 10, 11, and 12 outline the results and high- light the trade-off across degradation categories

solver. Tables 10, 11, and 12 outline the results and high- light the trade-off across degradation categories. Euler-1 matches the baseline overall but is weaker on Boom, Microphone, Clip, all Reverb subtasks, and shows higher KL. Euler-100 boosts Reverb 15 (a) Original (clean). (b) Degraded input. (c) Restored output. Figure 5: Effect of reverberation (e...

work page arXiv

[27] [27]

remove excess reverb and make it sound cleaner,

and Punch yet lowers every EQ score versus the 1-/10-step runs. Runge–Kutta-10 equals Euler-10 on most metrics and tops Clip, but its inference is significantly slower. A.6 PROMPTS FOR EACH DEGRADATION TYPE Prompt instructions for each degradation type are grouped by audio attribute in Table 13; for ex- ample, entries for Xband, microphone coloration, cla...

work page arXiv