arxiv: 2605.09259 · v1 · submitted 2026-05-10 · 💻 cs.SD · cs.AI

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

Leduo Chen, Junchuan Zhao, Shengchen Li

Pith reviewed 2026-05-12 04:53 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords timbre transfer · diffusion models · polyphonic audio · joint stem modeling · audio style transfer · source separation avoidance · choral music · MixtureTT

The pith

A single joint diffusion process transfers specified timbres to every voice in a polyphonic mixture without first separating the stems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MixtureTT, which it presents as the first method to perform per-stem timbre transfer directly on a polyphonic audio mixture, by feeding the mixture and multiple timbre references into one shared diffusion process. Prior approaches separate the mixture into individual stems first and then transfer timbre to each stem separately, which accumulates separation artifacts and often produces timbrally or harmonically inconsistent results across the stems. The proposed joint stem diffusion transformer models content within each stem together with cross-stem harmonic relationships, eliminating the separation stage, lowering inference cost by a factor equal to the number of stems, and generating outputs that stay more coherent. Experiments on the SATB choral dataset show MixtureTT outperforming single-instrument baselines on both objective and subjective measures, even though it receives the harder mixture-level input.

Core claim

MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. Modeling the dependencies across the per-stem content and cross-stem harmonic, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics demonstrating the necessity of dedicated multi-instrument timbre transfer over the naive separate-then-transfer pipelines.

What carries the argument

The joint stem diffusion transformer that ingests the polyphonic mixture plus one timbre reference per target stem and models their mutual dependencies inside a single diffusion trajectory.

If this is right

Eliminates propagation of source-separation errors by operating directly on the mixture.
Cuts inference cost by a factor equal to the number of stems, because only one diffusion run is required instead of one per stem.
Yields more coherent multi-stem outputs through explicit modeling of cross-stem harmonic dependencies.
Demonstrates that dedicated joint modeling is required for mixture-level timbre transfer, as the joint system beats equivalent single-stem ablations on the SATB dataset.
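
The cost claim in the list above can be made concrete with a small count of denoiser forward passes. This is a minimal sketch under stated assumptions, not the paper's accounting: the function name `denoiser_calls` and the step count of 50 are hypothetical, every forward pass is assumed to cost the same, and the cost of the separation stage is ignored. In practice a joint pass processes all stem latents at once, so the per-pass cost may be higher than for a single stem.

```python
def denoiser_calls(num_stems: int, sampling_steps: int, joint: bool) -> int:
    """Count denoiser forward passes needed to re-instrument every stem."""
    if num_stems < 1 or sampling_steps < 1:
        raise ValueError("num_stems and sampling_steps must be positive")
    # One shared trajectory for the joint model, one trajectory per stem otherwise.
    runs = 1 if joint else num_stems
    return runs * sampling_steps


# Hypothetical setting: 4 SATB stems, 50 sampling steps (the step count is assumed).
separate = denoiser_calls(num_stems=4, sampling_steps=50, joint=False)
joint = denoiser_calls(num_stems=4, sampling_steps=50, joint=True)
print(separate, joint, separate // joint)
```

The ratio equals the number of stems only in terms of diffusion runs; a FLOP-level comparison would need the per-pass cost of each model.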

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint-diffusion principle could be tested on other multi-source audio tasks such as speech separation or environmental sound editing where source interactions matter.
Real-time music production tools might become feasible if the single-run cost reduction holds at larger model scales.
The result suggests that any generative audio model handling multiple concurrent sources may benefit from explicit cross-source conditioning rather than independent processing.

Load-bearing premise

Joint cross-stem modeling during diffusion preserves each stem's original content and produces coherent multi-stem audio without creating new harmonic or timbral inconsistencies that separate-then-transfer pipelines avoid.

What would settle it

If a controlled test on non-choral polyphonic recordings shows that MixtureTT introduces more harmonic clashes or lower subjective quality than a strong separate-then-transfer baseline, the superiority of the joint approach would be falsified.

Figures

Figures reproduced from arXiv: 2605.09259 by Junchuan Zhao, Leduo Chen, Shengchen Li.

**Figure 1.** Top: the separate-then-transfer pipeline forced by single-instrument tools. Bottom: MixtureTT performs joint per-stem transfer directly from the mixture.

**Figure 2.** Overview of MixtureTT. The mixture is processed by a frozen Demucs encoder and a trainable dual-branch content adapter to produce per-stem content embeddings $c^{(i)}$. Timbre references are encoded by the frozen codec and a timbre encoder into global embeddings $\tau^{(i)}$. The Joint Stem DiT denoises N noisy stem latents jointly through three stages (Intra-Stem → Cross-Stem → Refinement), conditioned on content…
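
One way to read the Intra-Stem and Cross-Stem stages named in the Figure 2 caption is as two different groupings of the same token grid. The sketch below is an illustration under assumptions, not the paper's implementation: it assumes each stem latent is a sequence of frame tokens, that intra-stem attention runs along time within one stem, and that cross-stem attention runs across stems at a fixed frame. The function names are hypothetical, and the Refinement stage is left out because the caption does not say how it groups tokens.

```python
from typing import List, Tuple

Token = Tuple[int, int]  # (stem index, frame index)


def intra_stem_groups(num_stems: int, num_frames: int) -> List[List[Token]]:
    """One attention group per stem: tokens attend along time within a stem."""
    return [[(s, t) for t in range(num_frames)] for s in range(num_stems)]


def cross_stem_groups(num_stems: int, num_frames: int) -> List[List[Token]]:
    """One attention group per frame: tokens attend across stems at that frame."""
    return [[(s, t) for s in range(num_stems)] for t in range(num_frames)]


# Toy grid: 4 stems (as in SATB) and 3 frames.
intra = intra_stem_groups(num_stems=4, num_frames=3)
cross = cross_stem_groups(num_stems=4, num_frames=3)
print(len(intra), len(intra[0]))  # number of groups, tokens per group
print(len(cross), len(cross[0]))
```

Under this reading, removing the cross-stem groups leaves each stem processed independently, which resembles the single-stem ablation the review discusses.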

**Figure 3.** Qualitative result of MixtureTT jointly generating the re-instrumented stems from an input mixture and four timbre references.

**Figure 4.** Subjective MOS evaluation (1–5 scale).

Original abstract

Timbre transfer aims to modify the timbral identity of a musical recording while preserving the original melody and rhythm. While single-instrument timbre transfer has made substantial progress, existing approaches to multi-instrument settings rely on separate-then-transfer pipelines that propagate source separation artifacts and produce incoherent synthesized timbres across stems. This paper proposes MixtureTT, to the best of our knowledge the first system for flexible per-stem timbre transfer directly from a polyphonic mixture. Given a mixture and a separate timbre reference for each target voice, MixtureTT jointly transfers all stems to the specified instruments through a shared diffusion process. Modeling the dependencies across the per-stem content and cross-stem harmonic, the proposed joint stem diffusion transformer eliminates cascaded separation error, reduces inference cost by a factor equal to the number of stems, and yields more coherent multi-stem outputs. Despite operating under a strictly harder input condition, evaluations on the SATB choral dataset show that MixtureTT outperforms single-instrument baselines on both objective and subjective metrics demonstrating the necessity of dedicated multi-instrument timbre transfer over the naive separate-then-transfer pipelines. As a result, this work confirms that the cross-stem modeling is essential for mixture-level timbre transfer as the proposed joint setting consistently exceeds an equivalent single-stem ablation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MixtureTT proposes a joint diffusion transformer for direct per-stem timbre transfer from polyphonic mixtures, avoiding separation steps and showing gains on SATB choral data, though the narrow domain leaves generalization unclear.

The paper claims the first system for transferring timbres to multiple stems directly from a mixture input, using a shared diffusion transformer that takes per-stem timbre references and models cross-stem dependencies. It positions this as better than the usual separate-then-transfer route because it skips separation artifacts, cuts inference cost by a factor equal to the number of stems, and produces more coherent outputs overall.

Referee Report

3 major / 2 minor

Summary. The paper proposes MixtureTT, the first system for flexible per-stem timbre transfer directly from a polyphonic mixture via a shared diffusion process and joint stem diffusion transformer. Given a mixture plus per-stem timbre references, it claims to eliminate cascaded separation errors, reduce inference cost by a factor equal to the number of stems, produce more coherent multi-stem outputs, and outperform single-instrument baselines on both objective and subjective metrics on the SATB choral dataset, thereby demonstrating the necessity of joint cross-stem modeling over separate-then-transfer pipelines.

Significance. If the empirical results hold under proper scrutiny, the work offers a meaningful advance in multi-instrument timbre transfer by replacing error-prone cascaded pipelines with a single joint diffusion process. The claimed inference-cost reduction and coherence gains address practical bottlenecks in music production tools. The approach is timely given the prevalence of diffusion models in audio, though its significance is tempered by the narrow SATB vocal domain and absence of broader instrumental validation.

major comments (3)

[Abstract] The central claim that MixtureTT 'outperforms single-instrument baselines on both objective and subjective metrics' and 'demonstrat[es] the necessity of dedicated multi-instrument timbre transfer' is unsupported by any reported metric values, statistical tests, baseline implementation details, data splits, or ablation numbers, rendering the necessity argument unverifiable.
[Evaluation] No stem-wise quantitative checks (e.g., pitch F1, onset accuracy, or harmonic coherence scores) are provided to confirm that joint modeling preserves per-stem content and avoids new cross-stem timbral or harmonic inconsistencies; SATB metrics alone are insufficient to rule out the skeptic's concern that joint diffusion may introduce artifacts not present in single-stem baselines.
[§4 (model description) and results] The assertion that the joint transformer 'eliminates cascaded separation error' and 'yields more coherent multi-stem outputs' rests on an equivalent single-stem ablation, but without details on how the ablation was implemented or how coherence was measured, it is impossible to assess whether the gains are due to joint modeling or other factors.
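
The stem-wise pitch check requested in the second comment could look like the following. This is a minimal sketch of one possible frame-level pitch F1, not a metric defined by the paper: the function name `pitch_f1`, the convention that 0.0 marks an unvoiced frame, and the 50-cent tolerance are all assumptions made for illustration.

```python
import math
from typing import Sequence


def pitch_f1(ref_hz: Sequence[float], est_hz: Sequence[float],
             tolerance_cents: float = 50.0) -> float:
    """Frame-level pitch F1; a value of 0.0 marks an unvoiced frame."""
    if len(ref_hz) != len(est_hz):
        raise ValueError("tracks must have the same number of frames")
    true_pos = 0
    for ref, est in zip(ref_hz, est_hz):
        if ref > 0 and est > 0:
            # Pitch distance in cents between estimate and reference.
            cents = abs(1200.0 * math.log2(est / ref))
            if cents <= tolerance_cents:
                true_pos += 1
    ref_voiced = sum(1 for f in ref_hz if f > 0)
    est_voiced = sum(1 for f in est_hz if f > 0)
    if true_pos == 0 or ref_voiced == 0 or est_voiced == 0:
        return 0.0
    precision = true_pos / est_voiced
    recall = true_pos / ref_voiced
    return 2 * precision * recall / (precision + recall)


# Toy stem: one correct frame, one octave error, one missed voiced frame.
reference = [220.0, 220.0, 246.94, 0.0]
estimate = [220.0, 440.0, 0.0, 0.0]
score = pitch_f1(reference, estimate)
print(round(score, 3))
```

Computing this per generated stem against the source stem's pitch track would give the per-voice content-preservation evidence the comment asks for; the pitch tracks themselves would come from an external estimator.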

minor comments (2)

[Abstract] The phrase 'Modeling the dependencies across the per-stem content and cross-stem harmonic' is grammatically incomplete and should be clarified.
[§3] The paper would benefit from an explicit statement of the diffusion schedule and transformer hyperparameters in the main text rather than only in supplementary material.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We agree that several claims in the abstract and evaluation sections require additional supporting details to be fully verifiable. Below we address each major comment point by point and outline the revisions we will make to strengthen the manuscript.

Referee: [Abstract] The central claim that MixtureTT 'outperforms single-instrument baselines on both objective and subjective metrics' and 'demonstrat[es] the necessity of dedicated multi-instrument timbre transfer' is unsupported by any reported metric values, statistical tests, baseline implementation details, data splits, or ablation numbers, rendering the necessity argument unverifiable.

Authors: We acknowledge that the abstract presents strong claims without embedding specific numerical results or explicit references to supporting sections. In the revised manuscript we will augment the abstract with key objective metric improvements (e.g., timbre similarity and content-preservation deltas) and note that statistical significance was assessed via paired t-tests on the SATB test set. We will also add a brief parenthetical reference to the evaluation section for baseline implementation details, data splits, and ablation numbers. These changes will make the necessity argument directly verifiable from the abstract while preserving its length constraints.
Revision: yes

Referee: [Evaluation] No stem-wise quantitative checks (e.g., pitch F1, onset accuracy, or harmonic coherence scores) are provided to confirm that joint modeling preserves per-stem content and avoids new cross-stem timbral or harmonic inconsistencies; SATB metrics alone are insufficient to rule out the skeptic's concern that joint diffusion may introduce artifacts not present in single-stem baselines.

Authors: We agree that stem-wise diagnostics would strengthen the claim that joint diffusion preserves content. Although our current objective metrics (timbre transfer accuracy and overall perceptual quality) and subjective listening tests already indicate coherence across stems, we will add explicit per-stem analyses in the revised evaluation section. These will include pitch F1 scores, onset detection accuracy, and a cross-stem harmonic coherence measure computed on the generated SATB outputs. The new results will be compared directly against the single-stem baselines to demonstrate that joint modeling does not introduce additional artifacts.
Revision: yes

Referee: [§4 (model description) and results] The assertion that the joint transformer 'eliminates cascaded separation error' and 'yields more coherent multi-stem outputs' rests on an equivalent single-stem ablation, but without details on how the ablation was implemented or how coherence was measured, it is impossible to assess whether the gains are due to joint modeling or other factors.

Authors: We concur that the ablation description in §4 is insufficiently detailed. In the revision we will expand the model description to specify exactly how the single-stem ablation was implemented: each stem was processed independently by the same diffusion transformer architecture but with cross-stem attention disabled and no shared latent conditioning across voices. We will also detail the coherence metric used (a combination of inter-stem harmonic alignment scores and multi-stem consistency ratings from the listening study). These additions will allow readers to attribute the observed gains specifically to the joint modeling component.
Revision: yes
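
The paired t-test named in the first response compares two systems on the same test items. The sketch below shows only the test statistic, using made-up scores that are not values from the paper; the function name `paired_t_statistic` is hypothetical. Turning the statistic into a p-value needs the Student t distribution with n − 1 degrees of freedom, which the Python standard library does not provide (SciPy's `scipy.stats.ttest_rel` does).

```python
import math
import statistics
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    diffs = [x - y for x, y in zip(a, b)]
    std = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if std == 0:
        raise ValueError("all paired differences are identical")
    return statistics.mean(diffs) / (std / math.sqrt(len(diffs)))


# Made-up per-excerpt scores for illustration only (not results from the paper).
joint_scores = [4.0, 5.0, 3.0, 5.0]
single_scores = [3.0, 3.0, 2.0, 2.0]
t_value = paired_t_statistic(joint_scores, single_scores)
print(round(t_value, 3))
```

Pairing matters here because both systems are scored on the same excerpts, so the test works on per-excerpt differences rather than on two independent samples.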

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical validation

The paper introduces MixtureTT as a joint diffusion transformer for per-stem timbre transfer from mixtures and supports its advantages through standard training on diffusion objectives plus objective/subjective evaluations against single-stem baselines and ablations on the SATB dataset. No equations, predictions, or uniqueness claims reduce by construction to fitted inputs or self-citations; the architecture and cross-stem modeling are presented as novel design choices whose benefits are measured externally rather than defined into existence. This is the typical non-circular pattern for an empirical model-proposal paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard assumptions from diffusion-based generative modeling and transformer architectures for audio; no new physical entities or ad-hoc constants are introduced beyond typical model hyperparameters.

free parameters (1)

diffusion schedule and transformer hyperparameters
Standard tunable parameters in diffusion and transformer models; not specified in abstract but required for training.
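
To illustrate what a "diffusion schedule" free parameter looks like, here is a minimal sketch of the linear noise schedule from the DDPM formulation (Ho et al.), with its commonly used defaults of 1000 steps and variances from 1e-4 to 0.02. This is an illustration only: the schedule MixtureTT actually uses is not stated in this review, and the function names are hypothetical.

```python
from typing import List


def linear_beta_schedule(num_steps: int, beta_start: float = 1e-4,
                         beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances, as in the DDPM formulation."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas: List[float]) -> List[float]:
    """alpha_bar_t = product over s <= t of (1 - beta_s): signal variance kept."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


# Assumed DDPM-style defaults, not the paper's configuration.
betas = linear_beta_schedule(num_steps=1000)
alpha_bar = cumulative_signal(betas)
print(alpha_bar[0], alpha_bar[-1])
```

The endpoints and step count are exactly the kind of tunable choices the ledger entry refers to; changing them alters how quickly the signal is destroyed during the forward process.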

axioms (1)

Domain assumption: diffusion models can capture complex joint distributions over multi-stem audio content and timbre references.
Invoked implicitly when claiming the joint process models cross-stem harmonic dependencies.

pith-pipeline@v0.9.0 · 5525 in / 1415 out tokens · 58995 ms · 2026-05-12T04:53:55.809457+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Joint Stem Diffusion Transformer... three-stage attention design (Intra-Stem→Cross-Stem→Refinement)... decoupled FiLM conditioning on timestep, content, timbre

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: evaluations on the SATB choral dataset... MixtureTT outperforms single-instrument baselines

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

[1] Introduction. Timbre transfer is a form of music style transfer in which the perceptual identity of an instrument is recast onto the musical content of another, while pitch, rhythm, and articulation are preserved. Modeling timbre is notoriously difficult: as the attribute that distinguishes instruments playing the same note at the same loudness and durat...

[2] Related Work. 2.1. Single-instrument Timbre Transfer. Single-instrument timbre transfer has evolved through three paradigms. Early audio-to-audio translation methods, including WaveNet-based autoregressive decoders with domain confusion losses [3] and CycleGAN pipelines on time-frequency representations [4, 14], demonstrated cross-instrument conversion bu...

[3] Method. MixtureTT decompose each musical signal into a time-varying content space (melody, rhythm, articulation) and a time-invariant timbre space (instrument identity), and recombine them at inference through a joint latent diffusion process over N stems (N = 4 for the SATB choral data used in our experiments). Fig. 2 shows the full pipeline. 3.1. Audio Code...

[4] Experiments. 4.1. Dataset. Experiments are conducted on the tiny partition of CocoChorales [34] (24k/8k/8k train/val/test at 16 kHz), which provides four-part (SATB) chamber renditions across three main ensemble categories and several random combinations, each accompanied by isolated stems and a pre-mixed mixture. Two evaluation settings are considered....

[5] Results. 5.1. Main Results. Table 1 reports objective results on CocoChorales. Across both reconstruction and transfer settings, MixtureTT outperforms the two single-instrument baselines on every metric, despite an asymmetric input condition: the baselines receive ground-truth isolated stems while MixtureTT operates only on the polyphonic mixture. This dir...

[6] Conclusion. This paper presented MixtureTT, a joint latent diffusion system for per-stem timbre transfer that operates directly on polyphonic mixtures. To our knowledge, this is the first system capable of flexibly re-instrumenting individual voices within a mixture without explicit source separation, query audio, or instrument labels. By jointly modeling ...

[7] D. L. Wessel, “Timbre space as a musical control structure,” Computer Music Journal, pp. 45–52, 1979.

[8] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in International Conference on Machine Learning. PMLR, 2017, pp. 1068–1077.

[9] N. Mor, L. Wolf, A. Polyak, and Y. Taigman, “A universal music translation network,” arXiv preprint arXiv:1805.07848, 2018.

[10] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, “Timbretron: A wavenet (cyclegan (cqt (audio))) pipeline for musical timbre transfer,” arXiv preprint arXiv:1811.09620, 2018.

[11] G. Brunner, Y. Wang, R. Wattenhofer, and S. Zhao, “Symbolic music genre transfer with cyclegan,” in 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI). IEEE, 2018, pp. 786–793.

[12] T. Kaneko and H. Kameoka, “Cyclegan-vc: Non-parallel voice conversion using cycle-consistent adversarial networks,” in 2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2100–2104.

[13] O. Cífka, A. Ozerov, U. Şimşekli, and G. Richard, “Self-supervised vq-vae for one-shot music style transfer,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 96–100.

[14] Y. Wu, Y. He, X. Liu, Y. Wang, and R. B. Dannenberg, “Transplayer: Timbre style transfer with flexible timbre control,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[15] Y.-N. Hung, Y.-A. Chen, and Y.-H. Yang, “Learning disentangled representations for timber and pitch in music audio,” arXiv preprint arXiv:1811.03271, 2018.

[16] L. Comanducci, F. Antonacci, and A. Sarti, “Timbre transfer using image-to-image denoising diffusion implicit models,” arXiv preprint arXiv:2307.04586, 2023.

[17] N. Demerlé, P. Esling, G. Doras, and D. Genova, “Combining audio control and style transfer using latent diffusion,” arXiv preprint arXiv:2408.00196, 2024.

[18] M. Mancusi, Y. Halychanskyi, K. W. Cheuk, E. Moliner, C.-H. Lai, S. Uhlich, J. Koo, M. A. Martínez-Ramírez, W.-H. Liao, G. Fabbro et al., “Latent diffusion bridges for unsupervised musical audio timbre transfer,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[19] V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, “Diffusion-based voice conversion with fast maximum likelihood sampling scheme,” arXiv preprint arXiv:2109.13821, 2021.

[20] M. Alinoori and V. Tzerpos, “Music-star: a style translation system for audio-based re-instrumentation,” in ISMIR, 2022, pp. 419–426.

[21] T. Baoueb, X. Bie, H. Janati, and G. Richard, “Wavetransfer: A flexible end-to-end multi-instrument timbre transfer with diffusion,” in 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2024, pp. 1–6.

[22] Y.-J. Luo, K. W. Cheuk, W. Choi, T. Uesaka, K. Toyama, K. Saito, C.-H. Lai, Y. Takida, W.-H. Liao, S. Dixon et al., “Dismix: Disentangling mixtures of musical instruments for source-level pitch and timbre manipulation,” arXiv preprint arXiv:2408.10807, 2024.

[23] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, and E. Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” arXiv preprint arXiv:2302.02257, 2023.

[24] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.

[25] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.

[26] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10684–10695.

[27] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” arXiv preprint arXiv:2301.12503, 2023.

[28] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.

[29] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the design space of diffusion-based generative models,” Advances in Neural Information Processing Systems, vol. 35, pp. 26565–26577, 2022.

[30] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” arXiv preprint arXiv:2207.12598, 2022.

[31] D. Yang, J. Tian, X. Tan, R. Huang, S. Liu, X. Chang, J. Shi, S. Zhao, J. Bian, X. Wu et al., “Uniaudio: An audio foundation model toward universal audio generation,” arXiv preprint arXiv:2310.00704, 2023.

[32] K. Wang, W. Guan, Z. Jiang, H. Huang, P. Chen, W. Wu, Q. Hong, and L. Li, “Discl-vc: Disentangled discrete tokens and in-context learning for controllable zero-shot voice conversion,” in Proc. Interspeech 2025, 2025, pp. 1383–1387.

[33] J. Zhao, X. Wang, and Y. Wang, “Prosody-Adaptable Audio Codecs for Zero-Shot Voice Conversion via In-Context Learning,” in Interspeech 2025, 2025, pp. 4893–4897.

[34] J. Zhao, W. Zeng, T. Lyu, and Y. Wang, “Comelsinger: Discrete token-based zero-shot singing synthesis with structured melody control and guidance,” IEEE Transactions on Audio, Speech and Language Processing, 2026.

[35] A. Caillon and P. Esling, “Rave: A variational autoencoder for fast and high-quality neural audio synthesis,” arXiv preprint arXiv:2111.05011, 2021.

[36] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, pp. 27980–27993, 2023.

[37] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[38] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “Film: Visual reasoning with a general conditioning layer,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.

[39] W. Peebles and S. Xie, “Scalable diffusion models with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4195–4205.

[40] Y. Wu, J. Gardner, E. Manilow, I. Simon, C. Hawthorne, and J. Engel, “The chamber ensemble generator: Limitless high-quality mir data via generative modeling,” arXiv preprint arXiv:2209.14458, 2022.

[41] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.

[42] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv preprint arXiv:1812.08466, 2018.

[43] M. Mauch and S. Dixon, “pyin: A fundamental frequency estimator using probabilistic threshold distributions,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 659–663.

[44] G. Richard, S. Sundaram, and S. Narayanan, “An overview on perceptually motivated audio indexing and classification,” Proceedings of the IEEE, vol. 101, no. 9, pp. 1939–1954, 2013.