pith. machine review for the scientific record.

arxiv: 2604.01929 · v3 · submitted 2026-04-02 · 💻 cs.SD · cs.AI · cs.LG

Recognition: no theorem link

Woosh: A Sound Effects Foundation Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:48 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.LG
keywords sound effects · foundation model · text-to-audio · video-to-audio · audio codec · generative audio · open model · audio generation

The pith

The paper releases Woosh, an open sound effects foundation model whose components match or exceed existing open alternatives on public and private evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Woosh as a publicly released sound effects foundation model. It comprises a high-quality audio encoder-decoder, a text-audio alignment model for conditioning, and text-to-audio and video-to-audio generators, together with distilled fast-inference versions. The release includes inference code and weights, so the audio community can treat the full stack as a shared baseline rather than rebuilding codecs and aligners from scratch each time. Evaluations on both public and private data position each module as competitive with or better than StableAudio-Open and TangoFlux. The specialization to sound effects rather than music or speech is presented as the practical advantage for downstream tasks in media and games.
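As a reading aid, here is a minimal sketch of how such a three-part stack composes at inference time: an aligner embeds the prompt, a latent generator maps noise to an audio latent under that conditioning, and the codec decoder renders a waveform. Every class, shape, and module below is a hypothetical stand-in, not the released Woosh API.

```python
# Hypothetical stand-ins for the three described components; shapes illustrative.
import torch
import torch.nn as nn

LATENT_DIM, LATENT_LEN, EMBED_DIM = 64, 256, 512

class TextAligner(nn.Module):           # stand-in for a CLAP-style text encoder
    def __init__(self, vocab=10_000):
        super().__init__()
        self.emb = nn.Embedding(vocab, EMBED_DIM)
    def forward(self, token_ids):
        return self.emb(token_ids).mean(dim=1)        # (B, EMBED_DIM)

class LatentGenerator(nn.Module):       # stand-in for a flow/diffusion generator
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, LATENT_DIM)
    def forward(self, cond, steps=8):
        z = torch.randn(cond.size(0), LATENT_LEN, LATENT_DIM)
        for _ in range(steps):                        # crude iterative refinement
            z = z + 0.1 * self.proj(cond).unsqueeze(1)
        return z                                      # (B, LATENT_LEN, LATENT_DIM)

class CodecDecoder(nn.Module):          # stand-in for the codec's decoder half
    def __init__(self, hop=320):
        super().__init__()
        self.out = nn.Linear(LATENT_DIM, hop)
    def forward(self, z):
        return self.out(z).flatten(1)                 # (B, LATENT_LEN * hop) samples

aligner, generator, decoder = TextAligner(), LatentGenerator(), CodecDecoder()
tokens = torch.randint(0, 10_000, (1, 12))            # fake tokenized prompt
audio = decoder(generator(aligner(tokens)))
print(audio.shape)                                    # torch.Size([1, 81920])
```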

Core claim

Woosh is a sound effects foundation model consisting of an audio encoder/decoder, a text-audio alignment model, text-to-audio and video-to-audio generative models, and their distilled low-resource counterparts; the complete set is released publicly with weights and inference code, and each component is shown to deliver competitive or superior performance to existing open models on both public and private test sets.
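For orientation on the generator family: flow-style generators of this kind (TangoFlux explicitly, and the Woosh-Flow naming suggests the same lineage) are commonly trained with flow matching, regressing a velocity field along straight noise-to-data paths. A minimal, generic training-loss sketch, not necessarily the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, x1, cond):
    """Conditional flow matching: regress the velocity of a noise->data path."""
    x0 = torch.randn_like(x1)                            # noise endpoint
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)))  # uniform times in [0, 1]
    xt = (1 - t) * x0 + t * x1                           # linear interpolation path
    target_v = x1 - x0                                   # constant velocity of that path
    return F.mse_loss(velocity_model(xt, t, cond), target_v)
```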

What carries the argument

The Woosh suite of models, which pairs a high-quality audio codec with a text-audio alignment module to condition specialized generative networks for sound effects from text or video prompts.
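The alignment module in such a stack is typically trained contrastively on paired audio and text, as Figure 3's positive-pair diagram suggests. A minimal sketch of the symmetric InfoNCE objective used by CLAP-style aligners, assuming precomputed embeddings; this is the generic recipe, not necessarily Woosh-CLAP's exact loss:

```python
import torch
import torch.nn.functional as F

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (audio, text) positive pairs."""
    a = F.normalize(audio_emb, dim=-1)                # (B, D)
    t = F.normalize(text_emb, dim=-1)                 # (B, D)
    logits = a @ t.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0))                 # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = clap_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```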

Load-bearing premise

The private-data comparisons assume that evaluation conditions, training data, and compute resources were equivalent to those used for the open baseline models without undisclosed advantages.

What would settle it

A fully public benchmark in which all compared models are retrained or evaluated under identical, standardized conditions; the performance claim would be falsified if Woosh fell below the open baselines on quantitative metrics or listening tests.
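Concretely, such a benchmark would pin every uncontrolled variable: shared prompts, shared seeds, identical sampling settings, and a single metric implementation for all systems. A minimal harness sketch; the model wrappers and the metric (e.g., an FAD implementation) are assumed to exist, and all names are hypothetical:

```python
import torch

def evaluate_identically(models, prompts, metric, seed=0, **sampling_kwargs):
    """Score every model on the same prompts, seed, and sampling settings."""
    scores = {}
    for name, generate in models.items():             # generate: prompts -> audio batch
        torch.manual_seed(seed)                       # identical noise draws per model
        audio = generate(prompts, **sampling_kwargs)  # identical guidance, steps, ...
        scores[name] = metric(audio)                  # one metric implementation,
    return scores                                     # e.g. FAD vs. one reference set

# usage sketch, all names hypothetical:
# scores = evaluate_identically({"woosh": woosh_fn, "tangoflux": tangoflux_fn},
#                               prompts, fad_metric, guidance_scale=3.0, num_steps=50)
```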

Figures

Figures reproduced from arXiv: 2604.01929 by Alexandre Bittar, Benno Weck, Gaëtan Hadjeres, Hakim Missoum, Joan Serrà, Khaled Koutini, Marc Ferras, Thomas Hummel, Yuki Mitsufuji, Zineb Lahrichi.

Figure 1. Inference-time layout of the Woosh-Flow (left) and Woosh-VFlow (right) models for text-to-audio and …
Figure 2. VOCOS decoder architecture as a cascade of ConvNeXt blocks, used in Woosh-AE.
Figure 3. Woosh-CLAP training block diagram for a positive pair of samples. Only the text encoder is used at …
Figure 4. Multimodal transformer stack in the Woosh-Flow diffusion model, formed by MultiStream (MS) and Sin…
Figure 5. MultiStream transformer block diagram (left). Both self-attention and feed-forward network (FFN) outputs …
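Figure 2 describes the Woosh-AE decoder as a Vocos-style cascade of ConvNeXt blocks. For orientation, a 1-D ConvNeXt-style block is a depthwise convolution followed by LayerNorm and a pointwise MLP with a residual connection; the sketch below uses illustrative dimensions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock1d(nn.Module):
    """Depthwise conv -> LayerNorm -> pointwise MLP -> residual, on (B, C, T)."""
    def __init__(self, dim=384, expansion=3, kernel_size=7):
        super().__init__()
        self.dwconv = nn.Conv1d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.pwconv2 = nn.Linear(expansion * dim, dim)
        self.act = nn.GELU()

    def forward(self, x):                   # x: (B, C, T)
        h = self.dwconv(x).transpose(1, 2)  # (B, T, C) for LayerNorm / MLP
        h = self.pwconv2(self.act(self.pwconv1(self.norm(h))))
        return x + h.transpose(1, 2)        # residual back in (B, C, T)

blocks = nn.Sequential(*[ConvNeXtBlock1d() for _ in range(4)])  # a small cascade
y = blocks(torch.randn(2, 384, 100))        # shape preserved: (2, 384, 100)
```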
original abstract

The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Woosh, Sony AI's publicly released sound effects foundation model. It comprises (1) a high-quality audio encoder/decoder, (2) a text-audio alignment model, (3) text-to-audio and (4) video-to-audio generative models, plus distilled low-resource variants. The authors detail the architecture and training process and report that evaluations on both public and private data show competitive or better performance for each module relative to open baselines such as StableAudio-Open and TangoFlux. Model weights and inference code are released at the provided GitHub link.

Significance. If the performance claims hold, the work supplies the audio community with a specialized open foundation model for sound-effect generation that includes video conditioning, a capability not uniformly present in the cited baselines. The public release of weights, inference code, and demo samples directly supports reproducibility and downstream use as a baseline or building block.

major comments (1)
  1. [Evaluation] Evaluation section: the headline claim of 'competitive or better performance for each module' on private internal data is load-bearing for the overall contribution, yet the manuscript provides no details on test-set composition, prompt distributions, sampling hyperparameters, or confirmation that the baseline models (StableAudio-Open, TangoFlux) were evaluated under identical conditioning, compute budgets, and post-processing. This prevents independent verification of equivalence and directly affects the strength of the empirical comparison.
minor comments (1)
  1. [Abstract] The abstract states that architecture and training details are provided, but explicit forward references to the relevant sections or tables (e.g., model hyperparameters, training curves) would improve readability for readers who wish to reproduce or extend the work.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their positive summary of the work, recognition of its significance for the audio community, and recommendation for minor revision. We appreciate the constructive feedback on the evaluation section and address it point-by-point below. We will incorporate the suggested clarifications in the revised manuscript.

point-by-point responses
  1. Referee: Evaluation section: the headline claim of 'competitive or better performance for each module' on private internal data is load-bearing for the overall contribution, yet the manuscript provides no details on test-set composition, prompt distributions, sampling hyperparameters, or confirmation that the baseline models (StableAudio-Open, TangoFlux) were evaluated under identical conditioning, compute budgets, and post-processing. This prevents independent verification of equivalence and directly affects the strength of the empirical comparison.

    Authors: We agree that additional protocol details would strengthen the manuscript and improve reproducibility. In the revised version we will expand the Evaluation section with: (i) explicit descriptions of the public test sets, including sample counts, category distributions, and prompt characteristics; (ii) the exact sampling hyperparameters (guidance scale, diffusion steps, etc.) used for all models; and (iii) a clear statement that baselines were evaluated on identical conditioning inputs, under matched compute budgets, and with the same post-processing pipeline. For the private internal data we will add high-level statistics (number of clips, broad sound-effect categories, and rationale for internal evaluation) while noting that exact prompts cannot be released for confidentiality reasons. These changes will allow readers to assess the fairness of the comparisons without compromising proprietary data.

    revision: yes
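The settings named in this response (guidance scale, number of steps) are exactly what a classifier-free-guidance sampler exposes at inference time. A minimal sketch of a guided Euler sampling loop for a flow-style generator, showing what "identical sampling hyperparameters" would pin down; the velocity_model interface and the default values are hypothetical:

```python
import torch

@torch.no_grad()
def cfg_euler_sample(velocity_model, cond, shape, num_steps=50, guidance_scale=3.0):
    """Euler integration of a flow/ODE sampler with classifier-free guidance."""
    x = torch.randn(shape)                          # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt)
        v_cond = velocity_model(x, t, cond)         # conditional velocity
        v_uncond = velocity_model(x, t, None)       # unconditional (dropped cond)
        v = v_uncond + guidance_scale * (v_cond - v_uncond)
        x = x + dt * v                              # Euler step toward data
    return x
```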

standing simulated objections not resolved
  • Exact prompts and full composition of the private internal test set cannot be disclosed due to Sony AI data confidentiality policies.

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baseline comparisons

full rationale

The paper presents an empirical release of the Woosh sound-effects models (encoder/decoder, text-audio aligner, text-to-audio and video-to-audio generators) together with training details and benchmark numbers. All load-bearing claims are performance comparisons against independent external models (StableAudio-Open, TangoFlux) on public data plus internal private data. No equations, first-principles derivations, or fitted parameters are presented whose outputs are then re-labeled as predictions; the reported metrics are direct empirical measurements, not quantities defined by the authors' own fitting procedure. Self-citations, if any, are not load-bearing for the central performance statements. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is an empirical model-release paper whose claims rest on standard deep-learning training assumptions and the availability of large audio datasets; no new mathematical axioms or invented physical entities are introduced.

free parameters (1)
  • model architecture hyperparameters
    Standard neural network sizes, learning rates, and conditioning strengths typical of foundation models; not enumerated in abstract.
axioms (1)
  • standard math: Standard assumptions of deep generative modeling (e.g., latent variable models can capture audio distributions)
    Invoked implicitly by the use of encoder/decoder and alignment models.

pith-pipeline@v0.9.0 · 5508 in / 1285 out tokens · 40582 ms · 2026-05-13T20:48:00.461026+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 11 internal anchors

  1. [1]

    Andrea Agostinelli et al. MusicLM: Generating Music From Text. 2023. arXiv: 2301.11325 [cs.SD]. URL: https://arxiv.org/abs/2301.11325

  2. [2]

    Zach Evans et al. Fast Timing-Conditioned Latent Audio Diffusion. 2024. arXiv: 2402.04825 [cs.SD]. URL: https://arxiv.org/abs/2402.04825

  3. [3]

    Zach Evans et al. Long-form music generation with latent diffusion. 2024. arXiv: 2404.10301 [cs.SD]. URL: https://arxiv.org/abs/2404.10301

  4. [4]

    Chia-Yu Hung et al. TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization. 2025. arXiv: 2412.21037 [cs.SD]. URL: https://arxiv.org/abs/2412.21037

  5. [5]

    Felix Kreuk et al. AudioGen: Textually Guided Audio Generation. 2023. arXiv: 2209.15352 [cs.SD]. URL: https://arxiv.org/abs/2209.15352

  6. [6]

    Haohe Liu et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. 2023. arXiv: 2301.12503 [cs.SD]. URL: https://arxiv.org/abs/2301.12503

  7. [7]

    Haohe Liu et al. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. 2024. arXiv: 2308.05734 [cs.SD]. URL: https://arxiv.org/abs/2308.05734

  8. [8]

    Zach Evans et al. Stable Audio Open. 2024. arXiv: 2407.14358 [cs.SD]. URL: https://arxiv.org/abs/2407.14358

  9. [9]

    Jade Copet et al. Simple and Controllable Music Generation. 2024. arXiv: 2306.05284 [cs.SD]. URL: https://arxiv.org/abs/2306.05284

  10. [10]

    Black Forest Labs et al. FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space. 2025. arXiv: 2506.15742 [cs.GR]. URL: https://arxiv.org/abs/2506.15742

  11. [11]

    Hubert Siuzdak. Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis. 2024. arXiv: 2306.00814 [cs.SD]. URL: https://arxiv.org/abs/2306.00814

  12. [12]

    Alexandre Défossez et al. High Fidelity Neural Audio Compression. 2022. arXiv: 2210.13438 [eess.AS]. URL: https://arxiv.org/abs/2210.13438

  13. [13]

    Rithesh Kumar et al. High-Fidelity Audio Compression with Improved RVQGAN. 2023. arXiv: 2306.06546 [cs.SD]. URL: https://arxiv.org/abs/2306.06546

  14. [14]

    Upsampling Artifacts in Neural Audio Synthesis

    Jordi Pons et al. “Upsampling Artifacts in Neural Audio Synthesis”. In: Proc. of the IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). 2021, pp. 3005–3009. URL: https://arxiv.org/abs/2010.14356

  15. [15]

    Freesound Technical Demo

    Frederic Font, Gerard Roma, and Xavier Serra. “Freesound Technical Demo”. In: MM ’13: Proceedings of the 21st ACM international conference on Multimedia. 2013, pp. 411–412

  16. [16]

    Audio Set: An ontology and human-labeled dataset for audio events

    Jort F. Gemmeke et al. “Audio Set: An ontology and human-labeled dataset for audio events”. In: Proc. IEEE ICASSP 2017. New Orleans, LA, 2017

  17. [17]

    AudioCaps: Generating Captions for Audios in The Wild

    Chris Dongjoo Kim et al. “AudioCaps: Generating Captions for Audios in The Wild”. In: NAACL-HLT. 2019

  18. [18]

    WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research

    Xinhao Mei et al. “WavCaps: A ChatGPT-Assisted Weakly-Labelled Audio Captioning Dataset for Audio-Language Multimodal Research”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 32 (2024), pp. 3339–3354. ISSN: 2329-9304. DOI: 10.1109/taslp.2024.3419446. URL: http://dx.doi.org/10.1109/TASLP.2024.3419446

  19. [19]

    CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92)

    Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92). 2019. URL: https://datashare.ed.ac.uk/handle/10283/2950

  20. [20]

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. 2020. arXiv: 2010.05646 [cs.SD]. URL: https://arxiv.org/abs/2010.05646

  21. [21]

    Neil Zeghidour et al. SoundStream: An End-to-End Neural Audio Codec. 2021. arXiv: 2107.03312 [cs.SD]. URL: https://arxiv.org/abs/2107.03312

  22. [22]

    Xudong Mao et al. Least Squares Generative Adversarial Networks. 2017. arXiv: 1611.04076 [cs.CV]. URL: https://arxiv.org/abs/1611.04076

  23. [23]

    Benjamin Elizalde et al. CLAP: Learning Audio Concepts From Natural Language Supervision. 2022. arXiv: 2206.04769 [cs.SD]. URL: https://arxiv.org/abs/2206.04769

  24. [24]

    Yinhan Liu et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. arXiv: 1907.11692 [cs.CL]. URL: https://arxiv.org/abs/1907.11692

  25. [25]

    Efficient Training of Audio Transformers with Patchout

    Khaled Koutini et al. “Efficient Training of Audio Transformers with Patchout”. In: 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Korea, September 18-22, 2022. Ed. by Hanseok Ko and John H. L. Hansen. ISCA, 2022, pp. 2753–2757. DOI: 10.21437/INTERSPEECH.2022-227. URL: https://doi.org/10.21437/Interspeech.2022-227

  26. [26]

    VoxCeleb: A Large-Scale Speaker Identification Dataset

    Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. “VoxCeleb: A Large-Scale Speaker Identification Dataset”. In: Interspeech 2017

  27. [27]

    Alexey Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021. arXiv: 2010.11929 [cs.CV]. URL: https://arxiv.org/abs/2010.11929

  28. [28]

    Alec Radford et al. Learning Transferable Visual Models From Natural Language Supervision. 2021. arXiv: 2103.00020 [cs.CV]. URL: https://arxiv.org/abs/2103.00020

  29. [29]

    Yusong Wu et al. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. 2024. arXiv: 2211.06687 [cs.SD]. URL: https://arxiv.org/abs/2211.06687

  30. [30]

    Ke Chen et al. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. 2022. arXiv: 2202.00874 [cs.SD]. URL: https://arxiv.org/abs/2202.00874

  31. [31]

    Yaron Lipman et al. Flow Matching for Generative Modeling. 2023. arXiv: 2210.02747 [cs.LG]. URL: https://arxiv.org/abs/2210.02747

  32. [32]

    Jianlin Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2023. arXiv: 2104.09864 [cs.CL]. URL: https://arxiv.org/abs/2104.09864

  33. [33]

    Flow Matching Guide and Code

    Yaron Lipman et al. “Flow Matching Guide and Code”. In: arXiv preprint (2024). arXiv: 2412.06264 [cs.LG]

  34. [34]

    Mean flows for one-step generative modeling

    Zhengyang Geng et al. “Mean flows for one-step generative modeling”. In: arXiv preprint (2025). arXiv: 2505.13447 [cs.LG]

  35. [35]

    Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. 2022. arXiv: 2207.12598 [cs.LG]. URL: https://arxiv.org/abs/2207.12598

  36. [36]

    Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

    Axel Sauer et al. “Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation”. In: SIGGRAPH Asia 2024 Conference Papers. 2024, pp. 1–11. DOI: 10.1145/3680528.3687625

  37. [37]

    Jae Hyun Lim and Jong Chul Ye. Geometric GAN. 2017. arXiv: 1705.02894 [stat.ML]. URL: https://arxiv.org/abs/1705.02894

  38. [38]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen et al. “Sana-sprint: One-step diffusion with continuous-time consistency distillation”. In: arXiv preprint (2025). arXiv: 2503.09641 [cs.GR]

  39. [39]

    Full-Band General Audio Synthesis with Score-Based Diffusion

    Santiago Pascual et al. “Full-Band General Audio Synthesis with Score-Based Diffusion”. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP). 2023. DOI: 10.1109/ICASSP49357.2023.10096760

  40. [40]

    Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

    Yusong Wu et al. “Large-Scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation”. In: IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023. IEEE, 2023, pp. 1–5. DOI: 10.1109/ICASSP49357.2023.10095969. URL: https://doi.org/10.1109/ICASSP49357.2023.10095969

  41. [41]

    Vladimir Iashin et al. Synchformer: Efficient Synchronization from Sparse Cues. 2024. arXiv: 2401.16423 [cs.CV]. URL: https://arxiv.org/abs/2401.16423

  42. [42]

    Qwen3-Omni Technical Report

    Jin Xu et al. “Qwen3-Omni Technical Report”. In: CoRR abs/2509.17765 (2025). DOI: 10.48550/arXiv.2509.17765. arXiv: 2509.17765. URL: https://doi.org/10.48550/arXiv.2509.17765

  43. [43]

    Vggsound: A Large-Scale Audio-Visual Dataset

    Honglie Chen et al. “Vggsound: A Large-Scale Audio-Visual Dataset”. In: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2020, Barcelona, Spain, May 4-8, 2020. IEEE, 2020, pp. 721–725. DOI: 10.1109/ICASSP40776.2020.9053174. URL: https://doi.org/10.1109/ICASSP40776.2020.9053174

  44. [44]

    GameGen-X: Interactive Open-world Game Video Generation

    Haoxuan Che et al. “GameGen-X: Interactive Open-world Game Video Generation”. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025. URL: https://openreview.net/forum?id=8VG8tpPZhe

  45. [45]

    SoundReactor: Frame-level Online Video-to-Audio Generation

    Koichi Saito et al. “SoundReactor: Frame-level Online Video-to-Audio Generation”. In: arXiv abs/2510.02110 (2025). URL: https://api.semanticscholar.org/CorpusID:281725129

  46. [46]

    FoleyBench: A Benchmark For Video-to-Audio Models

    Satvik Dixit et al. “FoleyBench: A Benchmark For Video-to-Audio Models”. In: arXiv abs/2511.13219 (2025). URL: https://api.semanticscholar.org/CorpusID:283072409

  47. [47]

    ImageBind: One Embedding Space to Bind Them All

    Rohit Girdhar et al. “ImageBind: One Embedding Space to Bind Them All”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023. IEEE, 2023, pp. 15180–15190. DOI: 10.1109/CVPR52729.2023.01457. URL: https://doi.org/10.1109/CVPR52729.2023.01457

  48. [48]

    MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

    Ho Kei Cheng et al. “MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis”. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025. Computer Vision Foundation / IEEE, 2025, pp. 28901–28911. DOI: 10.1109/CVPR52734.2025.02691

  49. [49]

    Nataniel Ruiz et al. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. arXiv: 2208.12242 [cs.CV]. URL: https://arxiv.org/abs/2208.12242

  51. [51]

    Daniel Garibi et al. TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space. 2025. arXiv: 2501.12224 [cs.CV]. URL: https://arxiv.org/abs/2501.12224