MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation
Pith reviewed 2026-06-27 14:50 UTC · model grok-4.3
The pith
MeCo maps any discriminative multi-channel speech separation estimate onto the clean speech manifold in one MeanFlow step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. Data-Space Optimization integrates an x_r-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity.
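The Endpoint SI-SDR term mentioned above builds on the standard scale-invariant SDR definition. As a rough illustration only, the NumPy sketch below computes SI-SDR for single-channel 1-D signals and negates it to form a loss. The function names `si_sdr` and `endpoint_si_sdr_loss`, the `eps` stabilizer, and the example signals are assumptions made for illustration; the paper's exact formulation, and how it is weighted against the x_r-loss, are not given in this summary.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two 1-D signals (standard definition)."""
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    # Everything that is not explained by the scaled reference counts as error.
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

def endpoint_si_sdr_loss(estimate, reference):
    """Hypothetical loss: negative SI-SDR of the final (endpoint) output."""
    return -si_sdr(estimate, reference)

# Illustrative signals, not data from the paper.
reference = np.array([1.0, 1.0, 0.0, 0.0])
estimate = np.array([1.0, 1.0, 0.2, 0.0])
print(f"SI-SDR: {si_sdr(estimate, reference):.2f} dB")  # roughly 16.99 dB
```

Because the target is rescaled by `alpha`, multiplying the estimate by any nonzero constant leaves the score essentially unchanged, which is the property that makes the metric scale-invariant.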
What carries the argument
The MeanFlow conditional average velocity field, which performs the direct one-step mapping from discriminative estimate to clean speech.
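To illustrate why an average velocity field permits a one-step mapping, the toy sketch below uses a simple ODE whose average velocity is known in closed form. It is not MeCo: in the paper the field is a learned, conditioned network, whereas here `DECAY`, `instantaneous_velocity`, `average_velocity`, and `euler_integrate` are hypothetical names for a hand-built example. The time convention (estimate at time 0, endpoint at time 1) is also an assumption.

```python
import numpy as np

DECAY = 3.0  # hypothetical rate of the toy flow; not a MeCo parameter

def instantaneous_velocity(x, clean):
    """Toy velocity field v(x) = DECAY * (clean - x), pulling x toward `clean`."""
    return DECAY * (clean - x)

def average_velocity(x_t, clean, t, r):
    """Closed-form average of v along the toy trajectory over [t, r], r > t."""
    span = r - t
    return (clean - x_t) * (1.0 - np.exp(-DECAY * span)) / span

def euler_integrate(x_start, clean, steps):
    """Integrate the toy ODE from time 0 to time 1 with `steps` Euler steps."""
    x = np.array(x_start, dtype=np.float64)
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * instantaneous_velocity(x, clean)
    return x

clean = np.array([1.0, -0.5, 0.25])    # stand-in for clean speech
estimate = np.array([0.6, -0.1, 0.0])  # stand-in for a discriminative estimate

# One step with the average velocity: x_1 = x_0 + (1 - 0) * u(x_0, 0, 1).
one_step = estimate + average_velocity(estimate, clean, 0.0, 1.0)
# Exact endpoint of the toy ODE at time 1, for comparison.
exact = clean + (estimate - clean) * np.exp(-DECAY)
many_euler = euler_integrate(estimate, clean, 1000)
one_euler = euler_integrate(estimate, clean, 1)

print("one-step average-velocity error:", np.abs(one_step - exact).max())
print("1000-step Euler error:", np.abs(many_euler - exact).max())
print("1-step Euler error:", np.abs(one_euler - exact).max())
```

In this toy setting the single average-velocity step matches the trajectory endpoint, while a single Euler step on the instantaneous velocity overshoots badly. Whether a learned field stays this accurate away from the training data is exactly the premise the review questions.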
If this is right
- State-of-the-art signal fidelity is achieved with only minimal added computation.
- Human listening quality improves simultaneously with reference metrics.
- The gains hold for both in-domain and out-of-domain test conditions.
- One-step generation replaces multi-step sampling while retaining generative benefits.
Where Pith is reading between the lines
- The single-step design could support lower-latency real-time speech separation systems.
- Data-Space Optimization may transfer to other audio tasks where perceptual quality must be balanced against reference metrics.
- MeanFlow velocity fields might serve as lightweight correctors for other discriminative audio models beyond separation.
Load-bearing premise
A single step of the learned conditional average velocity field is sufficient to map any discriminative estimate directly onto the clean speech manifold.
What would settle it
A controlled listening test in which MeCo outputs receive no higher perceptual ratings than the uncorrected outputs of the underlying discriminative separator.
Original abstract
While discriminative models for multi-channel speech separation excel in reference-based metrics, they often exhibit suboptimal human listening quality. To address this, we propose a novel MeanFlow-based one-step generative corrector (MeCo). MeCo learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduce Data-Space Optimization (DSO). DSO integrates an $\mathbf{x}_r$-loss, which penalizes prediction errors on longer displacement intervals to serve as a generative objective for human listening quality, with an Endpoint SI-SDR loss that directly optimizes terminal signal fidelity. Experiments demonstrate that MeCo achieves state-of-the-art (SOTA) performance with minimal computational overhead, simultaneously achieving superior signal fidelity and human listening quality in both in-domain and out-of-domain scenarios.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MeCo, a MeanFlow-based one-step generative corrector for multi-channel speech separation. It learns a conditional average velocity field to map discriminative estimates directly onto the clean speech manifold in a single step. Data-Space Optimization (DSO) is introduced, combining an x_r-loss (penalizing errors on longer displacement intervals) with an Endpoint SI-SDR loss to optimize for human listening quality alongside signal fidelity. The authors report SOTA performance with minimal overhead, along with superior signal fidelity and listening quality in both in-domain and out-of-domain scenarios.
Significance. If the one-step correction holds, MeCo would offer an efficient post-processing layer that improves perceptual quality of existing discriminative separators without substantial compute, addressing a known gap between reference metrics and human listening in multi-channel separation.
major comments (2)
- [Abstract] The central claim that a single Euler integration of the learned conditional average velocity field suffices to reach the clean-speech manifold from any discriminative estimate (including distant out-of-domain cases) lacks supporting derivation or guarantee; the construction of DSO, x_r-loss, and Endpoint SI-SDR does not by itself ensure the learned field remains accurate far from the data manifold or that one step avoids audible artifacts.
- [Abstract] The assertion of simultaneous SOTA signal fidelity and human listening quality in out-of-domain scenarios rests on the unverified premise that the one-step trajectory lands inside the manifold; no independent check (e.g., manifold-distance metric or artifact analysis) is described to confirm this for estimates lying far from training data.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and indicate planned revisions.
-
Referee: [Abstract] The central claim that a single Euler integration of the learned conditional average velocity field suffices to reach the clean-speech manifold from any discriminative estimate (including distant out-of-domain cases) lacks supporting derivation or guarantee; the construction of DSO, x_r-loss, and Endpoint SI-SDR does not by itself ensure the learned field remains accurate far from the data manifold or that one step avoids audible artifacts.
Authors: We agree that no formal derivation or theoretical guarantee is provided for one-step convergence to the manifold, particularly for distant out-of-domain estimates. DSO is an empirical training strategy. In revision we will soften the abstract language to emphasize the empirical nature of the claim and add a short discussion subsection on the one-step assumption, supported by additional out-of-domain artifact analysis. revision: yes
-
Referee: [Abstract] The assertion of simultaneous SOTA signal fidelity and human listening quality in out-of-domain scenarios rests on the unverified premise that the one-step trajectory lands inside the manifold; no independent check (e.g., manifold-distance metric or artifact analysis) is described to confirm this for estimates lying far from training data.
Authors: The manuscript currently relies on SI-SDR and listening-quality metrics as proxies. We acknowledge the lack of an explicit manifold-distance metric or dedicated artifact analysis. We will add a new analysis subsection containing qualitative artifact examples and a simple embedding-based distance check for out-of-domain cases. revision: yes
Circularity Check
No significant circularity; derivation is self-contained
The provided abstract and context describe MeCo as learning a conditional average velocity field from data to perform a one-step mapping, optimized via the introduced DSO combining x_r-loss and Endpoint SI-SDR loss. No equations, self-citations, or load-bearing steps are shown that reduce a claimed prediction or result to its own inputs by construction. The method is presented as data-driven empirical learning rather than self-definitional or fitted-input renaming, making the derivation independent of the target claims.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
MeCo: One-Step MeanFlow-based Corrector for Multi-Channel Speech Separation
Introduction. Deep discriminative models have significantly advanced multi-channel speech enhancement and separation. Modern architectures [1–4], readily adaptable across joint denoising, dereverberation, and speech separation, have achieved saturated performance on reference-based metrics. However, these models are primarily trained to optimize obj...
2026
-
[2]
Flow Matching. Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distribution $p_0$ and a complex data distribution $p_1$
Background. 2.1. Flow Matching. Flow Matching (FM) [11] is a generative framework that learns to construct a flow path between a simple prior distribution $p_0$ and a complex data distribution $p_1$. Formally, given a prior sample $x_0 \sim p_0$ and a data sample $x_1 \sim p_1$, a state $x_t$ along the flow path at time $t \in [0,1]$ can be explicitly constructed using predefined schedu...
-
[3]
MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maximize one-step generation performance (Section 3.2)
Method. We introduce MeCo, a one-step generative corrector for multi-channel speech separation. MeCo incorporates a conditional MeanFlow-based architecture (Section 3.1) and DSO to maximize one-step generation performance (Section 3.2). 3.1. Conditional MeanFlow-based correction. The proposed corrector operates in the complex Short-Time Fourier Transform...
-
[4]
Datasets. To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets
Experiments. 4.1. Datasets. To evaluate the proposed MeCo, we constructed multi-channel noisy and reverberant datasets. For the in-domain training and test sets, we used clean speech from the WSJ0 corpus mixed with noise from WHAM! [30]. To assess the model’s generalization capabilities, we constructed two separate out-of-domain evaluation sets. The first...
-
[5]
By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step
Conclusion. We proposed MeCo, the first one-step generative corrector for multi-channel speech separation. By leveraging Mean Flows, MeCo effectively maps discriminative estimates directly onto the clean speech manifold in a single step. To maximize one-step generation performance, we introduced DSO, which incorporates an $\mathbf{x}_r$-loss and an Endpoint SI-SDR ...
-
[6]
RS-2024-00337945), STEAM research grant (No
Acknowledgements. This work was supported by the National Research Foundation of Korea (NRF) grant (No. RS-2024-00337945), STEAM research grant (No. RS-2024-00464269) funded by the Ministry of Science and ICT of Korea government (MSIT), and the BK21 FOUR program through the NRF grant funded by the Ministry of Education of Korea government (MOE)
2024
-
[7]
Generative AI Use Disclosure. Generative AI tools were used to edit and polish the manuscript, improving readability and refining the experimental code
-
[8]
TF-GridNet: Integrating full- and sub-band modeling for speech separation,
Z.-Q. Wang, S. Cornell, S. Choi, Y. Lee, B.-Y. Kim, and S. Watanabe, “TF-GridNet: Integrating full- and sub-band modeling for speech separation,” TASLP, vol. 31, pp. 3221–3236, 2023
2023
-
[9]
SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,
C. Quan and X. Li, “SpatialNet: Extensively learning spatial information for multichannel joint speech separation, denoising and dereverberation,” TASLP, vol. 32, pp. 1310–1323, 2024
2024
-
[10]
TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single- and multi-channel speaker separation,
V. A. Kalkhorani and D. Wang, “TF-CrossNet: Leveraging global, cross-band, narrow-band, and positional encoding for single- and multi-channel speaker separation,” TASLP, vol. 32, pp. 4999–5009, 2024
2024
-
[11]
DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,
D. Lee and J.-W. Choi, “DeFTAN-II: Efficient multichannel speech enhancement with subgroup processing,” TASLP, vol. 32, pp. 4850–4866, 2024
2024
-
[12]
SDR – half-baked or well done?
J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?” in Proc. ICASSP, 2019
2019
-
[13]
Universal speech enhancement with score-based diffusion,
J. Serrà, S. Pascual, J. Pons, R. O. Araz, and D. Scaini, “Universal speech enhancement with score-based diffusion,” in Proc. ICLR, 2023
2023
-
[14]
Speech enhancement and dereverberation with diffusion-based generative models,
J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” TASLP, vol. 31, pp. 2351–2364, 2023
2023
-
[15]
DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,
C. K. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2022
2022
-
[16]
UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,
T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022
2022
-
[17]
Score-based generative modeling through stochastic differential equations,
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in Proc. ICLR, 2021
2021
-
[18]
Flow matching for generative modeling,
Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le, “Flow matching for generative modeling,” in Proc. ICLR, 2023
2023
-
[19]
Conditional diffusion probabilistic model for speech enhancement,
Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in Proc. ICASSP, 2022
2022
-
[20]
StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,
J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” TASLP, vol. 31, pp. 2724–2737, 2023
2023
-
[21]
Diffusion-based generative speech source separation,
R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in Proc. ICASSP, 2023
2023
-
[22]
Generative pre-training for speech with flow matching,
A. H. Liu, M. Le, A. Vyas, B. Shi, A. Tjandra, and W.-N. Hsu, “Generative pre-training for speech with flow matching,” in Proc. ICLR, 2024
2024
-
[23]
EDSep: An effective diffusion-based method for speech source separation,
J. Dong, X. Wang, and Q. Mao, “EDSep: An effective diffusion-based method for speech source separation,” in Proc. ICASSP, 2025
2025
-
[24]
Source separation by flow matching,
R. Scheibler, J. R. Hershey, A. Doucet, and H. Li, “Source separation by flow matching,” in Proc. WASPAA, 2025
2025
-
[25]
DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and dereverberation,
R. Kimura, T. Ueda, T. Nakatani, N. Kamo, M. Delcroix, S. Araki, and S. Makino, “DiffCBF: A diffusion model with convolutional beamformer for joint speech separation, denoising, and dereverberation,” in Proc. EUSIPCO, 2025
2025
-
[26]
ArrayDPS: Unsupervised blind speech separation with a diffusion prior,
Z. Xu, X. Fan, Z.-Q. Wang, X. Jiang, and R. R. Choudhury, “ArrayDPS: Unsupervised blind speech separation with a diffusion prior,” in Proc. ICML, 2025
2025
-
[27]
Diffiner: A versatile diffusion-based generative refiner for speech enhancement,
R. Sawata, N. Murata, Y. Takida, T. Uesaka, T. Shibuya, S. Takahashi, and Y. Mitsufuji, “Diffiner: A versatile diffusion-based generative refiner for speech enhancement,” in Proc. Interspeech, 2023
2023
-
[28]
Separate and diffuse: Using a pretrained diffusion model for improving source separation,
S. Lutati, E. Nachmani, and L. Wolf, “Separate and diffuse: Using a pretrained diffusion model for improving source separation,” in Proc. ICLR, 2024
2024
-
[29]
Noise-robust speech separation with fast generative correction,
H. Wang, J. Villalba, L. Moro-Velazquez, J. Hai, T. Thebaud, and N. Dehak, “Noise-robust speech separation with fast generative correction,” in Proc. Interspeech, 2024
2024
-
[30]
SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,
S. Li, S. Wang, Z. Liu, Z. Jiang, Y. Wang, and H. Li, “SpeechRefiner: Towards perceptual quality refinement for front-end algorithms,” in Proc. Interspeech, 2025
2025
-
[31]
Mean flows for one-step generative modeling,
Z. Geng, M. Deng, X. Bai, J. Z. Kolter, and K. He, “Mean flows for one-step generative modeling,” in Proc. NeurIPS, 2025
2025
-
[32]
Back to Basics: Let Denoising Generative Models Denoise
T. Li and K. He, “Back to basics: Let denoising generative models denoise,” arXiv preprint arXiv:2511.13720, 2025
2025
-
[33]
MeanFlowSE: one-step generative speech enhancement via conditional mean flow,
D. Li, S. Lu, H. Pan, Z. Zhan, Q. Hong, and L. Li, “MeanFlowSE: one-step generative speech enhancement via conditional mean flow,” in Proc. ICASSP, 2026
2026
-
[34]
MeanSE: Efficient generative speech enhancement with mean flows,
J. Wang, H. Wang, W. Wang, L. Yang, C. Li, W. Zhang, L. Tan, and Y. Qian, “MeanSE: Efficient generative speech enhancement with mean flows,” in Proc. ICASSP, 2026
2026
-
[35]
FlowSE: Flow matching-based speech enhancement,
S. Lee, S. Cheong, S. Han, and J. W. Shin, “FlowSE: Flow matching-based speech enhancement,” in Proc. ICASSP, 2025
2025
-
[36]
A step-by-step process for building TTS voices using open source data and frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese
K. Sodimana, P. De Silva, S. Sarin, O. Kjartansson, M. Jansche, K. Pipatsrisawat, and L. Ha, “A step-by-step process for building TTS voices using open source data and frameworks for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese,” in Proc. SLTU, 2018
2018
-
[37]
WHAM!: Extending speech separation to noisy environments,
G. Wichern, J. Antognini, M. Flynn, L. R. Zhu, E. McQuinn, D. Crow, E. Manilow, and J. Le Roux, “WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, 2019
2019
-
[38]
Librispeech: An ASR corpus based on public domain audio books,
V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015
2015
-
[39]
The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,
J. Thiemann, N. Ito, and E. Vincent, “The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings,” in Proc. of Meetings on Acoustics, 2013
2013
-
[40]
gpuRIR: A Python library for room impulse response simulation with GPU acceleration,
D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A Python library for room impulse response simulation with GPU acceleration,” Multimedia Tools and Applications, vol. 80, pp. 5653–5671, 2021
2021
-
[41]
Adam: A method for stochastic optimization,
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015
2015
-
[42]
SA-SDR: A novel loss function for separation of meeting style data,
T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “SA-SDR: A novel loss function for separation of meeting style data,” in Proc. ICASSP, 2022
2022
-
[43]
Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,
A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, 2001
2001
-
[44]
An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,
J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” TASLP, vol. 24, no. 11, pp. 2009–2022, 2016
2016
-
[45]
NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,
G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021
2021