Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters
Pith reviewed 2026-05-19 23:34 UTC · model grok-4.3
The pith
Geometry conditioning lets a spatially selective filter generalize target speaker extraction across different microphone array shapes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The GC-SSF adds a geometry-conditioning branch built from FiLM layers that receives a DOA-MPE feature encoding both the target direction of arrival and the microphone positions; this branch modulates the intermediate feature maps inside the original SSF so that the filtering process adapts to the specific spatial relationship between the array and the speaker.
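The modulation step described above can be illustrated with a minimal sketch of a FiLM layer: a conditioning vector is mapped to one scale (gamma) and one shift (beta) per feature channel, and each channel is transformed as gamma * x + beta. The conditioning vector, weights, and shapes below are hypothetical toy values; they stand in for a geometry feature and are not the paper's DOA-MPE or its network.

```python
def linear(vec, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation of a stack of channels.

    features: list of C channels, each a flat list of activations
    cond:     conditioning vector (hypothetical stand-in for a geometry feature)
    """
    gamma = linear(cond, w_gamma, b_gamma)  # one scale per channel
    beta = linear(cond, w_beta, b_beta)     # one shift per channel
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]

# Toy example: 2 channels, 3-dimensional conditioning vector (assumed values).
features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
cond = [1.0, 0.0, -1.0]
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_gamma = [1.0, 1.0]
w_beta = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.0]]
b_beta = [0.0, 0.0]

modulated = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
```

With these values the first channel is doubled and the second is suppressed to a constant, which shows how a change in the conditioning input alone can reweight intermediate features without touching the main network's weights.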
What carries the argument
Geometry-conditioning branch using FiLM layers driven by the DOA-MPE feature that jointly represents microphone positions and target direction of arrival
If this is right
- The model maintains high spatial selectivity on circular, uniform linear, and random microphone arrays.
- Performance degrades less than the baseline SSF when the test array geometry differs from the training geometry.
- The same trained model can be deployed on hardware with different array layouts, for a fixed number of microphones, without per-geometry retraining.
- The filtering process adapts to the concrete spatial layout of the array it is given.
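The three array families named in this list can be pictured as simple 2D coordinate generators. The sketch below is illustrative only: the microphone count, radius, and spacing are assumed values, and the sampling scheme for the random array is a guess, not the paper's simulation setup.

```python
import math
import random

def circular_array(num_mics, radius):
    """Microphones equally spaced on a circle around the origin."""
    return [(radius * math.cos(2 * math.pi * m / num_mics),
             radius * math.sin(2 * math.pi * m / num_mics))
            for m in range(num_mics)]

def uniform_linear_array(num_mics, spacing):
    """Microphones equally spaced along the x axis, centered on the origin."""
    offset = (num_mics - 1) * spacing / 2
    return [(m * spacing - offset, 0.0) for m in range(num_mics)]

def random_array(num_mics, max_radius, seed=0):
    """Microphones drawn uniformly over a disc (assumed sampling scheme)."""
    rng = random.Random(seed)
    positions = []
    for _ in range(num_mics):
        r = max_radius * math.sqrt(rng.random())  # sqrt gives uniform area density
        phi = 2 * math.pi * rng.random()
        positions.append((r * math.cos(phi), r * math.sin(phi)))
    return positions

# Hypothetical sizes: 4 microphones, 5 cm radius, 4 cm spacing.
geometries = {
    "circular": circular_array(4, 0.05),
    "linear": uniform_linear_array(4, 0.04),
    "random": random_array(4, 0.05),
}
```

A mismatched-geometry test in this picture means training on one entry of `geometries` and evaluating on another while keeping the number of microphones the same.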
Where Pith is reading between the lines
- Deployment pipelines could stop training separate models for every device form factor.
- Ad-hoc microphone setups assembled from consumer devices might become practical for speaker extraction.
- The same conditioning pattern could be tested on related tasks such as multi-speaker separation or dereverberation.
Load-bearing premise
The FiLM-based conditioning branch and DOA-MPE feature can reliably capture and apply the spatial relationship between microphone positions and target speaker direction.
What would settle it
Evaluating the GC-SSF on a previously unseen microphone array geometry and observing no improvement or a drop in performance relative to the unconditioned SSF would falsify the generalization claim.
Original abstract
Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades significantly when evaluated on mismatched array geometries. In this paper, we propose a geometry-conditioned SSF (GC-SSF), which incorporates a geometry-conditioning branch based on FiLM layers. Furthermore, we propose a feature that jointly encodes the DOA and the microphone positions (DOA-MPE). The conditioning branch modulates the intermediate feature maps of the SSF using the DOA-MPE feature to capture the spatial relationship between the microphone positions and the target speaker. Experimental results across circular, uniform linear, and random microphone arrays show that the proposed GC-SSF generalizes better to mismatched geometries while maintaining high spatial selectivity, demonstrating its ability to effectively adapt the filtering process to different array geometries.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a geometry-conditioned spatially selective non-linear filter (GC-SSF) for multi-channel target speaker extraction. It augments a prior SSF architecture with a geometry-conditioning branch that employs FiLM layers driven by a new DOA-MPE feature encoding both target direction-of-arrival and microphone positions. The conditioning is intended to modulate intermediate feature maps so that the same trained weights maintain spatial selectivity on unseen array layouts. Experiments are reported on circular, uniform linear, and random microphone arrays, with the central claim that GC-SSF generalizes better to mismatched geometries than the baseline SSF while preserving high selectivity.
Significance. If the central claim holds, the work would address a practical barrier in deploying learned spatial filters, since real-world microphone arrays rarely match training geometries exactly. The explicit conditioning mechanism is a direct response to the geometry-tied features identified in prior SSF work. Credit is due for testing across multiple array types rather than a single mismatched case; however, the significance remains moderate until quantitative metrics, ablation results, and invariance arguments are provided to substantiate the generalization.
major comments (2)
- [Proposed Method] The geometry-conditioning branch (FiLM layers + DOA-MPE): the manuscript supplies no derivation or invariance argument showing why this particular encoding must transfer to arbitrary microphone-position / DOA relationships. If DOA-MPE is effectively a concatenation or embedding that does not explicitly encode relative distances or angles in a geometry-invariant manner, performance on mismatched geometries could still degrade due to overfitting to training-array statistics. This is load-bearing for the headline claim.
- [Experiments] Experimental results section: the abstract states positive results across array types but the provided description lacks specific quantitative metrics (e.g., SI-SDR, PESQ, or selectivity measures), baseline comparisons on mismatched geometries, training details, or ablation studies isolating the contribution of the FiLM/DOA-MPE branch. Without these, the extent of improvement and the support for the generalization claim cannot be fully assessed.
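For readers unfamiliar with the first metric named in the comment above, the following is a minimal sketch of scale-invariant SDR using its standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. The signals are toy values, mean removal is omitted for brevity, and nothing here reproduces the paper's evaluation code.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / ref_energy                      # optimal scaling of the reference
    target = [scale * r for r in reference]       # projection of estimate on reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf                           # perfect estimate up to scaling
    return 10 * math.log10(target_energy / residual_energy)

# Toy signals: the estimate is the reference plus a constant offset of 0.1.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 1.1, -0.9]

score = si_sdr(estimate, reference)               # about 20 dB
score_rescaled = si_sdr([2 * e for e in estimate], reference)
```

Rescaling the estimate leaves the score unchanged, which is the property that makes the metric suitable for comparing systems whose output gain differs.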
minor comments (2)
- [Proposed Method] Notation for the DOA-MPE feature should be formalized with an explicit equation or diagram showing its construction from microphone coordinates and DOA.
- [Experiments] Figure captions and axis labels for array geometry illustrations should be clarified to indicate which arrays were seen during training versus evaluation.
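To make the first minor comment concrete, here is a sketch of one textbook quantity that couples microphone coordinates with a DOA: the far-field plane-wave delay at each microphone relative to the array origin. This is explicitly not the paper's DOA-MPE, whose construction is not given in this review; the function names and the 343 m/s sound speed are assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def far_field_delays(mic_positions, doa_deg):
    """Arrival delay (s) per microphone, relative to the origin, for a
    plane wave arriving from azimuth `doa_deg` in the array plane."""
    theta = math.radians(doa_deg)
    ux, uy = math.cos(theta), math.sin(theta)     # unit vector toward the source
    # Microphones displaced toward the source receive the wavefront earlier.
    return [-(x * ux + y * uy) / SPEED_OF_SOUND for x, y in mic_positions]

def rotate(mic_positions, angle_deg):
    """Rotate all microphone coordinates about the origin."""
    a = math.radians(angle_deg)
    return [(x * math.cos(a) - y * math.sin(a),
             x * math.sin(a) + y * math.cos(a)) for x, y in mic_positions]

# Hypothetical 4-microphone circular array with 5 cm radius.
mics = [(0.05, 0.0), (0.0, 0.05), (-0.05, 0.0), (0.0, -0.05)]

delays = far_field_delays(mics, 40.0)
# Rotating the array and the DOA by the same angle leaves the delays unchanged.
delays_rotated = far_field_delays(rotate(mics, 30.0), 70.0)
max_diff = max(abs(a - b) for a, b in zip(delays, delays_rotated))
```

The invariance under joint rotation is the kind of property the first major comment asks about: a representation built on relative quantities like these depends on how the array sits with respect to the source, not on its absolute orientation.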
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the opportunity to clarify and strengthen our manuscript. We address the major comments point by point below.
-
Referee: [Proposed Method] The geometry-conditioning branch (FiLM layers + DOA-MPE): the manuscript supplies no derivation or invariance argument showing why this particular encoding must transfer to arbitrary microphone-position / DOA relationships. If DOA-MPE is effectively a concatenation or embedding that does not explicitly encode relative distances or angles in a geometry-invariant manner, performance on mismatched geometries could still degrade due to overfitting to training-array statistics. This is load-bearing for the headline claim.
Authors: We agree that a more explicit discussion of the design rationale would strengthen the paper. In the revision we will add a subsection explaining the construction of DOA-MPE as a joint representation of target DOA and microphone coordinates; the FiLM layers then learn to modulate features according to relative geometry rather than absolute layout. While a formal invariance proof is difficult for a learned model, the empirical results on random arrays (which have no fixed structure) provide evidence that the conditioning does not simply memorize training-array statistics. We will also report additional cross-geometry transfer experiments to further support this claim. revision: yes
-
Referee: [Experiments] Experimental results section: the abstract states positive results across array types but the provided description lacks specific quantitative metrics (e.g., SI-SDR, PESQ, or selectivity measures), baseline comparisons on mismatched geometries, training details, or ablation studies isolating the contribution of the FiLM/DOA-MPE branch. Without these, the extent of improvement and the support for the generalization claim cannot be fully assessed.
Authors: We acknowledge that the experimental section in the submitted version was insufficiently detailed. The full manuscript contains the requested metrics and comparisons, but we will expand the revision to include explicit tables reporting SI-SDR, PESQ, and selectivity scores for matched and mismatched geometries, direct baseline comparisons against the original SSF on all tested arrays, complete training hyperparameters, and ablation results that isolate the FiLM/DOA-MPE branch. These additions will make the quantitative support for the generalization claim fully transparent. revision: yes
Circularity Check
No significant circularity; architecture and claims are self-contained
The paper introduces an explicit new architecture (GC-SSF with FiLM-based geometry-conditioning branch and DOA-MPE feature) rather than deriving results from parameters fitted to evaluation data or reducing claims to prior self-citations. Generalization to mismatched arrays is asserted via direct experimental comparison on circular, linear, and random geometries, which constitutes independent empirical validation outside any fitted input. No self-definitional equations, fitted-input predictions, uniqueness theorems, or ansatz smuggling appear in the derivation chain. The central claim rests on the added conditioning mechanism and its measured performance, keeping the work non-circular.
Axiom & Free-Parameter Ledger
free parameters (1)
- FiLM layer parameters
axioms (1)
- domain assumption: Geometry information can be effectively injected via FiLM conditioning to adapt spatial filtering without degrading selectivity.
invented entities (2)
- DOA-MPE feature: no independent evidence
- GC-SSF: no independent evidence
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Extracting a target speaker from a mixture of speakers and background noise remains a fundamental challenge in acoustic signal processing [1]. To discriminate the target speaker from the interfering speakers, various cues have been proposed, such as enrollment utterances [2, 3], visual information [4, 5], and spatial features [6–11]. In this ...
-
[2]
integration of a geometry-conditioning branch into the baseline SSF (see Fig. 1), using a Feature-wise Linear Modulation (FiLM) layer [23] to modulate intermediate feature maps from the SSF system, 2) a DOA-Microphone Positional Encoding (DOA-MPE) feature, which effectively represents the spatial relationship between the microphone positions and the targ...
-
[3]
SPATIALLY SELECTIVE NON-LINEAR FILTER In this section, we review the spatially selective non-linear filter (SSF) for target speaker extraction [8], which serves as the baseline system. In the short-time Fourier transform (STFT) domain, the observed noisy speech signal at the m-th microphone for frequency bin f ∈ [1, F] and time frame t ∈ [1, T] is denoted by...
-
[4]
PROPOSED GEOMETRY-CONDITIONED SPATIALLY SELECTIVE NON-LINEAR FILTER To improve the generalization ability of the SSF system across different microphone array geometries for a fixed number of microphones, we propose to incorporate a geometry-conditioning branch into the SSF (see Fig. 1). This branch first transforms the microphone array geometry and t...
-
[5]
EXPERIMENTS This section first presents the experimental setup, including the training and evaluation datasets, the network structure, and the training procedure. Then, the experimental results are presented and discussed, evaluating the performance, generalization ability, and the spatial selectivity of the proposed GC-SSF system compared with the baseli...
-
[6]
CONCLUSIONS In this paper, we proposed the GC-SSF system, designed to achieve robust target speaker extraction across different array geometries for a fixed number of microphones. The proposed system extends the baseline SSF by incorporating an explicit geometry-conditioning branch via a FiLM layer and a proposed DOA-MPE feature to represent the spatial r...
-
[7]
Neural target speech extraction: An overview,
K. Zmolikova, M. Delcroix, T. Ochiai, K. Kinoshita, J. Černocký, and D. Yu, “Neural target speech extraction: An overview,” IEEE Signal Processing Magazine, vol. 40, pp. 8–29, 2023.
-
[8]
Single channel target speaker extraction and recognition with speaker beam,
M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 5554–5558.
-
[9]
Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction,
R. Sinha, C. Rollwage, and S. Doclo, “Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, pp. 63, 2024.
-
[10]
An overview of deep-learning-based audio-visual speech enhancement and separation,
D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, “An overview of deep-learning-based audio-visual speech enhancement and separation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 1368–1396, 2021.
-
[11]
AV-Sepformer: Cross-attention sepformer for audio-visual target speaker extraction,
J. Lin, X. Cai, H. Dinkel, J. Chen, Z. Yan, Y. Wang, J. Zhang, Z. Wu, Y. Wang, and H. Meng, “AV-Sepformer: Cross-attention sepformer for audio-visual target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023, pp. 1–5.
-
[12]
Combining spectral and spatial features for deep learning based blind speaker separation,
Z.-Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, pp. 457–468, 2019.
-
[13]
Beamformer-guided target speaker extraction,
M. Elminshawi, S. Raj Chetupalli, and E. A. P. Habets, “Beamformer-guided target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023, pp. 1–5.
-
[14]
Multi-channel speech separation using spatially selective deep non-linear filters,
K. Tesch and T. Gerkmann, “Multi-channel speech separation using spatially selective deep non-linear filters,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 32, pp. 542–553, 2024.
-
[15]
J. Kienegger, A. Mannanova, H. Fang, and T. Gerkmann, “Self-steering deep non-linear spatially selective filters for efficient extraction of moving speakers under weak guidance,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Tahoe City, USA, Oct. 2025, pp. 1–5.
-
[16]
GAN-based multi-microphone spatial target speaker extraction,
S. S. Shetu, E. A. P. Habets, and A. Brendel, “GAN-based multi-microphone spatial target speaker extraction,” in arXiv, 2025.
-
[17]
Leveraging boolean directivity embedding for binaural target speaker extraction,
Y. Wang, J. Zhang, C. Jiang, W. Zhang, Z. Ye, and L. Dai, “Leveraging boolean directivity embedding for binaural target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, April 2025, pp. 1–5.
-
[18]
A. Mannanova, K. Tesch, J.-M. Lemercier, and T. Gerkmann, “Meta-learning for variable array configurations in end-to-end few-shot multichannel speech enhancement,” in Proc. International Workshop on Acoustic Signal Enhancement, Aalborg, Denmark, 2024, pp. 200–204.
-
[19]
End-to-end microphone permutation and number invariant multi-channel speech separation,
Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2020, pp. 6394–6398.
-
[20]
Flexible multichannel speech enhancement for noise-robust frontend,
A. Jukić, J. Balam, and B. Ginsburg, “Flexible multichannel speech enhancement for noise-robust frontend,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, Oct. 2023, pp. 1–5.
-
[21]
Array geometry-robust attention-based neural beamformer for moving speakers,
M. Tammen, T. Ochiai, M. Delcroix, T. Nakatani, S. Araki, and S. Doclo, “Array geometry-robust attention-based neural beamformer for moving speakers,” in Proc. Interspeech, Kos, Greece, Sep. 2024, pp. 3345–3349.
-
[22]
DeFTAN-AA: Array geometry agnostic multichannel speech enhancement,
D. Lee and J.-W. Choi, “DeFTAN-AA: Array geometry agnostic multichannel speech enhancement,” in Proc. Interspeech, Kos, Greece, Sep. 2024, pp. 3360–3364.
-
[23]
Ambidrop: Array-agnostic speech enhancement using ambisonics encoding and dropout-based learning,
M. Tatarjitzky and B. Rafaely, “Ambidrop: Array-agnostic speech enhancement using ambisonics encoding and dropout-based learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 14732–14736.
-
[24]
Eigenbeam-feature-based multi-order encoder for geometry-agnostic speech enhancement,
D. Zhang, A. I. Mezza, F. Miotello, J. Chen, M. Wang, F. Antonacci, and A. Bernardini, “Eigenbeam-feature-based multi-order encoder for geometry-agnostic speech enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 22192–22196.
-
[25]
Flexio: Flexible single- and multi-channel speech separation and enhancement,
Y. Masuyama, K. Saijo, F. Paissan, J. Han, M. Delcroix, R. Aihara, F. G. Germain, G. Wichern, and J. Le Roux, “Flexio: Flexible single- and multi-channel speech separation and enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 14417–14421.
-
[26]
Geometry-aware DOA estimation using a deep neural network with mixed-data input features,
U. Kowalk, S. Doclo, and J. Bitzer, “Geometry-aware DOA estimation using a deep neural network with mixed-data input features,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, Jun. 2023, pp. 1–5.
-
[27]
M.-S. Baek, J.-H. Chang, and I. Cohen, “DNN-based geometry-invariant DOA estimation with microphone positional encoding and complexity gradual training,” IEEE Trans. on Audio, Speech and Language Processing, vol. 33, pp. 2360–2376, 2025.
-
[28]
A unified geometry-aware source localization and separation framework for ad-hoc microphone array,
J. Fan, R. Gu, Y. Luo, and C. Pang, “A unified geometry-aware source localization and separation framework for ad-hoc microphone array,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Seoul, Korea, Apr. 2024, pp. 725–729.
-
[29]
FiLM: Visual reasoning with a general conditioning layer,
E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” Proc. AAAI Conference on Artificial Intelligence, vol. 32, 2018.
-
[30]
Pyroomacoustics: A python package for audio room simulation and array processing algorithms,
R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018, pp. 351–355.
-
[31]
CSR-I (WSJ0) Complete LDC93S6A,
J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete LDC93S6A,” Linguistic Data Consortium, Philadelphia, May 2007
-
[32]
A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, USA, May 2001, vol. 2, pp. 749–752.
-
[33]
SDR – half-baked or well done?,
J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, May 2019, pp. 626–630.