
arxiv: 2605.18442 · v1 · pith:CXHPACCVnew · submitted 2026-05-18 · 📡 eess.AS

Flexible Multi-Channel Target Speaker Extraction Using Geometry-Conditioned Spatially Selective Non-linear Filters

Pith reviewed 2026-05-19 23:34 UTC · model grok-4.3

classification 📡 eess.AS
keywords target speaker extraction · microphone array · geometry conditioning · spatial filtering · FiLM layers · direction of arrival · multi-channel audio

The pith

Geometry conditioning lets a spatially selective filter generalize target speaker extraction across different microphone array shapes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a geometry-conditioned spatially selective non-linear filter (GC-SSF) for extracting a target speaker from multi-channel audio. Standard SSF models tie learned features to one fixed microphone geometry, so performance drops sharply on new array layouts. Adding a conditioning branch that uses FiLM layers and a joint DOA-microphone-position encoding modulates the filter to match the actual spatial setup. This matters in practice because devices and rooms use many different microphone placements, from linear phone arrays to circular conference tables. Experiments on circular, linear, and random arrays show the conditioned version retains strong direction selectivity while improving robustness to geometry mismatch.

Core claim

The GC-SSF adds a geometry-conditioning branch built from FiLM layers that receives a DOA-MPE feature encoding both the target direction of arrival and the microphone positions; this branch modulates the intermediate feature maps inside the original SSF so that the filtering process adapts to the specific spatial relationship between the array and the speaker.
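
The modulation step described above can be sketched in a few lines. This is a minimal illustration of a generic FiLM layer (a per-channel scale and shift predicted from a conditioning vector), not the paper's network: the function name `film`, the linear projections, the array shapes, and the random conditioning vector standing in for a geometry feature are all assumptions made for the example.

```python
import numpy as np


def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation: gamma(cond) * features + beta(cond).

    features: (channels, frames) intermediate feature map
    cond:     (cond_dim,) conditioning vector (stand-in for a geometry feature)
    """
    gamma = w_gamma @ cond + b_gamma  # per-channel scale, shape (channels,)
    beta = w_beta @ cond + b_beta     # per-channel shift, shape (channels,)
    # Broadcast the per-channel scale and shift over the frame axis.
    return gamma[:, None] * features + beta[:, None]


rng = np.random.default_rng(0)
channels, frames, cond_dim = 4, 6, 3
features = rng.standard_normal((channels, frames))
cond = rng.standard_normal(cond_dim)  # hypothetical conditioning vector
w_gamma = 0.1 * rng.standard_normal((channels, cond_dim))
w_beta = 0.1 * rng.standard_normal((channels, cond_dim))

modulated = film(features, cond, w_gamma, np.ones(channels), w_beta, np.zeros(channels))
print(modulated.shape)  # (4, 6)
```

With zero projection weights, a scale bias of one, and a shift bias of zero, the layer reduces to the identity, so in this sketch the conditioning can only change the features through the learned projections.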

What carries the argument

Geometry-conditioning branch using FiLM layers driven by the DOA-MPE feature that jointly represents microphone positions and target direction of arrival

If this is right

  • The model maintains high spatial selectivity on circular, uniform linear, and random microphone arrays.
  • Performance degrades less than the baseline SSF when the test array geometry differs from the training geometry.
  • The same trained model can be deployed on varied hardware without per-geometry retraining.
  • The filtering process adapts to the concrete spatial layout of any given array.
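
To make the three array families concrete, the sketch below builds a circular, a uniform linear, and a random layout and computes the far-field delay of each microphone for one direction of arrival. It assumes a 2-D, free-field plane-wave model; the function names, array sizes, and the nominal speed of sound are illustrative choices, and the delay vector is a textbook quantity that depends jointly on microphone positions and DOA, not the paper's DOA-MPE feature.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, nominal room-temperature value (assumption)


def circular_array(num_mics, radius):
    """Microphones equally spaced on a circle centred at the origin."""
    angles = 2 * np.pi * np.arange(num_mics) / num_mics
    return radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)


def linear_array(num_mics, spacing):
    """Uniform linear array along the x-axis, centred at the origin."""
    x = (np.arange(num_mics) - (num_mics - 1) / 2) * spacing
    return np.stack([x, np.zeros(num_mics)], axis=1)


def random_array(num_mics, half_width, rng):
    """Microphones drawn uniformly inside a square of the given half-width."""
    return rng.uniform(-half_width, half_width, size=(num_mics, 2))


def far_field_delays(positions, doa_rad):
    """Delay in seconds of each microphone relative to the origin for a plane wave."""
    direction = np.array([np.cos(doa_rad), np.sin(doa_rad)])  # unit vector toward the source
    # Microphones closer to the source receive the wavefront earlier (negative delay).
    return -(positions @ direction) / SPEED_OF_SOUND


rng = np.random.default_rng(0)
arrays = {
    "circular": circular_array(4, radius=0.05),
    "linear": linear_array(4, spacing=0.04),
    "random": random_array(4, half_width=0.05, rng=rng),
}
doa = np.deg2rad(90.0)
for name, positions in arrays.items():
    print(name, np.round(far_field_delays(positions, doa) * 1e6, 1), "microseconds")
```

The same DOA produces a different delay pattern on each layout, which is the kind of geometry dependence a model trained on a single fixed array would have to absorb into its weights.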

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment pipelines could stop training separate models for every device form factor.
  • Ad-hoc microphone setups assembled from consumer devices might become practical for speaker extraction.
  • The same conditioning pattern could be tested on related tasks such as multi-speaker separation or dereverberation.

Load-bearing premise

The FiLM-based conditioning branch and DOA-MPE feature can reliably capture and apply the spatial relationship between microphone positions and target speaker direction.

What would settle it

Evaluating the GC-SSF on a previously unseen microphone array geometry and observing no improvement or a drop in performance relative to the unconditioned SSF would falsify the generalization claim.
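
A check of this kind needs a per-utterance score for each system on the unseen geometry. The sketch below implements scale-invariant SDR (SI-SDR), one of the metrics named in the referee report further down, and compares two estimates. The signals are synthetic placeholders, not outputs of either model, and the variable names are hypothetical.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))


rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)  # placeholder for the clean target speech
noise = rng.standard_normal(16000)

# Placeholder estimates standing in for two systems evaluated on the same unseen array.
estimate_a = reference + 0.1 * noise
estimate_b = reference + 0.5 * noise

score_a = si_sdr(estimate_a, reference)
score_b = si_sdr(estimate_b, reference)
print(f"system A: {score_a:.1f} dB, system B: {score_b:.1f} dB")
```

In an actual test, the two estimates would come from the conditioned and unconditioned models run on the same mixtures, and the comparison would be averaged over many utterances and target directions.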

Original abstract

Recently, a spatially selective non-linear filter (SSF) has been proposed for target speaker extraction, using the target direction-of-arrival (DOA) as a spatial cue. Since learned intermediate features are tied to the microphone geometry, the performance of the SSF degrades significantly when evaluated on mismatched array geometries. In this paper, we propose a geometry-conditioned SSF (GC-SSF), which incorporates a geometry-conditioning branch based on FiLM layers. Furthermore, we propose a feature that jointly encodes the DOA and the microphone positions (DOA-MPE). The conditioning branch modulates the intermediate feature maps of the SSF using the DOA-MPE feature to capture the spatial relationship between the microphone positions and the target speaker. Experimental results across circular, uniform linear, and random microphone arrays show that the proposed GC-SSF generalizes better to mismatched geometries while maintaining high spatial selectivity, demonstrating its ability to effectively adapt the filtering process to different array geometries.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a geometry-conditioned spatially selective non-linear filter (GC-SSF) for multi-channel target speaker extraction. It augments a prior SSF architecture with a geometry-conditioning branch that employs FiLM layers driven by a new DOA-MPE feature encoding both target direction-of-arrival and microphone positions. The conditioning is intended to modulate intermediate feature maps so that the same trained weights maintain spatial selectivity on unseen array layouts. Experiments are reported on circular, uniform linear, and random microphone arrays, with the central claim that GC-SSF generalizes better to mismatched geometries than the baseline SSF while preserving high selectivity.

Significance. If the central claim holds, the work would address a practical barrier in deploying learned spatial filters, since real-world microphone arrays rarely match training geometries exactly. The explicit conditioning mechanism is a direct response to the geometry-tied features identified in prior SSF work. Credit is due for testing across multiple array types rather than a single mismatched case; however, the significance remains moderate until quantitative metrics, ablation results, and invariance arguments are provided to substantiate the generalization.

major comments (2)
  1. [Proposed Method] The geometry-conditioning branch (FiLM layers + DOA-MPE): the manuscript supplies no derivation or invariance argument showing why this particular encoding must transfer to arbitrary microphone-position / DOA relationships. If DOA-MPE is effectively a concatenation or embedding that does not explicitly encode relative distances or angles in a geometry-invariant manner, performance on mismatched geometries could still degrade due to overfitting to training-array statistics. This is load-bearing for the headline claim.
  2. [Experiments] Experimental results section: the abstract states positive results across array types but the provided description lacks specific quantitative metrics (e.g., SI-SDR, PESQ, or selectivity measures), baseline comparisons on mismatched geometries, training details, or ablation studies isolating the contribution of the FiLM/DOA-MPE branch. Without these, the extent of improvement and the support for the generalization claim cannot be fully assessed.
minor comments (2)
  1. [Proposed Method] Notation for the DOA-MPE feature should be formalized with an explicit equation or diagram showing its construction from microphone coordinates and DOA.
  2. [Experiments] Figure captions and axis labels for array geometry illustrations should be clarified to indicate which arrays were seen during training versus evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the opportunity to clarify and strengthen our manuscript. We address the major comments point by point below.

  1. Referee: [Proposed Method] The geometry-conditioning branch (FiLM layers + DOA-MPE): the manuscript supplies no derivation or invariance argument showing why this particular encoding must transfer to arbitrary microphone-position / DOA relationships. If DOA-MPE is effectively a concatenation or embedding that does not explicitly encode relative distances or angles in a geometry-invariant manner, performance on mismatched geometries could still degrade due to overfitting to training-array statistics. This is load-bearing for the headline claim.

    Authors: We agree that a more explicit discussion of the design rationale would strengthen the paper. In the revision we will add a subsection explaining the construction of DOA-MPE as a joint representation of target DOA and microphone coordinates; the FiLM layers then learn to modulate features according to relative geometry rather than absolute layout. While a formal invariance proof is difficult for a learned model, the empirical results on random arrays (which have no fixed structure) provide evidence that the conditioning does not simply memorize training-array statistics. We will also report additional cross-geometry transfer experiments to further support this claim. revision: yes

  2. Referee: [Experiments] Experimental results section: the abstract states positive results across array types but the provided description lacks specific quantitative metrics (e.g., SI-SDR, PESQ, or selectivity measures), baseline comparisons on mismatched geometries, training details, or ablation studies isolating the contribution of the FiLM/DOA-MPE branch. Without these, the extent of improvement and the support for the generalization claim cannot be fully assessed.

    Authors: We acknowledge that the experimental section in the submitted version was insufficiently detailed. The full manuscript contains the requested metrics and comparisons, but we will expand the revision to include explicit tables reporting SI-SDR, PESQ, and selectivity scores for matched and mismatched geometries, direct baseline comparisons against the original SSF on all tested arrays, complete training hyperparameters, and ablation results that isolate the FiLM/DOA-MPE branch. These additions will make the quantitative support for the generalization claim fully transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; architecture and claims are self-contained

The paper introduces an explicit new architecture (GC-SSF with FiLM-based geometry-conditioning branch and DOA-MPE feature) rather than deriving results from parameters fitted to evaluation data or reducing claims to prior self-citations. Generalization to mismatched arrays is asserted via direct experimental comparison on circular, linear, and random geometries, which constitutes independent empirical validation outside any fitted input. No self-definitional equations, fitted-input predictions, uniqueness theorems, or ansatz smuggling appear in the derivation chain. The central claim rests on the added conditioning mechanism and its measured performance, keeping the work non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the effectiveness of newly introduced components whose performance is shown only through experiments on the paper's data.

free parameters (1)
  • FiLM layer parameters
    Learned scaling and shifting factors that modulate features based on geometry input.
axioms (1)
  • domain assumption: Geometry information can be effectively injected via FiLM conditioning to adapt spatial filtering without degrading selectivity.
    This premise underpins the adaptation mechanism described in the proposal.
invented entities (2)
  • DOA-MPE feature (no independent evidence)
    purpose: Joint encoding of direction-of-arrival and microphone positions
    New feature proposed to provide spatial relationship information to the conditioning branch.
  • GC-SSF (no independent evidence)
    purpose: Geometry-conditioned version of the spatially selective non-linear filter
    New model architecture that incorporates the conditioning branch.

pith-pipeline@v0.9.0 · 5704 in / 1320 out tokens · 38132 ms · 2026-05-19T23:34:37.366810+00:00 · methodology

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

  1. [1]

    INTRODUCTION Extracting a target speaker from a mixture of speakers and background noise remains a fundamental challenge in acoustic signal processing [1]. To discriminate the target speaker from the interfering speakers, various cues have been proposed, such as enrollment utterances [2, 3], visual information [4, 5], and spatial features [6–11]. In this ...

  2. [2]

    integration of a geometry-conditioning branch into the baseline SSF (see Fig. 1), using a Feature-wise Linear Modulation (FiLM) layer [23] to modulate intermediate feature maps from the SSF system, 2) a DOA-Microphone Positional Encoding (DOA-MPE) feature, which effectively represents the spatial relationship between the microphone positions and the targ...

  3. [3]

    SPATIALLY SELECTIVE NON-LINEAR FILTER In this section, we review the spatially selective non-linear filter (SSF) for target speaker extraction [8], which serves as the baseline system. In the short-time Fourier transform (STFT) domain, the observed noisy speech signal at the m-th microphone for frequency bin f ∈ [1, F] and time frame t ∈ [1, T] is denoted by...

  4. [4]

    PROPOSED GEOMETRY-CONDITIONED SPATIALLY SELECTIVE NON-LINEAR FILTER To improve the generalization ability of the SSF system across different microphone array geometries for a fixed number of microphones, we propose to incorporate a geometry-conditioning branch into the SSF (see Fig. 1). This branch first transforms the microphone array geometry and t...

  5. [5]

    EXPERIMENTS This section first presents the experimental setup, including the training and evaluation datasets, the network structure, and the training procedure. Then, the experimental results are presented and discussed, evaluating the performance, generalization ability, and the spatial selectivity of the proposed GC-SSF system compared with the baseli...

  6. [6]

    CONCLUSIONS In this paper, we proposed the GC-SSF system, designed to achieve robust target speaker extraction across different array geometries for a fixed number of microphones. The proposed system extends the baseline SSF by incorporating an explicit geometry-conditioning branch via a FiLM layer and a proposed DOA-MPE feature to represent the spatial r...

  7. [7]

    Neural target speech extraction: An overview,

    K. Zmolikova, M. Delcroix, T. Ochiai, K. Kinoshita, J. Černocký, and D. Yu, “Neural target speech extraction: An overview,” IEEE Signal Processing Magazine, vol. 40, pp. 8–29, 2023

  8. [8]

    Single channel target speaker extraction and recognition with speaker beam,

    M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, and T. Nakatani, “Single channel target speaker extraction and recognition with speaker beam,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, Apr. 2018, pp. 5554–5558

  9. [9]

    Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction,

    R. Sinha, C. Rollwage, and S. Doclo, “Variants of LSTM cells for single-channel speaker-conditioned target speaker extraction,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2024, pp. 63, 2024

  10. [10]

    An overview of deep-learning-based audio-visual speech enhancement and separation,

    D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, “An overview of deep-learning-based audio-visual speech enhancement and separation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 29, pp. 1368–1396, 2021

  11. [11]

    AV-Sepformer: Cross-attention sepformer for audio-visual target speaker extraction,

    J. Lin, X. Cai, H. Dinkel, J. Chen, Z. Yan, Y. Wang, J. Zhang, Z. Wu, Y. Wang, and H. Meng, “AV-Sepformer: Cross-attention sepformer for audio-visual target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023, pp. 1–5

  12. [12]

    Combining spectral and spatial features for deep learning based blind speaker separation,

    Z.-Q. Wang and D. Wang, “Combining spectral and spatial features for deep learning based blind speaker separation,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, pp. 457–468, 2019

  13. [13]

    Beamformer-guided target speaker extraction,

    M. Elminshawi, S. Raj Chetupalli, and E. A. P. Habets, “Beamformer-guided target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, June 2023, pp. 1–5

  14. [14]

    Multi-channel speech separation using spatially selective deep non-linear filters,

    K. Tesch and T. Gerkmann, “Multi-channel speech separation using spatially selective deep non-linear filters,” IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 32, pp. 542–553, 2024

  15. [15]

    Self-steering deep non-linear spatially selective filters for efficient extraction of moving speakers under weak guidance,

    J. Kienegger, A. Mannanova, H. Fang, and T. Gerkmann, “Self-steering deep non-linear spatially selective filters for efficient extraction of moving speakers under weak guidance,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Tahoe City, USA, Oct. 2025, pp. 1–5

  16. [16]

    GAN-based multi-microphone spatial target speaker extraction,

    S. S. Shetu, E. A. P. Habets, and A. Brendel, “GAN-based multi-microphone spatial target speaker extraction,” in arXiv, 2025

  17. [17]

    Leveraging boolean directivity embedding for binaural target speaker extraction,

    Y. Wang, J. Zhang, C. Jiang, W. Zhang, Z. Ye, and L. Dai, “Leveraging boolean directivity embedding for binaural target speaker extraction,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India, April 2025, pp. 1–5

  18. [18]

    Meta-learning for variable array configurations in end-to-end few-shot multichannel speech enhancement,

    A. Mannanova, K. Tesch, J.-M. Lemercier, and T. Gerkmann, “Meta-learning for variable array configurations in end-to-end few-shot multichannel speech enhancement,” in Proc. International Workshop on Acoustic Signal Enhancement, Aalborg, Denmark, 2024, pp. 200–204

  19. [19]

    End-to-end microphone permutation and number invariant multi-channel speech separation,

    Y. Luo, Z. Chen, N. Mesgarani, and T. Yoshioka, “End-to-end microphone permutation and number invariant multi-channel speech separation,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2020, pp. 6394–6398

  20. [20]

    Flexible multichannel speech enhancement for noise-robust frontend,

    A. Jukić, J. Balam, and B. Ginsburg, “Flexible multichannel speech enhancement for noise-robust frontend,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, USA, Oct. 2023, pp. 1–5

  21. [21]

    Array geometry-robust attention-based neural beamformer for moving speakers,

    M. Tammen, T. Ochiai, M. Delcroix, T. Nakatani, S. Araki, and S. Doclo, “Array geometry-robust attention-based neural beamformer for moving speakers,” in Proc. Interspeech, Kos, Greece, Sep. 2024, pp. 3345–3349

  22. [22]

    DeFTAN-AA: Array geometry agnostic multichannel speech enhancement,

    D. Lee and J.-W. Choi, “DeFTAN-AA: Array geometry agnostic multichannel speech enhancement,” in Proc. Interspeech, Kos, Greece, Sep. 2024, pp. 3360–3364

  23. [23]

    Ambidrop: Array-agnostic speech enhancement using ambisonics encoding and dropout-based learning,

    M. Tatarjitzky and B. Rafaely, “Ambidrop: Array-agnostic speech enhancement using ambisonics encoding and dropout-based learning,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 14732–14736

  24. [24]

    Eigenbeam-feature-based multi-order encoder for geometry-agnostic speech enhancement,

    D. Zhang, A. I. Mezza, F. Miotello, J. Chen, M. Wang, F. Antonacci, and A. Bernardini, “Eigenbeam-feature-based multi-order encoder for geometry-agnostic speech enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 22192–22196

  25. [25]

    Flexio: Flexible single- and multi-channel speech separation and enhancement,

    Y. Masuyama, K. Saijo, F. Paissan, J. Han, M. Delcroix, R. Aihara, F. G. Germain, G. Wichern, and J. Le Roux, “Flexio: Flexible single- and multi-channel speech separation and enhancement,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, May 2026, pp. 14417–14421

  26. [26]

    Geometry-aware DOA estimation using a deep neural network with mixed-data input features,

    U. Kowalk, S. Doclo, and J. Bitzer, “Geometry-aware DOA estimation using a deep neural network with mixed-data input features,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Rhodes Island, Greece, Jun. 2023, pp. 1–5

  27. [27]

    DNN-based geometry-invariant DOA estimation with microphone positional encoding and complexity gradual training,

    M.-S. Baek, J.-H. Chang, and I. Cohen, “DNN-based geometry-invariant DOA estimation with microphone positional encoding and complexity gradual training,” IEEE Trans. on Audio, Speech and Language Processing, vol. 33, pp. 2360–2376, 2025

  28. [28]

    A unified geometry-aware source localization and separation framework for ad-hoc microphone array,

    J. Fan, R. Gu, Y. Luo, and C. Pang, “A unified geometry-aware source localization and separation framework for ad-hoc microphone array,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Seoul, Korea, Apr. 2024, pp. 725–729

  29. [29]

    FiLM: Visual reasoning with a general conditioning layer,

    E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” Proc. AAAI Conference on Artificial Intelligence, vol. 32, 2018

  30. [30]

    Pyroomacoustics: A python package for audio room simulation and array processing algorithms,

    R. Scheibler, E. Bezzam, and I. Dokmanić, “Pyroomacoustics: A python package for audio room simulation and array processing algorithms,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Canada, 2018, pp. 351–355

  31. [31]

    CSR-I (WSJ0) Complete LDC93S6A,

    J. S. Garofolo, D. Graff, D. Paul, and D. Pallett, “CSR-I (WSJ0) Complete LDC93S6A,” Linguistic Data Consortium, Philadelphia, May 2007

  32. [32]

    Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,

    A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Salt Lake City, USA, May 2001, vol. 2, pp. 749–752

  33. [33]

    SDR – half-baked or well done?,

    J. L. Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, “SDR – half-baked or well done?,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, May 2019, pp. 626–630