
arxiv: 2605.21891 · v1 · pith:5SCNKJ76new · submitted 2026-05-21 · 📡 eess.AS

Neighbor-Consistent Neural Filters for Robust Personal Sound Zones Under Localization Uncertainty

Pith reviewed 2026-05-22 03:17 UTC · model grok-4.3

classification 📡 eess.AS
keywords personal sound zones · neural filters · localization uncertainty · neighbor consistency · audio rendering · head tracking · robustness

The pith

Neighbor consistency regularization stabilizes personal sound zone filters against localization uncertainty

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces neighbor-consistent neural filters for head-tracked personal sound zones to counter sensitivity to localization uncertainty from tracking jitter or occlusions. By adding a penalty on filter differences at randomly perturbed neighboring coordinates during training, the coordinate-to-filter mapping becomes more stable. This regularization reduces spatial variation rates while largely preserving the acoustic isolation between zones. Simulation and in-situ measurements confirm the gains in robustness without major changes to array geometry or transfer functions.

Core claim

Neighbor consistency regularization applied during training of coordinate-conditioned neural networks reduces the root-mean-square variation rate of generated filters by up to 55.9 percent in the woofer band and 30.3 percent in the tweeter band while largely preserving isolation quality and improving lower-tail robustness; physical measurements with a 24-driver array show up to 16.9 percent better worst-case neighborhood isolation and up to 61.8 percent lower spatial variation rates.
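One way to read the variation-rate figures is as a finite-difference sensitivity of an isolation metric to the coordinate input. The sketch below assumes that definition; the paper's exact formula is not reproduced on this page. The function `rms_variation_rate`, the linear toy metrics, and the neighbor offsets are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def rms_variation_rate(metric, anchor, neighbors):
    """RMS over neighbors of |metric(neighbor) - metric(anchor)| / distance.

    Assumed definition, used here only for illustration.
    """
    m0 = metric(anchor)
    rates = []
    for p in neighbors:
        dist = math.dist(anchor, p)
        if dist == 0.0:
            continue  # the rate is undefined at zero distance
        rates.append(abs(metric(p) - m0) / dist)
    if not rates:
        raise ValueError("need at least one neighbor distinct from the anchor")
    return math.sqrt(sum(r * r for r in rates) / len(rates))


# Toy isolation metrics in dB as a function of the coordinate fed to the model.
def steep_metric(p):
    return 20.0 + 3.0 * p[0]   # hypothetical baseline: 3 dB per metre


def smooth_metric(p):
    return 20.0 + 1.0 * p[0]   # hypothetical regularized model: 1 dB per metre


anchor = (0.0, 0.0)
neighbors = [(0.01, 0.0), (-0.02, 0.0), (0.05, 0.0)]

steep = rms_variation_rate(steep_metric, anchor, neighbors)
smooth = rms_variation_rate(smooth_metric, anchor, neighbors)
reduction_pct = 100.0 * (1.0 - smooth / steep)
print(steep, smooth, reduction_pct)
```

Under this reading, a percentage such as those quoted above is the relative drop in the rate between a baseline and a regularized model evaluated on the same neighborhood.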

What carries the argument

Neighbor-consistency regularization term that penalizes differences between filters generated at an anchor coordinate and at randomly sampled neighboring coordinates during training of the neural network.
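A minimal sketch of such a penalty follows, written against a toy coordinate-to-filter map rather than the paper's network. The squared-difference form, the uniform-disk sampling, and every name here (`toy_filter_map`, `sample_neighbor`, `neighbor_consistency_penalty`) are assumptions made for illustration; the page does not state the exact loss or sampling scheme.

```python
import math
import random


def toy_filter_map(coord, n_taps=8):
    """Hypothetical stand-in for the trained network: 2-D coordinate -> filter taps."""
    x, y = coord
    return [math.sin((k + 1) * x) + math.cos((k + 1) * y) for k in range(n_taps)]


def sample_neighbor(anchor, radius, rng):
    """Draw a point uniformly from a disk around the anchor (assumed sampling scheme)."""
    r = radius * math.sqrt(rng.random())   # sqrt gives uniform density over the disk area
    theta = 2.0 * math.pi * rng.random()
    return (anchor[0] + r * math.cos(theta), anchor[1] + r * math.sin(theta))


def neighbor_consistency_penalty(filter_map, anchor, radius, n_neighbors, rng):
    """Mean squared filter difference between the anchor and sampled neighbors."""
    h_anchor = filter_map(anchor)
    total = 0.0
    for _ in range(n_neighbors):
        h_nb = filter_map(sample_neighbor(anchor, radius, rng))
        total += sum((a - b) ** 2 for a, b in zip(h_anchor, h_nb))
    return total / n_neighbors


rng = random.Random(0)
anchor = (0.3, -0.2)
small = neighbor_consistency_penalty(toy_filter_map, anchor, 0.01, 200, rng)
large = neighbor_consistency_penalty(toy_filter_map, anchor, 0.10, 200, rng)
print(small, large)   # the penalty grows with the perturbation radius
```

In training, a term like this would be added to the primary isolation objective with a weight, so the optimizer trades filter smoothness against isolation quality.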

Load-bearing premise

Penalizing filter differences only at randomly sampled neighboring coordinates during training will produce stable behavior under the distribution of real-world localization noise without changes to acoustic transfer functions or array geometry.

What would settle it

Apply localization perturbations drawn from a distribution different from the random sampling used in training, such as systematic optical distortion or occlusion-induced bias, and measure whether variation rates and isolation degrade.
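The difference between the two kinds of perturbation can be made concrete with a small sampler. This is a sketch under stated assumptions: the uniform-disk jitter stands in for the training distribution, a constant offset stands in for occlusion-induced bias, and the radius and bias values are hypothetical numbers not taken from the paper.

```python
import math
import random


def zero_mean_jitter(radius, rng):
    """Offset drawn uniformly from a disk: unstructured noise with no preferred direction."""
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (r * math.cos(theta), r * math.sin(theta))


def biased_jitter(radius, bias, rng):
    """Same jitter plus a constant offset, a crude model of a systematic tracking error."""
    dx, dy = zero_mean_jitter(radius, rng)
    return (dx + bias[0], dy + bias[1])


def mean_offset(samples):
    n = len(samples)
    return (sum(s[0] for s in samples) / n, sum(s[1] for s in samples) / n)


def fraction_outside(samples, radius):
    """Share of samples that fall outside the disk of the given radius."""
    return sum(1 for s in samples if math.hypot(s[0], s[1]) > radius) / len(samples)


rng = random.Random(0)
radius = 0.02        # hypothetical training radius in metres
bias = (0.03, 0.0)   # hypothetical systematic offset in metres

train_like = [zero_mean_jitter(radius, rng) for _ in range(5000)]
shifted = [biased_jitter(radius, bias, rng) for _ in range(5000)]

print(mean_offset(train_like), fraction_outside(train_like, radius))
print(mean_offset(shifted), fraction_outside(shifted, radius))
```

With these numbers most of the biased samples land outside the region the regularizer ever saw, which is the regime the proposed test is meant to probe.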

Figures

Figures reproduced from arXiv: 2605.21891 by Edgar Choueiri, Hao Jiang.

Figure 1: Coordinate-conditioned neural PSZ filter generation for a split-band (woofer–tweeter) system using two independently trained models. The woofer …
Figure 2: Simulation (woofer, Listener 2): per-anchor distributions of IZI/IPI quality summaries (median and CVaR10; higher is better) and stability summaries …
Figure 3: Simulation (woofer, Listener 2): one-anchor example of the metric landscape under coordinate perturbations. Each map plots the frequency-averaged …
Figure 4: Simulation (tweeter, Listener 2): per-anchor distributions of IZI/IPI quality summaries (median and CVaR10; higher is better) and stability summaries …
Figure 5: Simulation (tweeter, Listener 2): one-anchor example of the metric landscape under coordinate perturbations. Each map plots the frequency-averaged …
Figure 6: In-situ measurement setup. (Top) Photograph of the listening room …
Figure 7: Measurements: One-anchor example (Anchor 2) showing …
Original abstract

Coordinate-conditioned neural networks can generate head-tracked personal sound zone (PSZ) loudspeaker filters in real time, but they are sensitive to localization uncertainty. Small fluctuations in estimated listener coordinates, caused by optical distortion, temporary occlusions, or tracking jitter, may produce large filter changes even when listeners are physically stationary. This paper proposes neighbor-consistent neural filters that regularize the coordinate-to-filter mapping by penalizing filter differences at randomly perturbed neighboring coordinates during training. To evaluate robustness against tracking noise, we introduce a decoupled protocol that fixes the acoustic transfer functions at a physical anchor while perturbing only the coordinate inputs used for filter generation. Isolation quality and local stability are evaluated using neighborhood median and lower-tail statistics of inter-zone and inter-program isolation, together with spatial variation rates that quantify metric sensitivity within a coordinate neighborhood. In simulation with a split-band woofer-tweeter system and 25 randomly sampled anchor positions, neighbor consistency reduces the root-mean-square (RMS) variation rate by up to 55.9% in the woofer band and 30.3% in the tweeter band while largely preserving isolation quality and improving lower-tail robustness. In in-situ measurements using a 24-driver array and two stationary head-and-torso simulators, the proposed regularization improves worst-case neighborhood isolation by up to 16.9% and reduces spatial variation rates by up to 61.8%. These results demonstrate that neighbor-consistency regularization effectively stabilizes PSZ rendering under localization uncertainty.
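The decoupled protocol and the neighborhood statistics described in the abstract can be sketched with a toy model. Everything concrete here is an assumption: the real-valued single-frequency transfer functions, the `toy_filter_map`, the square perturbation neighborhood, and the reading of CVaR10 as the mean of the lowest 10% of scores. The point is only the structure: the transfer functions stay fixed at the anchor while the coordinate fed to the filter generator is perturbed.

```python
import math
import random
import statistics


def toy_filter_map(coord, n_drivers=4):
    """Hypothetical coordinate-to-filter map: one real gain per driver."""
    x, y = coord
    return [math.cos(3.0 * (k + 1) * x) + 0.5 * math.sin(2.0 * (k + 1) * y)
            for k in range(n_drivers)]


def isolation_db(filters, atf_bright, atf_dark):
    """Bright-to-dark energy ratio in dB for fixed transfer functions."""
    p_bright = sum(g * h for g, h in zip(atf_bright, filters))
    p_dark = sum(g * h for g, h in zip(atf_dark, filters))
    return 10.0 * math.log10((p_bright ** 2 + 1e-12) / (p_dark ** 2 + 1e-12))


def cvar_lower(values, alpha=0.10):
    """Mean of the lowest alpha fraction of the values (lower-tail summary)."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def decoupled_neighborhood_stats(anchor, radius, n_samples, atf_bright, atf_dark, rng):
    """Perturb only the coordinate input; the transfer functions stay fixed at the anchor."""
    scores = []
    for _ in range(n_samples):
        noisy = (anchor[0] + rng.uniform(-radius, radius),
                 anchor[1] + rng.uniform(-radius, radius))
        scores.append(isolation_db(toy_filter_map(noisy), atf_bright, atf_dark))
    return statistics.median(scores), cvar_lower(scores, 0.10)


rng = random.Random(0)
atf_bright = [1.0, 0.8, 0.6, 0.4]    # hypothetical gains to the target listener
atf_dark = [0.4, -0.6, 0.8, -1.0]    # hypothetical gains to the other listener
med, tail = decoupled_neighborhood_stats((0.3, -0.2), 0.02, 500, atf_bright, atf_dark, rng)
print(med, tail)
```

Reporting the lower-tail mean alongside the median separates typical behavior from the worst perturbations in the neighborhood, which is where instability would show up first.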

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes neighbor-consistent neural filters for head-tracked personal sound zones (PSZ) by adding a regularization term that penalizes filter differences at randomly perturbed neighboring listener coordinates during training. This aims to reduce sensitivity of the coordinate-to-filter mapping to localization uncertainty from optical distortion, occlusions, or jitter. A decoupled evaluation protocol is introduced that holds acoustic transfer functions fixed at physical anchors while perturbing only the coordinate inputs. Quantitative results are reported from simulation (split-band woofer-tweeter system, 25 anchor positions) showing up to 55.9% and 30.3% reductions in RMS variation rate for woofer and tweeter bands, and from in-situ measurements (24-driver array, two head-and-torso simulators) showing up to 16.9% improvement in worst-case neighborhood isolation and 61.8% reduction in spatial variation rates, while largely preserving isolation quality.

Significance. If the central claim holds, the work offers a practical, low-overhead regularization for stabilizing real-time PSZ rendering under realistic tracking noise without altering array geometry or acoustic transfer functions. The decoupled evaluation protocol is a useful methodological contribution for isolating coordinate sensitivity. Credit is due for combining simulation across multiple anchors with in-situ measurements using stationary simulators and for reporting both median and lower-tail neighborhood statistics. The approach could support more reliable deployment of head-tracked PSZ systems in consumer or automotive settings where localization jitter is common.

major comments (2)
  1. [§4] Training and regularization: The neighbor-consistency loss penalizes filter differences at randomly sampled coordinates within a perturbation radius, but the manuscript provides no quantitative comparison between the distribution of these random perturbations and the actual statistics (bias, variance, directional correlation) of localization errors measured from the optical tracking system. The central robustness claim therefore rests on an unverified assumption that random sampling reproduces real-world error characteristics.
  2. [§5.2] Decoupled evaluation protocol: While the protocol correctly isolates coordinate-to-filter sensitivity by fixing acoustic transfer functions (ATFs) at physical anchors, it does not include a sensitivity analysis or ablation on perturbation radius or sampling strategy. If real localization errors exhibit larger excursions or structured biases (e.g., from occlusions) than the training distribution, the reported reductions in RMS variation rate (up to 55.9% woofer, 30.3% tweeter) and the improvement in worst-case neighborhood isolation (up to 16.9%) may not generalize.
minor comments (2)
  1. [Abstract / §5.1] The abstract and §5.1 refer to '25 randomly sampled anchor positions' and '24-driver array' without specifying the exact coordinate ranges or array geometry; adding a brief table or figure reference would improve reproducibility.
  2. [§3 / §5] Notation for the regularization strength and perturbation radius is introduced but not consistently labeled across equations and experimental tables; a single symbol table would aid clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We provide point-by-point responses to the major comments below. We will make revisions to address the concerns where feasible, strengthening the presentation of our methods and results.

Point-by-point responses
  1. Referee: [§4] Training and regularization: The neighbor-consistency loss penalizes filter differences at randomly sampled coordinates within a perturbation radius, but the manuscript provides no quantitative comparison between the distribution of these random perturbations and the actual statistics (bias, variance, directional correlation) of localization errors measured from the optical tracking system. The central robustness claim therefore rests on an unverified assumption that random sampling reproduces real-world error characteristics.

    Authors: We acknowledge the value of a direct comparison between the training perturbation distribution and empirical localization error statistics from the optical tracking system. In this study, the random perturbations were chosen to model small-scale uncertainties commonly encountered in head-tracking applications, such as jitter and minor distortions, without introducing specific biases. The decoupled evaluation uses the same perturbation model to assess robustness. While we did not perform a quantitative match to measured error distributions in the current work, we will revise §4 to provide a more explicit rationale for the uniform random sampling approach and discuss its relation to typical tracking errors, thereby clarifying the assumptions underlying the robustness claims. revision: partial

  2. Referee: [§5.2] Decoupled evaluation protocol: While the protocol correctly isolates coordinate-to-filter sensitivity by fixing acoustic transfer functions (ATFs) at physical anchors, it does not include a sensitivity analysis or ablation on perturbation radius or sampling strategy. If real localization errors exhibit larger excursions or structured biases (e.g., from occlusions) than the training distribution, the reported reductions in RMS variation rate (up to 55.9% woofer, 30.3% tweeter) and the improvement in worst-case neighborhood isolation (up to 16.9%) may not generalize.

    Authors: We agree that a sensitivity analysis regarding the perturbation radius and sampling strategy would be beneficial for assessing the generalizability of our results. The radius was selected to reflect realistic levels of localization uncertainty in our experimental setup, and uniform sampling was used to avoid directional assumptions. The improvements in RMS variation rates and isolation metrics were observed consistently across the tested conditions. In the revised manuscript, we will incorporate an ablation study or additional analysis on varying perturbation radii to demonstrate the sensitivity and support the reported performance gains. revision: yes

Circularity Check

0 steps flagged

No significant circularity; regularization and metrics are independent

Full rationale

The paper defines neighbor-consistency as an explicit regularization term added to the training loss that penalizes filter differences at randomly sampled neighboring coordinates. The claimed reductions in RMS variation rate (up to 55.9% woofer, 30.3% tweeter) and the improvement in worst-case neighborhood isolation (up to 16.9%) are obtained from a separate decoupled evaluation protocol that holds acoustic transfer functions fixed at physical anchors while perturbing only the coordinate inputs, then computes neighborhood median and lower-tail statistics and spatial variation rates on held-out positions. These evaluation quantities are not algebraically or statistically identical to the training penalty; the method could have produced no improvement, or a degradation. No self-citations, uniqueness theorems, or fitted parameters renamed as predictions appear in the derivation. The chain is therefore a self-contained empirical regularization followed by an independent measurement.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard assumptions in acoustic array processing and neural network training. No new physical entities are postulated. The regularization weight and perturbation distribution are implicit free parameters whose specific values are not detailed in the abstract.

free parameters (2)
  • regularization strength
    The weight balancing the neighbor-consistency penalty against the primary isolation objective is a tunable hyperparameter whose value affects the reported trade-off between stability and isolation quality.
  • perturbation radius
    The spatial scale of random coordinate perturbations used during training is chosen to match expected tracking noise and directly influences the learned robustness.
axioms (2)
  • domain assumption Acoustic transfer functions remain fixed when only coordinate inputs are perturbed in the decoupled evaluation protocol.
    This separation is invoked to isolate the effect of localization uncertainty from changes in the physical sound field.
  • domain assumption Neighborhood median and lower-tail statistics of isolation metrics, computed under random coordinate perturbations, are representative of behavior under real-world tracking errors.
    The paper uses these statistics to quantify robustness; their validity depends on the assumption that random perturbations adequately model actual sensor noise.
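How the two free parameters listed above interact can be seen in a one-dimensional toy problem that has a closed-form answer. This is an illustrative sketch and not the paper's model: it assumes a linear map f(x) = w·x, a task loss (w − w0)², and perturbations drawn uniformly from [−r, r], so the expected consistency penalty is w²·r²/3 and the minimizer is w = w0 / (1 + λ·r²/3). The names `regularized_slope` and `objective` are hypothetical.

```python
def regularized_slope(w0, lam, radius):
    """Minimizer of (w - w0)^2 + lam * w^2 * E[delta^2] for delta ~ Uniform[-radius, radius]."""
    second_moment = radius ** 2 / 3.0   # E[delta^2] for a uniform perturbation
    return w0 / (1.0 + lam * second_moment)


def objective(w, w0, lam, radius):
    """Task loss plus weighted expected consistency penalty for the toy linear map."""
    return (w - w0) ** 2 + lam * w ** 2 * radius ** 2 / 3.0


w0 = 2.0        # slope the task loss alone would prefer (hypothetical)
radius = 0.5    # perturbation radius (hypothetical units)
for lam in (0.0, 1.0, 10.0, 100.0):
    w = regularized_slope(w0, lam, radius)
    print(lam, w, objective(w, w0, lam, radius))
```

In this toy setting a larger weight or a larger radius both flatten the map, at the cost of moving away from the slope the task loss prefers. That is the same stability-versus-quality trade-off the ledger attributes to these parameters.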


