Protecting Bystander Privacy via Selective Hearing in Audio LLMs
Pith reviewed 2026-05-17 01:32 UTC · model grok-4.3
The pith
Audio LLMs can be trained via Bystander Privacy Fine-Tuning to refuse processing incidental bystander speech while preserving main-speaker comprehension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Audio LLMs exhibit substantial bystander privacy leakage even when they perform well on audio understanding tasks overall, yet Bystander Privacy Fine-Tuning offers a training pipeline that teaches models to refuse bystander-related queries, producing an absolute 47 percent higher bystander accuracy under selective mode and an absolute 16 percent higher Selective Efficacy score than Gemini 2.5 Pro, the strongest audio LLM without this fine-tuning.
What carries the argument
Bystander Privacy Fine-Tuning (BPFT), a training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension.
If this is right
- Audio LLMs trained with BPFT achieve substantially higher accuracy at identifying and protecting information about bystander speech when operating in selective mode.
- Selective Efficacy scores rise by 16 percentage points over the best audio LLM without BPFT.
- Main-speaker comprehension and task performance remain intact after the fine-tuning process.
- SH-Bench supplies a standardized testbed for comparing selective hearing and privacy protection across different audio LLMs.
Where Pith is reading between the lines
- If adopted in consumer devices, BPFT-style training could lower the chance that always-on audio assistants inadvertently store or disclose nearby private conversations.
- Similar selective refusal mechanisms might extend to other unintended inputs such as background noise or private visual contexts in multimodal systems.
- Applying the approach across additional languages and acoustic settings could expose further limitations in current model behaviors.
Load-bearing premise
That the 3,968 mixtures and 77k questions in SH-Bench adequately represent the distribution of real-world bystander speech and query types that would arise in deployed audio LLMs.
What would settle it
Testing BPFT-tuned models on fresh, unscripted real-world recordings containing bystanders and measuring whether they still refuse to answer questions that would reveal bystander information.
Figures
read the original abstract
Audio Large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SH-Bench, a benchmark with 3,968 multi-speaker audio mixtures (real and synthetic) and 77k multiple-choice questions to measure selective hearing in audio LLMs: the ability to comprehend a main speaker while refusing to process or reveal bystander speech. It defines a new Selective Efficacy (SE) metric that combines comprehension and privacy protection, evaluates multiple open and proprietary models showing substantial bystander leakage, and proposes Bystander Privacy Fine-Tuning (BPFT) that yields a 47% absolute gain in bystander accuracy under selective mode and 16% higher SE relative to Gemini 2.5 Pro.
Significance. If the benchmark distributions prove representative, the work supplies the first systematic empirical framework for quantifying and mitigating bystander privacy risks in deployed audio LLMs. The multi-model evaluation, the joint SE metric, and the demonstration that fine-tuning can improve privacy without harming main-speaker performance are concrete contributions that could guide future system design.
major comments (2)
- [§3] §3 (SH-Bench construction): The benchmark is assembled from 3,968 real+synthetic mixtures and 77k questions, yet the manuscript provides no quantitative validation (e.g., distributional statistics or human-subject studies) that the acoustic conditions, accent coverage, overlap patterns, or query phrasings match the variability expected in actual deployments. Because the headline 47% and 16% gains rest on this representativeness assumption, the claim that BPFT delivers deployable privacy protection is not yet load-bearing.
- [§5] §5 (Evaluation and BPFT results): The reported deltas are presented as fixed outcomes of BPFT, but the text does not state whether question templates, data splits, or selective-mode prompts were locked before any model runs or were iterated after observing preliminary numbers. Without this protocol, the numerical superiority over Gemini 2.5 Pro cannot be confidently interpreted as a general property of the method rather than benchmark-specific tuning.
minor comments (2)
- [§2] The SE metric is introduced in the abstract and results but would benefit from an explicit equation (e.g., SE = f(main accuracy, bystander refusal rate)) placed in §2 or §4 so readers can reproduce the exact 16% comparison.
- [Results figures] Table captions and axis labels in the results figures should explicitly note the number of runs or confidence intervals; the current presentation makes it hard to judge whether the 47% gap is statistically stable.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the presentation of SH-Bench and the evaluation protocol. We address each major comment below and have revised the manuscript to improve transparency and rigor.
read point-by-point responses
-
Referee: [§3] §3 (SH-Bench construction): The benchmark is assembled from 3,968 real+synthetic mixtures and 77k questions, yet the manuscript provides no quantitative validation (e.g., distributional statistics or human-subject studies) that the acoustic conditions, accent coverage, overlap patterns, or query phrasings match the variability expected in actual deployments. Because the headline 47% and 16% gains rest on this representativeness assumption, the claim that BPFT delivers deployable privacy protection is not yet load-bearing.
Authors: We acknowledge the value of explicit validation for representativeness. The revised manuscript expands §3 with quantitative statistics on acoustic conditions (SNR and reverberation distributions), overlap patterns, speaker accent coverage drawn from the real recordings, and query complexity metrics. Real-world mixtures were drawn from diverse public sources and controlled recordings to approximate deployment variability. We have not added new human-subject studies, as these would require substantial additional resources beyond the current scope; instead, we added a limitations paragraph discussing the representativeness assumption and its implications for generalizing the reported gains. revision: partial
-
Referee: [§5] §5 (Evaluation and BPFT results): The reported deltas are presented as fixed outcomes of BPFT, but the text does not state whether question templates, data splits, or selective-mode prompts were locked before any model runs or were iterated after observing preliminary numbers. Without this protocol, the numerical superiority over Gemini 2.5 Pro cannot be confidently interpreted as a general property of the method rather than benchmark-specific tuning.
Authors: All question templates, data splits, and selective-mode prompts were designed and locked prior to any model evaluations, with no post-hoc iterations after observing preliminary results. The revised §5 now includes an explicit 'Experimental Protocol' subsection that documents this fixed setup, the rationale for the chosen splits, and confirmation that prompts were not tuned based on observed performance. This addition allows readers to interpret the 47% and 16% improvements as properties of BPFT rather than benchmark-specific adjustments. revision: yes
Circularity Check
No circularity: empirical benchmark evaluation and fine-tuning results are self-contained
full rationale
The paper's core contributions are the construction of SH-Bench (3,968 mixtures, 77k questions) and the BPFT training pipeline, followed by direct empirical measurements of bystander accuracy and the proposed SE metric on that benchmark. These results are reported as observed performance deltas (e.g., +47% bystander accuracy, +16% SE vs. Gemini 2.5 Pro) rather than any derivation, equation, or prediction that reduces to fitted parameters or self-citations by construction. No load-bearing steps invoke uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results; the evaluation chain is independent of the paper's own inputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose Bystander Privacy Fine-Tuning (BPFT) ... absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
SH-Bench contains 3,968 multi-speaker audio mixtures ... 77k multiple-choice questions
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A benchmark for multi-speaker anonymiza- tion.IEEE Transactions on Information Forensics and Security. Andreas Nautsch, Catherine Jasserand, Els Kindt, Mas- similiano Todisco, Isabel Trancoso, and Nicholas Evans. 2019. The gdpr & speech data: Reflections of legal and technology communities, first steps to- wards a common understanding.arXiv preprint arXiv...
-
[2]
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Mmau: A massive multi-task audio under- standing and reasoning benchmark.arXiv preprint arXiv:2410.19168. Eimaan Saqib, Shijing He, Junghyun Choy, Ruba Abu- Salma, Jose Such, Julia Bernd, and Mobin Javed. 2025a. Bystander privacy in smart homes: A system- atic review of concerns and solutions.ACM Transac- tions on Computer-Human Interaction. Eimaan Saqib,...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Audio monitoring in smart cities: an informa- tion privacy perspective.IADIS International Associ- ation for Development of the Information Society. Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. 2021. “Atten- tion Is All You Need” in Speech Separation: the SepFormer. InProc. ICASSP. Angela Sun. 2025. Gemini live: A more ...
work page internal anchor Pith review arXiv 2021
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.