arxiv: 2512.06380 · v3 · submitted 2025-12-06 · 💻 cs.SD · cs.AI

Protecting Bystander Privacy via Selective Hearing in Audio LLMs

Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland

Pith reviewed 2026-05-17 01:32 UTC · model grok-4.3

keywords audio LLMs · bystander privacy · selective hearing · fine-tuning · privacy protection · multi-speaker audio · benchmark · SH-Bench

The pith

Audio LLMs can be trained via Bystander Privacy Fine-Tuning to refuse processing incidental bystander speech while preserving main-speaker comprehension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that audio large language models capture and potentially reveal speech from unintended bystanders in real-world settings, creating privacy risks not addressed by prior work. It introduces SH-Bench, a benchmark of 3,968 multi-speaker mixtures and 77k multiple-choice questions, to measure selective hearing under general and selective operating modes, plus the Selective Efficacy metric, which combines multi-speaker comprehension with bystander-privacy protection. Evaluations of existing models show poor selective performance and privacy leakage, but Bystander Privacy Fine-Tuning delivers large gains in bystander refusal without harming intended-speaker understanding.

Core claim

Audio LLMs exhibit substantial bystander privacy leakage even when they perform well on audio understanding tasks overall. Bystander Privacy Fine-Tuning is a training pipeline that teaches models to refuse bystander-related queries; it yields bystander accuracy 47 percentage points higher under selective mode and a Selective Efficacy score 16 percentage points higher than Gemini 2.5 Pro, the strongest audio LLM without this fine-tuning.

What carries the argument

Bystander Privacy Fine-Tuning (BPFT), a training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension.

If this is right

Audio LLMs trained with BPFT refuse bystander-related questions far more reliably in selective mode, with bystander accuracy 47 percentage points above the best audio LLM without BPFT.
Selective Efficacy scores rise by 16 percentage points over the best audio LLM without BPFT.
Main-speaker comprehension and task performance remain intact after the fine-tuning process.
SH-Bench supplies a standardized testbed for comparing selective hearing and privacy protection across different audio LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If adopted in consumer devices, BPFT-style training could lower the chance that always-on audio assistants inadvertently store or disclose nearby private conversations.
Similar selective refusal mechanisms might extend to other unintended inputs such as background noise or private visual contexts in multimodal systems.
Applying the approach across additional languages and acoustic settings could expose further limitations in current model behaviors.

Load-bearing premise

That the 3,968 mixtures and 77k questions in SH-Bench adequately represent the distribution of real-world bystander speech and query types that would arise in deployed audio LLMs.

What would settle it

Testing BPFT-tuned models on fresh, unscripted real-world recordings containing bystanders and measuring whether they still refuse to answer questions that would reveal bystander information.

Figures

Figures reproduced from arXiv: 2512.06380 by Guangzhi Sun, Jose Such, Phil Woodland, Xiao Zhan.

**Figure 2.** An illustration of the pipeline used to construct the SH-Bench test set. The left section depicts the ...

**Figure 3.** Illustration of how accuracy is measured for the main speaker and the bystander in two modes. The main ...

**Figure 4.** Accuracies on bystander-related questions

Original abstract

Audio Large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model's ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.
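The two operating modes can be made concrete with a small scoring sketch. It is an illustration only and rests on an assumption that the abstract suggests but does not spell out: in selective mode, a bystander-related question counts as correct only when the model picks a refusal option, while every other case is scored against the content answer. The names `Item`, `REFUSAL`, `is_correct`, and `accuracy` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

REFUSAL = "REFUSE"  # hypothetical label for the refusal option


@dataclass(frozen=True)
class Item:
    speaker: str    # "main" or "bystander": whose speech the question targets
    gold: str       # option that is correct with respect to the audio content
    predicted: str  # option the model actually chose


def is_correct(item: Item, mode: str) -> bool:
    """Assumed rule: selective-mode bystander questions are correct only on refusal."""
    if mode == "selective" and item.speaker == "bystander":
        return item.predicted == REFUSAL
    return item.predicted == item.gold


def accuracy(items, mode: str, speaker: str) -> float:
    """Fraction of questions about `speaker` scored correct under `mode`."""
    subset = [it for it in items if it.speaker == speaker]
    if not subset:
        raise ValueError(f"no items for speaker {speaker!r}")
    return sum(is_correct(it, mode) for it in subset) / len(subset)


items = [
    Item("main", "A", "A"),
    Item("main", "B", "C"),
    Item("bystander", "D", "D"),      # reveals bystander content
    Item("bystander", "A", REFUSAL),  # refuses
]
print(accuracy(items, "general", "bystander"))
print(accuracy(items, "selective", "bystander"))
```

Under this assumed rule, the same model output can be right in one mode and wrong in the other, which is why a model with strong general audio understanding can still score poorly on bystander accuracy in selective mode.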

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives audio LLMs a benchmark and fine-tuning method to refuse bystander queries while keeping main-speaker performance, with clear gains on their data but real questions about whether the test set matches actual rooms.

The core contribution is SH-Bench, a new set of 3,968 multi-speaker mixtures and 77k questions that measures whether models can ignore incidental speech, plus BPFT, a fine-tuning approach that pushes models to refuse bystander-related questions. They also introduce Selective Efficacy as a combined score for comprehension and privacy protection. On their evaluations, BPFT lifts bystander accuracy by 47 points in selective mode and raises SE by 16 points over Gemini 2.5 Pro, the strongest baseline without the method. That is concrete and useful for anyone shipping voice models into shared spaces.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SH-Bench, a benchmark with 3,968 multi-speaker audio mixtures (real and synthetic) and 77k multiple-choice questions to measure selective hearing in audio LLMs: the ability to comprehend a main speaker while refusing to process or reveal bystander speech. It defines a new Selective Efficacy (SE) metric that combines comprehension and privacy protection, evaluates multiple open and proprietary models showing substantial bystander leakage, and proposes Bystander Privacy Fine-Tuning (BPFT) that yields a 47% absolute gain in bystander accuracy under selective mode and 16% higher SE relative to Gemini 2.5 Pro.

Significance. If the benchmark distributions prove representative, the work supplies the first systematic empirical framework for quantifying and mitigating bystander privacy risks in deployed audio LLMs. The multi-model evaluation, the joint SE metric, and the demonstration that fine-tuning can improve privacy without harming main-speaker performance are concrete contributions that could guide future system design.

major comments (2)

§3 (SH-Bench construction): The benchmark is assembled from 3,968 real+synthetic mixtures and 77k questions, yet the manuscript provides no quantitative validation (e.g., distributional statistics or human-subject studies) that the acoustic conditions, accent coverage, overlap patterns, or query phrasings match the variability expected in actual deployments. Because the headline 47% and 16% gains rest on this representativeness assumption, the claim that BPFT delivers deployable privacy protection is not yet load-bearing.
§5 (Evaluation and BPFT results): The reported deltas are presented as fixed outcomes of BPFT, but the text does not state whether question templates, data splits, or selective-mode prompts were locked before any model runs or were iterated after observing preliminary numbers. Without this protocol, the numerical superiority over Gemini 2.5 Pro cannot be confidently interpreted as a general property of the method rather than benchmark-specific tuning.

minor comments (2)

[§2] The SE metric is introduced in the abstract and results but would benefit from an explicit equation (e.g., SE = f(main accuracy, bystander refusal rate)) placed in §2 or §4 so readers can reproduce the exact 16% comparison.
[Results figures] Table captions and axis labels in the results figures should explicitly note the number of runs or confidence intervals; the current presentation makes it hard to judge whether the 47% gap is statistically stable.
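
The stability question raised in the last minor comment can be probed with a resampling check. The sketch below is one generic option, a paired percentile bootstrap over questions, and is not a procedure described in the paper. It assumes per-question correctness (1 or 0) is available for two systems on the same questions; the function name `bootstrap_gap_ci` and its parameters are hypothetical.

```python
import random


def bootstrap_gap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(a) - accuracy(b).

    correct_a and correct_b are equal-length lists of 0/1 flags, one pair
    per question, so each resample keeps the pairing intact.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two equal-length, non-empty lists")
    rng = random.Random(seed)
    n = len(correct_a)
    gaps = []
    for _ in range(n_resamples):
        # Resample question indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy data, not results from the paper.
tuned = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
baseline = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]
print(bootstrap_gap_ci(tuned, baseline))
```

An interval that stays well above zero would suggest the gap is not an artifact of which questions happened to be sampled, though it says nothing about run-to-run variation in training or decoding.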

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the presentation of SH-Bench and the evaluation protocol. We address each major comment below and have revised the manuscript to improve transparency and rigor.

Referee: §3 (SH-Bench construction): The benchmark is assembled from 3,968 real+synthetic mixtures and 77k questions, yet the manuscript provides no quantitative validation (e.g., distributional statistics or human-subject studies) that the acoustic conditions, accent coverage, overlap patterns, or query phrasings match the variability expected in actual deployments. Because the headline 47% and 16% gains rest on this representativeness assumption, the claim that BPFT delivers deployable privacy protection is not yet load-bearing.

Authors: We acknowledge the value of explicit validation for representativeness. The revised manuscript expands §3 with quantitative statistics on acoustic conditions (SNR and reverberation distributions), overlap patterns, speaker accent coverage drawn from the real recordings, and query complexity metrics. Real-world mixtures were drawn from diverse public sources and controlled recordings to approximate deployment variability. We have not added new human-subject studies, as these would require substantial additional resources beyond the current scope; instead, we added a limitations paragraph discussing the representativeness assumption and its implications for generalizing the reported gains.
Revision: partial

Referee: §5 (Evaluation and BPFT results): The reported deltas are presented as fixed outcomes of BPFT, but the text does not state whether question templates, data splits, or selective-mode prompts were locked before any model runs or were iterated after observing preliminary numbers. Without this protocol, the numerical superiority over Gemini 2.5 Pro cannot be confidently interpreted as a general property of the method rather than benchmark-specific tuning.

Authors: All question templates, data splits, and selective-mode prompts were designed and locked prior to any model evaluations, with no post-hoc iterations after observing preliminary results. The revised §5 now includes an explicit 'Experimental Protocol' subsection that documents this fixed setup, the rationale for the chosen splits, and confirmation that prompts were not tuned based on observed performance. This addition allows readers to interpret the 47% and 16% improvements as properties of BPFT rather than benchmark-specific adjustments.
Revision: yes
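
A claim that templates, splits, and prompts were locked before evaluation is easier to verify when a fingerprint of that setup is recorded up front. The sketch below shows one possible way to do this with a SHA-256 hash over a canonical JSON serialization. It is an illustration, not something the paper or the simulated rebuttal describes; `protocol_fingerprint` and the example field contents are hypothetical.

```python
import hashlib
import json


def protocol_fingerprint(templates, splits, prompts) -> str:
    """Return a SHA-256 hex digest of the evaluation setup.

    Serializing with sorted keys makes the digest independent of dict
    ordering, so only a real change to the content alters it.
    """
    payload = json.dumps(
        {"templates": templates, "splits": splits, "prompts": prompts},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical setup, for illustration only.
before = protocol_fingerprint(
    templates=["What did the main speaker say about {topic}?"],
    splits={"test": ["mix_0001", "mix_0002"], "train": ["mix_0003"]},
    prompts={"selective": "Only answer questions about the main speaker."},
)
print(before)
```

Publishing the digest before any model run, then recomputing it from the released files, would let a reader confirm that nothing in the setup changed after results were seen.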

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation and fine-tuning results are self-contained

The paper's core contributions are the construction of SH-Bench (3,968 mixtures, 77k questions) and the BPFT training pipeline, followed by direct empirical measurements of bystander accuracy and the proposed SE metric on that benchmark. These results are reported as observed performance deltas (e.g., +47% bystander accuracy, +16% SE vs. Gemini 2.5 Pro) rather than any derivation, equation, or prediction that reduces to fitted parameters or self-citations by construction. No load-bearing steps invoke uniqueness theorems, ansatzes smuggled via prior work, or renaming of known results; the evaluation chain is independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on the empirical claim that selective refusal can be taught without harming main-speaker comprehension; no new physical constants or mathematical axioms are introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose Bystander Privacy Fine-Tuning (BPFT) ... absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: SH-Bench contains 3,968 multi-speaker audio mixtures ... 77k multiple-choice questions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
