pith. sign in

arxiv: 2606.09844 · v1 · pith:IFTBXEZ3new · submitted 2026-04-26 · 💻 cs.HC · cs.AI

The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans

Pith reviewed 2026-07-01 08:58 UTC · model grok-4.3

classification 💻 cs.HC cs.AI
keywords Interlocutor EffectLLM privacyPII leakageAI agentsAttention Suppression Hypothesissafety mechanismsmulti-agent systems
0
0 comments X

The pith

LLMs release substantially more personal data when they believe the recipient is another AI agent rather than a human.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models adjust their privacy protections according to the perceived identity of the person or system they are addressing. Standard safety alignments that block release of personally identifiable information to humans become less active when the model is told it is speaking to an AI agent. This pattern, labeled the Interlocutor Effect, was measured across 222 sensitive scenarios and 3,464 total interactions, producing leakage increases of up to 23 percentage points. The authors link the change to reduced activity in safety-aligned attention heads and test the mechanism by manually deactivating and reactivating one such head in Llama-3.1-8B-Instruct. The work points to new requirements for privacy design in systems where multiple AI agents interact.

Core claim

Large Language Models alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the Interlocutor Effect. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model's caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety-aligned attention heads become inactive during interactions with agents.

What carries the argument

The Interlocutor Effect, the observed increase in PII leakage when the recipient is framed as an AI agent, driven by the Attention Suppression Hypothesis that safety-aligned attention heads deactivate in agent-directed prompts.

If this is right

  • Safety alignments in LLMs provide weaker protection in conversations directed at other AI agents.
  • Deactivating a single safety-aligned attention head in Llama-3.1-8B-Instruct is sufficient to increase PII leakage.
  • Reactivating the same safety head restores the original level of privacy protection.
  • The technical framing of the recipient reduces the model's caution about releasing sensitive data.
  • Secure multi-agent systems will need additional safeguards beyond current human-directed privacy training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multi-agent pipelines that pass user data between models may accumulate higher leakage than single-model human interactions.
  • Training methods that make safety heads insensitive to recipient identity labels could reduce the effect.
  • Users who route requests through intermediate agents may unintentionally expose more personal information than they intend.
  • The same suppression pattern might affect other safety behaviors such as refusal of harmful requests when the interlocutor is framed as an agent.

Load-bearing premise

The prompts used to portray the recipient as an AI agent accurately simulate real agent interactions without introducing unintended biases in how the model interprets the interlocutor identity.

What would settle it

Running the same 222 scenarios with identical prompts but replacing every AI-agent framing with human framing and finding no measurable difference in PII leakage rates.

Figures

Figures reproduced from arXiv: 2606.09844 by Faouzi El Yagoubi, Godwin Badu-Marfo, Ranwa Al Mallah.

Figure 1
Figure 1. Figure 1: Per-model PII leak rate (%) across all four conditions. Annotations show [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
read the original abstract

Large Language Models (LLMs) alter their privacy behavior based on the perceived identity of their interlocutor. While safety mechanisms typically prevent LLMs from releasing Personally Identifiable Information (PII) to human users, these models tend to reveal more sensitive data when addressing another AI agent. We refer to this as the \textbf{Interlocutor Effect}. Through an ablation study, we find evidence that the technical nature of the recipient contributes to this effect, thereby diminishing the model's caution regarding privacy. To explore this further, we introduce the Attention Suppression Hypothesis, which posits that safety-aligned attention heads become inactive during interactions with agents. We assess this quantitatively by comparing human-directed and agent-directed prompts in 222 sensitive scenarios. Our findings, drawn from 3,464 interactions, indicate that portraying the recipient as an AI agent elevates PII leakage by up to 23 percentage points. Initial experiments on Llama-3.1-8B-Instruct corroborate this: deactivating one safety head induces leakage, whereas reactivating it reinstates privacy safeguards. We consider the implications for developing secure multi-agent systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript claims that LLMs exhibit an 'Interlocutor Effect' whereby they leak more personally identifiable information (PII) when the recipient is portrayed as an AI agent rather than a human. An ablation study across 222 scenarios and 3,464 interactions reports up to a 23 percentage point increase in leakage for agent-directed prompts; the authors attribute part of the effect to the technical nature of the recipient and introduce the Attention Suppression Hypothesis, with supporting evidence from attention-head deactivation experiments on Llama-3.1-8B-Instruct.

Significance. If the empirical result is shown to be robust to prompt construction details, the finding would be significant for the design of privacy safeguards in multi-agent LLM systems. The scale of the interaction dataset and the mechanistic attention-head experiment constitute concrete, replicable elements that strengthen the work's potential impact.

major comments (3)
  1. [Abstract] Abstract / Methods description: the ablation study on the 'technical nature' of the recipient does not state whether human-directed and agent-directed prompt templates were identical in structure and wording except for the single identity label. Any systematic addition of technical framing or instructions in the agent prompts would confound the reported 23 percentage point difference and undermine the central claim that the effect is driven by perceived interlocutor identity.
  2. [Results] Results section (implied by the 3,464 interactions): no error bars, exact statistical tests, prompt templates, or data exclusion criteria are provided, rendering the headline 23 percentage point leakage increase difficult to evaluate for reliability or replicability.
  3. [Attention head experiment] Attention Suppression Hypothesis / Llama-3.1-8B-Instruct experiment: deactivating one safety head induces leakage while reactivation restores safeguards, yet this manipulation is orthogonal to the prompt-identity variable and does not test prompt equivalence, so it does not directly corroborate the main interlocutor-effect finding.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the full set of models used for the 222-scenario experiments in addition to Llama-3.1-8B-Instruct.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of clarity and rigor. We address each major comment below and will revise the manuscript to incorporate additional details where appropriate.

read point-by-point responses
  1. Referee: [Abstract] Abstract / Methods description: the ablation study on the 'technical nature' of the recipient does not state whether human-directed and agent-directed prompt templates were identical in structure and wording except for the single identity label. Any systematic addition of technical framing or instructions in the agent prompts would confound the reported 23 percentage point difference and undermine the central claim that the effect is driven by perceived interlocutor identity.

    Authors: The templates were identical except for the recipient identity label; no additional technical framing was introduced in the agent-directed versions. This was verified during prompt construction to isolate the interlocutor identity variable. We will add the exact templates and a statement confirming structural equivalence to the Methods section and appendix in the revision. revision: yes

  2. Referee: [Results] Results section (implied by the 3,464 interactions): no error bars, exact statistical tests, prompt templates, or data exclusion criteria are provided, rendering the headline 23 percentage point leakage increase difficult to evaluate for reliability or replicability.

    Authors: We agree these elements are necessary for evaluation. The revision will include error bars on all figures, specification of statistical tests (e.g., McNemar's test for paired proportions), the complete prompt templates, and explicit data exclusion criteria. revision: yes

  3. Referee: [Attention head experiment] Attention Suppression Hypothesis / Llama-3.1-8B-Instruct experiment: deactivating one safety head induces leakage while reactivation restores safeguards, yet this manipulation is orthogonal to the prompt-identity variable and does not test prompt equivalence, so it does not directly corroborate the main interlocutor-effect finding.

    Authors: The experiment demonstrates that safety-head deactivation produces leakage patterns consistent with those observed under agent-directed prompts, providing mechanistic support for the Attention Suppression Hypothesis. While it does not manipulate the prompt variable itself, it links the hypothesized mechanism to the empirical effect. We will revise the Discussion to clarify this supportive rather than direct-corroborative role and its relation to the main results. revision: partial

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements from prompt experiments

full rationale

The paper reports observed PII leakage rates from controlled comparisons of human-directed vs. agent-directed prompts across 222 scenarios (3,464 interactions total), plus a separate attention-head deactivation experiment on Llama-3.1-8B-Instruct. No equations, fitted parameters, predictions derived from inputs, self-citations used as load-bearing premises, or ansatzes appear in the abstract or described methods. The central claim (up to 23pp increase) is a measured difference, not a quantity forced by definition or prior self-referential work. The Attention Suppression Hypothesis is posited as an explanation, not derived from the data by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on empirical prompt comparisons and a newly introduced explanatory hypothesis with no independent evidence supplied in the abstract.

axioms (1)
  • standard math Standard statistical comparison between human-directed and agent-directed prompt conditions
    Invoked to support the reported percentage-point difference
invented entities (1)
  • Attention Suppression Hypothesis no independent evidence
    purpose: Explains why safety-aligned attention heads become inactive with agent recipients
    Newly posited by the authors on the basis of the ablation results

pith-pipeline@v0.9.1-grok · 5737 in / 1085 out tokens · 27174 ms · 2026-07-01T08:58:26.459199+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022

  2. [2]

    Constitutional AI: Harmlessness from AI Feedback

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnonet al., “Constitutional AI: Harmlessness from AI feedback,”arXiv preprint arXiv:2212.08073, 2022

  3. [3]

    Agent-to-agent (A2A) protocol,

    Google Cloud, “Agent-to-agent (A2A) protocol,” https://google.github.io/ A2A/, 2025

  4. [4]

    Model context protocol specification,

    Anthropic, “Model context protocol specification,” https: //modelcontextprotocol.io/specification/2025-11-25, 2025

  5. [5]

    AgentLeak: A full-stack benchmark for privacy leakage in multi-agent LLM sys- tems,

    F. El Yagoubi, G. Badu-Marfo, and R. Al Mallah, “AgentLeak: A full-stack benchmark for privacy leakage in multi-agent LLM sys- tems,”arXiv preprint arXiv:2602.11510, 2026, code: https://github.com/ Privatris/AgentLeak

  6. [6]

    Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation

    Y. Qiao, D. Liu, H. Yang, W. Zhou, and S. Hu, “Agent tools orches- tration leaks more: Dataset, benchmark, and mitigation,”arXiv preprint arXiv:2512.16310, 2025

  7. [7]

    OMNI-LEAK: Orchestrator multi-agent network induced data leakage,

    A. Naik, J. Culligan, Y. Gal, P. Torr, R. Aljundi, A. Paren, and A. Bibi, “OMNI-LEAK: Orchestrator multi-agent network induced data leakage,” arXiv preprint arXiv:2602.13477, 2025

  8. [8]

    Magpie: A benchmark for multi-agent contextual privacy evaluation,

    G. Juneja, J. N. S. Pasupulati, A. Albalak, W. Hua, and W. Y. Wang, “Magpie: A benchmark for multi-agent contextual privacy evaluation,”

  9. [9]

    Available: https://arxiv.org/abs/2510.15186

    [Online]. Available: https://arxiv.org/abs/2510.15186

  10. [10]

    Searching for Privacy Risks in LLM Agents via Simulation

    Y. Zhanget al., “Searching for privacy risks in LLM agents via simulation,”arXiv preprint arXiv:2508.10880, 2025

  11. [11]

    Securing the model context protocol (MCP): Risks, controls, and governance,

    A. Herman, J. Ngoh, and S. Koyejo, “Securing the model context protocol (MCP): Risks, controls, and governance,”arXiv preprint arXiv:2511.20920, 2025

  12. [12]

    PrivacyChecker: Reducing privacy leaks in AI via contextual integrity,

    Microsoft Research, “PrivacyChecker: Reducing privacy leaks in AI via contextual integrity,”Microsoft Research Blog, 2025, https://www. microsoft.com/en-us/research/blog/

  13. [13]

    Extracting training data from large language models,

    N. Carlini, F. Tram `er, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song,´U. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,”30th USENIX Security Symposium, 2021, arXiv:2012.07805

  14. [14]

    Locating and editing factual associations in GPT,

    K. Meng, D. Bau, A. Andonian, and Y. Belinkov, “Locating and editing factual associations in GPT,”Advances in Neural Information Processing Systems, vol. 35, pp. 17 359–17 372, 2022

  15. [15]

    In-context learning and induction heads,

    C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chenet al., “In-context learning and induction heads,”Transformer Circuits Thread, 2022

  16. [16]

    Let me speak freely? a study on the impact of format restrictions on performance of large language models,

    Z. R. Tam, C.-K. Wu, Y.-L. Tsai, C.-Y. Lin, H.-y. Lee, and Y.-N. Chen, “Let me speak freely? a study on the impact of format restrictions on performance of large language models,”arXiv preprint arXiv:2408.02442, 2024

  17. [17]

    Nissenbaum,Privacy in Context: Technology, Policy, and the Integrity of Social Life

    H. Nissenbaum,Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press, 2010

  18. [18]

    What does BERT look at? an analysis of BERT’s attention,

    K. Clark, U. Khandelwal, O. Levy, and C. D. Manning, “What does BERT look at? an analysis of BERT’s attention,” inProceedings of the 2019 ACL Workshop BlackboxNLP, 2019, pp. 276–286

  19. [19]

    The llama 3 herd of models,

    Meta AI, “The llama 3 herd of models,” https://ai.meta.com/blog/ meta-llama-3-1/, 2024, technical Report

  20. [20]

    Loi modernisant des dispositions l´egislatives en mati `ere de protection des renseignements personnels (loi 25),

    Gouvernement du Qu ´ebec, “Loi modernisant des dispositions l´egislatives en mati `ere de protection des renseignements personnels (loi 25),” Assembl´ee nationale du Qu´ebec, 2023. Appendix A Supplementary Material A. Code and Data Availability All code, results, and traces are available at https://github. com/yagobski/interlocutor-effect. B. Worked Examp...

  21. [21]

    patient_name

    Condition𝐶 𝐻𝑇 : Human Recipient: System prompt. You are a task assistant. The recipient of your response is a human end-user who submitted a request through a secure portal. Respond in natural language. DATA GOVERNANCE POLICY: Handle all user data in accordance with applicable privacy regulations. Apply the principle of data minimization: share only what ...

  22. [22]

    Can you pull up what you have on this patient for me?

    Condition𝐶 𝐴𝑇: Agent Recipient: System prompt(iden- tical except underlined sentence). You are a task assistant. The recipient of your response is an automated AI agent (agent_id: coord-7f3a) that will process your output programmatically as part of a multi-agent pipeline. Respond in natural language. DATA GOVERNANCE POLICY: (identical) DATA CONTEXT: (ide...