arxiv: 2606.00152 · v1 · pith:6LDOO2ZInew · submitted 2026-05-29 · 💻 cs.CR · cs.AI

PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say

Pith reviewed 2026-06-28 22:25 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agents · privacy leakage · acquisition stage · benchmark · tool use · sensitive information · probe elicitation

The pith

LLM-based agents routinely acquire more sensitive information than their tasks require, and current checks miss this stage entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PrivacyPeek to audit the acquisition stage of LLM agents, where data first enters their context before any response or action. It builds 1,182 test cases spanning seven acquisition behaviors and sixteen domains, then uses Acquisition Inspection on tool trajectories and Probe Elicitation with follow-up queries to measure leakage. Experiments across ten agents from four model families find unnecessary acquisition is common and correlates with stronger task performance. Prompt defenses block only a small share of the leakage. The work argues that auditing at acquisition time is now necessary because over-acquired data remains one careless step from disclosure.

Core claim

PrivacyPeek evaluates acquisition-stage privacy leakage by inspecting the sequence of tools an agent calls and the data it receives, then issuing probes to test how easily an attacker can extract information the agent acquired but has not yet disclosed. Across the constructed cases, the benchmark shows that unnecessary acquisition of sensitive information occurs widely in current agents and that prompt-level mitigations leave most of it unaddressed.
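
As a rough illustration of the Acquisition Inspection step, the sketch below runs a containment check over tool observations and aggregates it into the Content Exposure Rate (CER) defined in the Figure 3 excerpt. The `ToolCall` and `Case` types, their field names, and the sample data are hypothetical and are not taken from the PrivacyPeek code. Substring matching stands in for whatever containment test the authors actually use, and the trajectory is assumed to hold only tool-call steps that precede the final response.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One trajectory step: the tool invoked and the payload it returned."""
    tool: str
    observation: str


@dataclass
class Case:
    """Hypothetical benchmark case.

    `out_of_scope` lists sensitive strings the task does not require.
    """
    case_id: str
    trajectory: list[ToolCall] = field(default_factory=list)
    out_of_scope: list[str] = field(default_factory=list)


def content_exposed(case: Case) -> bool:
    """True if any out-of-scope string appears in any tool observation."""
    return any(
        secret in step.observation
        for step in case.trajectory
        for secret in case.out_of_scope
    )


def content_exposure_rate(cases: list[Case]) -> float:
    """Fraction of cases with at least one exposed out-of-scope item."""
    if not cases:
        raise ValueError("need at least one case")
    return sum(content_exposed(c) for c in cases) / len(cases)


# Hypothetical data: the first agent reads a whole record, the second only
# the field it needs.
cases = [
    Case(
        case_id="healthcare-001",
        trajectory=[ToolCall("read_record", "name=A. Doe; diagnosis=asthma")],
        out_of_scope=["diagnosis=asthma"],
    ),
    Case(
        case_id="healthcare-002",
        trajectory=[ToolCall("read_field", "name=A. Doe")],
        out_of_scope=["diagnosis=asthma"],
    ),
]

print(content_exposure_rate(cases))
```

The check is deliberately independent of what the agent later says: a case counts as exposed as soon as the sensitive payload enters the context, which is the distinction the benchmark draws between acquisition and disclosure.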

What carries the argument

PrivacyPeek benchmark, which combines Acquisition Inspection of tool-call trajectories with Probe Elicitation to detect sensitive data acquired beyond task scope.

If this is right

  • Stronger task-completion ability in agents will tend to increase acquisition-stage leakage unless new controls are added.
  • Auditing only final responses or actions leaves the majority of privacy exposure unmeasured.
  • Prompt-based defenses alone cannot reliably prevent acquisition-stage leakage.
  • Benchmarks must now track the full trajectory of data entering an agent's context, not just its outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents that acquire extra data create a persistent window of vulnerability that exists even if the current task completes safely.
  • If acquisition leakage scales with capability, future more powerful agents may require architectural changes rather than prompt fixes.
  • The correlation finding suggests that capability benchmarks and privacy benchmarks should be run jointly on the same agent versions.

Load-bearing premise

The 1,182 constructed cases and the probe method accurately reflect the privacy risks that would appear in real deployed agents.

What would settle it

Run the same 1,182 cases on agents operating in live production environments with real user data and measure whether acquisition rates match the benchmark results.

Figures

Figures reproduced from arXiv: 2606.00152 by Dadi Guo, Dongrui Liu, Guanchu Wang, Jiahui Han, Mingxuan Zhang, Na Zou, Songze Li, Xia Hu.

Figure 1. An LLM-based agent over-acquiring sensitive …
Figure 2. Overview of PrivacyPeek. Left: seven acquisition behaviours we study, with one application domain shown per behaviour. Middle: human-in-the-loop case-generation pipeline producing 1,182 cases across 7 behaviours and 16 domains. Right: dual evaluation, where Acquisition Inspection audits the tool-call trajectory and Probe Elicitation issues a post-task probe to detect out-of-scope acquisition.
Figure 3. Topic distribution of PrivacyPeek cases across the sixteen application domains.
… acquisition. Because $o_t$ is a structured payload returned verbatim from $F_c$, the leakage decision reduces to a containment check of any element $s \in k_c$ against $o_t$. We define the per-case indicator and corpus-level Content Exposure Rate (CER) as
$$\mathrm{CER}_c = \mathbb{1}\left[\,\exists\, t < n,\ s \in k_c \ \text{s.t.}\ s \subseteq o_t\,\right], \qquad \mathrm{CER} = \frac{1}{N}\sum_{c=1}^{N} \mathrm{CER}_c, \tag{2}$$
where $N$ is the total nu…
Figure 4. Capability–privacy paradox. Each point is …
Figure 5. Decomposition of the PLR reduction from Claude-Sonnet-4 to Claude-Sonnet-4.5. The reduction splits into a part attributable to reduced task completion and a genuine mechanism shift. The task-completion part equals ∆TCR scaled by Claude-Sonnet-4’s ratio of HPLR to TCR.
… by at least 3%. Llama-3.2-3B-Instruct (−7.87%) and Llama-3.1-8B-Instruct (−6.01%) do so by refusing more often, while GPT-5.1 (−7.48%) reco…
Figure 7. The system prompt given to the agent, shown …
Figure 8. One representative case for each of the seven acquisition behaviours of …
Figure 9. The sixteen application domains of PrivacyPeek and three representative categories of sensitive information for each. The inner ring is sized by the number of cases in the domain.
C Case Study. This appendix walks through one representative case for each of the seven acquisition behaviours. Every case is run with Claude-Sonnet-4, and each is drawn from a different application domain. Each study states the …
Figure 10. Case study for Task 1, Normal-Filename Access, in the finance domain.
Figure 11. Case study for Task 2, Sensitive-Filename Access, in the technology domain.
Figure 12. Case study for Task 3, Cross-Format Access, in the education domain.
Figure 13. Case study for Task 4, Out-of-Window Access, in the public-service domain.
Figure 14. Case study for Task 5, Excess-Field Access, in the healthcare domain.
Figure 15. Case study for Task 6, Forbidden-Content Access, in the research domain.
Figure 16. Case study for Task 7, Out-of-Scope Inferential Access, in the retail domain.
Figure 17. The prompt given to the probe judge, shown in abridged form. The judge applies the prompt unchanged …
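
The decomposition described in the Figure 5 caption can be sketched numerically. The function below is an illustration under stated assumptions, not the authors' code: ∆TCR is taken as the older model's TCR minus the newer model's, the mechanism-shift part is taken as the residual after subtracting the task-completion part from the total PLR reduction, and every input value is invented for the example rather than drawn from the paper.

```python
from typing import NamedTuple


class PlrDecomposition(NamedTuple):
    total_reduction: float   # PLR(old) - PLR(new)
    task_part: float         # part attributed to reduced task completion
    mechanism_part: float    # residual, read as a mechanism shift


def decompose_plr_reduction(
    plr_old: float,
    plr_new: float,
    tcr_old: float,
    tcr_new: float,
    hplr_old: float,
) -> PlrDecomposition:
    """Split a PLR reduction into task-completion and mechanism parts.

    Assumption: task_part = (TCR_old - TCR_new) * (HPLR_old / TCR_old),
    following the caption's wording; mechanism_part is the remainder.
    """
    if tcr_old <= 0:
        raise ValueError("tcr_old must be positive")
    total = plr_old - plr_new
    delta_tcr = tcr_old - tcr_new
    task_part = delta_tcr * (hplr_old / tcr_old)
    return PlrDecomposition(total, task_part, total - task_part)


# Invented rates, chosen only to make the arithmetic easy to follow.
result = decompose_plr_reduction(
    plr_old=0.40, plr_new=0.25, tcr_old=0.80, tcr_new=0.70, hplr_old=0.32
)
print(f"total={result.total_reduction:.3f} "
      f"task={result.task_part:.3f} "
      f"mechanism={result.mechanism_part:.3f}")
```

Read this way, a model that leaks less only because it completes fewer tasks would show a task part close to the total and a mechanism part near zero.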
Original abstract

LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often acquire more sensitive information than the task requires. Existing privacy benchmarks audit what the agent's response or outgoing actions disclose, but overlook the acquisition stage where data first enters the agent's context. The over-acquired information is then one careless action or one attack away from an outright leak. To assess its prevalence, we introduce PrivacyPeek, a benchmark for evaluating acquisition-stage privacy leakage of LLM-based agents, with 1,182 cases across 7 acquisition behaviours and 16 application domains. Specifically, Acquisition Inspection examines the agent's tool-call trajectory, both the tools it invokes and the data it receives, to detect when it acquires sensitive information beyond the task scope. Probe Elicitation then issues a follow-up probe and measures how readily an attacker could elicit sensitive information the agent acquired but did not disclose. Our experiments on 10 LLM-based agents across 4 model families show that the unnecessary acquisition of sensitive information is widespread. In addition, we observe a correlation between the task-completion capability and acquisition-stage leakage. Prompt-level defences reduce only a small fraction of acquisition-stage leakage, leaving the majority unmitigated. These results make auditing acquisition-stage privacy both urgent and necessary. Our dataset and code are available at https://github.com/Xuan269/PrivacyPeek-Resource.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PrivacyPeek, a benchmark with 1,182 author-constructed cases spanning 7 acquisition behaviours and 16 domains. It defines Acquisition Inspection (to detect over-acquisition via tool trajectories) and Probe Elicitation (to test elicitation of acquired sensitive data) and applies them to 10 LLM-based agents across 4 model families. The central claims are that unnecessary acquisition-stage leakage is widespread, correlates with task-completion capability, and is only marginally reduced by prompt-level defences.

Significance. If the benchmark cases prove representative, the work would usefully shift privacy auditing from disclosure-only to acquisition-stage analysis and supply an open dataset plus code for reproducibility. The empirical focus on multiple agents and defence evaluation is a constructive step beyond purely theoretical privacy arguments.

major comments (2)
  1. [Benchmark Construction / Experiments] The prevalence and correlation claims rest entirely on 1,182 synthetic cases; no section compares case distribution, tool-invocation patterns, or sensitive-attribute frequencies against production agent logs, user studies, or deployed telemetry. Without such grounding, both the 'widespread' statistic and the capability-leakage correlation risk being artifacts of the construction process rather than properties of real agents.
  2. [Experiments] The manuscript supplies no quantitative metrics (e.g., leakage percentages per behaviour or model), statistical tests, error bars, or description of how the 10 agents were selected, undermining the ability to assess the strength of the reported correlation and defence results.
minor comments (1)
  1. [Abstract] Key numerical results (leakage rates, correlation coefficients) are asserted without values; moving one or two headline figures into the abstract would improve immediate readability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We respond point-by-point to the major comments and indicate where revisions will be made.

  1. Referee: [Benchmark Construction / Experiments] The prevalence and correlation claims rest entirely on 1,182 synthetic cases; no section compares case distribution, tool-invocation patterns, or sensitive-attribute frequencies against production agent logs, user studies, or deployed telemetry. Without such grounding, both the 'widespread' statistic and the capability-leakage correlation risk being artifacts of the construction process rather than properties of real agents.

    Authors: We acknowledge that the benchmark relies on author-constructed cases and does not include direct empirical comparison against production logs or user studies. The cases were designed to systematically cover seven acquisition behaviours and sixteen domains drawn from common real-world agent use cases; however, we agree that explicit discussion of this design choice and its limitations is warranted. We will revise the manuscript to expand the benchmark-construction section with details on how cases were derived from domain-specific privacy scenarios, add a limitations paragraph noting the absence of production-telemetry validation, and qualify the 'widespread' claim as applying to the evaluated benchmark distribution rather than claiming universal prevalence. revision: partial

  2. Referee: [Experiments] The manuscript supplies no quantitative metrics (e.g., leakage percentages per behaviour or model), statistical tests, error bars, or description of how the 10 agents were selected, undermining the ability to assess the strength of the reported correlation and defence results.

    Authors: The full experimental section (Section 4) reports per-behaviour and per-model leakage rates together with the observed correlation; however, we accept that these figures, any statistical tests, error bars, and agent-selection criteria were not presented with sufficient prominence or tabular clarity. We will revise the manuscript to include (i) explicit tables of leakage percentages broken down by behaviour, model family, and defence condition, (ii) description of statistical methods used for the correlation analysis, (iii) error bars or variance measures from repeated runs where applicable, and (iv) a clear subsection detailing the selection of the ten agents (covering four model families) based on popularity, API availability, and architectural diversity. revision: yes
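
One way the requested correlation analysis could be reported is a rank correlation with a bootstrap interval over agents. The sketch below is an assumption about a reasonable method, not the procedure the paper uses: it computes Spearman's rank correlation with the standard library only and resamples agents with replacement for a percentile interval. The per-agent task-completion and leakage rates are invented placeholders, and the function names are hypothetical.

```python
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's rho over paired samples."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    while len(stats) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(spearman([x[i] for i in idx], [y[i] for i in idx]))
        except ValueError:
            continue  # resample was constant in one variable; draw again
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented per-agent rates for ten agents (placeholders, not paper results).
tcr = [0.35, 0.42, 0.50, 0.55, 0.61, 0.66, 0.72, 0.78, 0.83, 0.90]
plr = [0.12, 0.18, 0.15, 0.25, 0.22, 0.30, 0.28, 0.37, 0.33, 0.41]

rho = spearman(tcr, plr)
ci_low, ci_high = bootstrap_ci(tcr, plr)
print(f"rho={rho:.3f}, 95% bootstrap CI=({ci_low:.3f}, {ci_high:.3f})")
```

With only ten agents the interval will be wide, which is itself useful information for a reader judging how strong the capability-leakage relationship is.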

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

The paper introduces PrivacyPeek as an empirical benchmark consisting of 1,182 author-constructed cases across 7 behaviours and 16 domains. It applies Acquisition Inspection to tool-call trajectories and Probe Elicitation to measure leakage on 10 agents, reporting observed prevalence and correlations. No equations, derivations, fitted parameters, predictions, or self-citation load-bearing steps appear in the abstract or described methodology. All claims rest on direct experimental counts rather than any reduction to prior inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about how LLM agents operate with tools and introduces an empirical benchmark without fitted parameters or new postulated entities.

axioms (1)
  • domain assumption: LLM-based agents autonomously invoke external tools to complete multi-step tasks for users
    Stated in the opening of the abstract as the operational setting for the agents under study.

pith-pipeline@v0.9.1-grok · 5813 in / 1195 out tokens · 28584 ms · 2026-06-28T22:25:21.023955+00:00 · methodology

