PrivacyPeek: Auditing What LLM-Based Agents Acquire, Not Just What They Say
Pith reviewed 2026-06-28 22:25 UTC · model grok-4.3
The pith
LLM-based agents routinely acquire more sensitive information than their tasks require, and current checks miss this stage entirely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PrivacyPeek evaluates acquisition-stage privacy leakage by inspecting the sequence of tools an agent calls and the data it receives, then issuing probes to test how easily an attacker can extract information the agent acquired but has not yet disclosed. Across the constructed cases, the benchmark shows that unnecessary acquisition of sensitive information occurs widely in current agents and that prompt-level mitigations leave most of it unaddressed.
What carries the argument
PrivacyPeek benchmark, which combines Acquisition Inspection of tool-call trajectories with Probe Elicitation to detect sensitive data acquired beyond task scope.
If this is right
- Stronger task-completion ability in agents will tend to increase acquisition-stage leakage unless new controls are added.
- Auditing only final responses or actions leaves the majority of privacy exposure unmeasured.
- Prompt-based defenses alone cannot reliably prevent acquisition-stage leakage.
- Benchmarks must now track the full trajectory of data entering an agent's context, not just its outputs.
Where Pith is reading between the lines
- Agents that acquire extra data create a persistent window of vulnerability that exists even if the current task completes safely.
- If acquisition leakage scales with capability, future more powerful agents may require architectural changes rather than prompt fixes.
- The correlation finding suggests that capability benchmarks and privacy benchmarks should be run jointly on the same agent versions.
Load-bearing premise
The 1,182 constructed cases and the probe method accurately reflect the privacy risks that would appear in real deployed agents.
What would settle it
Run the same 1,182 cases on agents operating in live production environments with real user data and measure whether acquisition rates match the benchmark results.
Figures
read the original abstract
LLM-based agents are rapidly advancing, autonomously invoking external tools to complete multi-step tasks for users. However, agents often acquire more sensitive information than the task requires. Existing privacy benchmarks audit what the agent's response or outgoing actions disclose, but overlook the acquisition stage where data first enters the agent's context. The over-acquired information is then one careless action or one attack away from an outright leak. To assess its prevalence, we introduce \emph{PrivacyPeek}, a benchmark for evaluating acquisition-stage privacy leakage of LLM-based agents, with $1{,}182$ cases across $7$ acquisition behaviours and $16$ application domains. Specifically, \emph{Acquisition Inspection} examines the agent's tool-call trajectory, both the tools it invokes and the data it receives, to detect when it acquires sensitive information beyond the task scope. \emph{Probe Elicitation} then issues a follow-up probe and measures how readily an attacker could elicit sensitive information the agent acquired but did not disclose. Our experiments on 10 LLM-based agents across 4 model families show that the unnecessary acquisition of sensitive information is widespread. In addition, we observe a correlation between the task-completion capability and acquisition-stage leakage. Prompt-level defences reduce only a small fraction of acquisition-stage leakage, leaving the majority unmitigated. These results make auditing acquisition-stage privacy both urgent and necessary. Our dataset and code are available at https://github.com/Xuan269/PrivacyPeek-Resource.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PrivacyPeek, a benchmark with 1,182 author-constructed cases spanning 7 acquisition behaviours and 16 domains. It defines Acquisition Inspection (to detect over-acquisition via tool trajectories) and Probe Elicitation (to test elicitation of acquired sensitive data) and applies them to 10 LLM-based agents across 4 model families. The central claims are that unnecessary acquisition-stage leakage is widespread, correlates with task-completion capability, and is only marginally reduced by prompt-level defences.
Significance. If the benchmark cases prove representative, the work would usefully shift privacy auditing from disclosure-only to acquisition-stage analysis and supply an open dataset plus code for reproducibility. The empirical focus on multiple agents and defence evaluation is a constructive step beyond purely theoretical privacy arguments.
major comments (2)
- [Benchmark Construction / Experiments] Benchmark Construction / Experiments: The prevalence and correlation claims rest entirely on 1,182 synthetic cases; no section compares case distribution, tool-invocation patterns, or sensitive-attribute frequencies against production agent logs, user studies, or deployed telemetry. Without such grounding, both the 'widespread' statistic and the capability-leakage correlation risk being artifacts of the construction process rather than properties of real agents.
- [Experiments] Experiments: The manuscript supplies no quantitative metrics (e.g., leakage percentages per behaviour or model), statistical tests, error bars, or description of how the 10 agents were selected, undermining the ability to assess the strength of the reported correlation and defence results.
minor comments (1)
- [Abstract] Abstract: Key numerical results (leakage rates, correlation coefficients) are asserted without values; moving one or two headline figures into the abstract would improve immediate readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We respond point-by-point to the major comments and indicate where revisions will be made.
read point-by-point responses
-
Referee: [Benchmark Construction / Experiments] Benchmark Construction / Experiments: The prevalence and correlation claims rest entirely on 1,182 synthetic cases; no section compares case distribution, tool-invocation patterns, or sensitive-attribute frequencies against production agent logs, user studies, or deployed telemetry. Without such grounding, both the 'widespread' statistic and the capability-leakage correlation risk being artifacts of the construction process rather than properties of real agents.
Authors: We acknowledge that the benchmark relies on author-constructed cases and does not include direct empirical comparison against production logs or user studies. The cases were designed to systematically cover seven acquisition behaviours and sixteen domains drawn from common real-world agent use cases; however, we agree that explicit discussion of this design choice and its limitations is warranted. We will revise the manuscript to expand the benchmark-construction section with details on how cases were derived from domain-specific privacy scenarios, add a limitations paragraph noting the absence of production-telemetry validation, and qualify the 'widespread' claim as applying to the evaluated benchmark distribution rather than claiming universal prevalence. revision: partial
-
Referee: [Experiments] Experiments: The manuscript supplies no quantitative metrics (e.g., leakage percentages per behaviour or model), statistical tests, error bars, or description of how the 10 agents were selected, undermining the ability to assess the strength of the reported correlation and defence results.
Authors: The full experimental section (Section 4) reports per-behaviour and per-model leakage rates together with the observed correlation; however, we accept that these figures, any statistical tests, error bars, and agent-selection criteria were not presented with sufficient prominence or tabular clarity. We will revise the manuscript to include (i) explicit tables of leakage percentages broken down by behaviour, model family, and defence condition, (ii) description of statistical methods used for the correlation analysis, (iii) error bars or variance measures from repeated runs where applicable, and (iv) a clear subsection detailing the selection of the ten agents (covering four model families) based on popularity, API availability, and architectural diversity. revision: yes
Circularity Check
No circularity: empirical benchmark with direct measurements
full rationale
The paper introduces PrivacyPeek as an empirical benchmark consisting of 1,182 author-constructed cases across 7 behaviours and 16 domains. It applies Acquisition Inspection to tool-call trajectories and Probe Elicitation to measure leakage on 10 agents, reporting observed prevalence and correlations. No equations, derivations, fitted parameters, predictions, or self-citation load-bearing steps appear in the abstract or described methodology. All claims rest on direct experimental counts rather than any reduction to prior inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM-based agents autonomously invoke external tools to complete multi-step tasks for users
Reference graph
Works this paper leans on
-
[1]
Deep research agents: A systematic examination and roadmap, 2025 , author=
2025
-
[2]
Advances in Neural Information Processing Systems , volume=
Swe-agent: Agent-computer interfaces enable automated software engineering , author=. Advances in Neural Information Processing Systems , volume=
-
[3]
International Conference on Learning Representations , volume=
Swe-bench: Can language models resolve real-world github issues? , author=. International Conference on Learning Representations , volume=
-
[4]
International Conference on Learning Representations , volume=
Agentbench: Evaluating llms as agents , author=. International Conference on Learning Representations , volume=
-
[5]
International Conference on Learning Representations , volume=
Webarena: A realistic web environment for building autonomous agents , author=. International Conference on Learning Representations , volume=
-
[6]
ChemCrow: Augmenting large-language models with chemistry tools
Chemcrow: Augmenting large-language models with chemistry tools , author=. arXiv preprint arXiv:2304.05376 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Science China Information Sciences , volume=
The rise and potential of large language model based agents: A survey , author=. Science China Information Sciences , volume=. 2025 , publisher=
2025
-
[8]
Advances in neural information processing systems , volume=
Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
-
[9]
ReAct: Synergizing Reasoning and Acting in Language Models
React: Synergizing reasoning and acting in language models , author=. arXiv preprint arXiv:2210.03629 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Advances in Neural Information Processing Systems , volume=
Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents , author=. Advances in Neural Information Processing Systems , volume=
-
[11]
Advances in Neural Information Processing Systems , volume=
Privacylens: Evaluating privacy norm awareness of language models in action , author=. Advances in Neural Information Processing Systems , volume=
-
[12]
arXiv preprint arXiv:2602.11510 , year=
Agentleak: A full-stack benchmark for privacy leakage in multi-agent llm systems , author=. arXiv preprint arXiv:2602.11510 , year=
-
[13]
arXiv preprint arXiv:2603.07557 , year=
AgentRaft: Automated Detection of Data Over-Exposure in LLM Agents , author=. arXiv preprint arXiv:2603.07557 , year=
-
[14]
Advances in Neural Information Processing Systems , volume=
Agentdam: Privacy leakage evaluation for autonomous web agents , author=. Advances in Neural Information Processing Systems , volume=
-
[15]
arXiv preprint arXiv:2603.04902 , year=
AgentSCOPE: Evaluating Contextual Privacy Across Agentic Workflows , author=. arXiv preprint arXiv:2603.04902 , year=
-
[16]
Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
The llama 3 herd of models , author=. arXiv preprint arXiv:2407.21783 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Proceedings of the 29th symposium on operating systems principles , pages=
Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th symposium on operating systems principles , pages=
-
[19]
Advances in Neural Information Processing Systems , volume=
Agentpoison: Red-teaming llm agents via poisoning memory or knowledge bases , author=. Advances in Neural Information Processing Systems , volume=
-
[20]
2024 , month =
Llama 3.3 70. 2024 , month =
2024
-
[21]
2024 , month =
Llama 3.2: Revolutionizing edge. 2024 , month =
2024
-
[22]
Qwen3 technical report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Findings of the Association for Computational Linguistics: ACL 2025 , pages=
A survey of llm-based agents in medicine: How far are we from baymax? , author=. Findings of the Association for Computational Linguistics: ACL 2025 , pages=
2025
-
[24]
2025 , url =
Claude 4. 2025 , url =
2025
-
[25]
Privacy as contextual integrity , author=. Wash. L. Rev. , volume=. 2004 , publisher=
2004
-
[26]
International Conference on Learning Representations , volume=
Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents , author=. International Conference on Learning Representations , volume=
-
[27]
Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
R-judge: Benchmarking safety risk awareness for llm agents , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=
2024
-
[28]
International Conference on Learning Representations , volume=
Agentharm: A benchmark for measuring harmfulness of llm agents , author=. International Conference on Learning Representations , volume=
-
[29]
, author =
`smolagents`: a smol library to build great agentic systems. , author =
-
[31]
Advances in neural information processing systems , volume=
Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=
-
[32]
Gpt-4o system card , author=. arXiv preprint arXiv:2410.21276 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[33]
2016 , url =
Regulation (. 2016 , url =
2016
-
[34]
International Conference on Learning Representations , volume=
Can llms keep a secret? testing privacy implications of language models via contextual integrity theory , author=. International Conference on Learning Representations , volume=
-
[35]
Findings of the Association for Computational Linguistics: ACL 2024 , pages=
Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents , author=. Findings of the Association for Computational Linguistics: ACL 2024 , pages=
2024
-
[36]
arXiv preprint arXiv:2601.08235 , year=
MPCI-Bench: A Benchmark for Multimodal Pairwise Contextual Integrity Evaluation of Language Model Agents , author=. arXiv preprint arXiv:2601.08235 , year=
-
[37]
arXiv preprint arXiv:2604.00209 , year=
Do LLMs Know What Is Private Internally? Probing and Steering Contextual Privacy Norms in Large Language Model Representations , author=. arXiv preprint arXiv:2604.00209 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.