Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
read the original abstract
LLM-based agents increasingly use multiple external tools to complete complex tasks. We study Tools Orchestration Privacy Risk (TOP-R): an agent may combine individually non-sensitive tool returns and disclose an unintended sensitive conclusion. We formalize TOP-R with three conditions: conclusion sensitivity, single-source non-inferability, and compositional inferability. We introduce LRSE (Library-Grounded Reverse-Inference Seed Expansion), a four-library reverse-construction pipeline grounded in privacy norms, reasoning chains, tool schemas, and task scenarios, and use it to build TOP-Bench, a 1,000-instance benchmark. The benchmark evaluates final-response semantic disclosure under a controlled two-stage tool-use protocol. Across six LLM agents, task completion remains high, but the average leakage rate reaches 88.6 percent, yielding an H-score of only 20.4. Two prompt-only safeguards improve H-score by about 2.7 points on the main benchmark. We further propose TOP-Align, an SFT+DPO post-training method for safer task completion boundaries. On a separate post-training evaluation split, TOP-Align improves H-score by 16.2 points over the corresponding base model, compared with a 4.9-point average gain from prompt-only mitigation on the same split. These results show that TOP-R requires mitigation beyond prompting alone.
This paper has not been read by Pith yet.
Forward citations
Cited by 8 Pith papers
-
When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems
About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.
-
SUDP: Secret-Use Delegation Protocol for Agentic Systems
SUDP is a protocol allowing untrusted agents to cause bounded, secret-backed operations through fresh user grants redeemed by a custodian, preventing reusable secret exposure.
-
SUDP: Secret-Use Delegation Protocol for Agentic Systems
SUDP is a three-party protocol in which an agent proposes an operation, the user issues a fresh grant, and a custodian executes it, satisfying seven security properties for bounded secret use without reusable authorit...
-
The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans
LLMs leak up to 23 percentage points more PII to AI agents than humans, attributed to inactive safety attention heads in 3,464 tested interactions.
-
Policy-Invisible Violations in LLM-Based Agents
LLM agents commit policy-invisible violations when policy facts are hidden from their context; a graph-simulation enforcer reaches 93% accuracy vs 68.8% for content-only baselines on a new 600-trace benchmark.
-
Security Considerations for Multi-agent Systems
No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.
-
Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility
Symbolic guardrails enforce 74% of specified safety policies in agent benchmarks and boost safety without hurting utility.
-
Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill
This work categorizes seven risks of OpenClaw for non-technical users, provides plain-language mitigations, and supplies a companion Skill to automate security configurations.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.