Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents
Pith reviewed 2026-06-30 05:46 UTC · model grok-4.3
The pith
LLM agents in multi-party settings face an unavoidable trade-off between leaking principal information and over-refusing legitimate requests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PrincipalBench exposes a sharp split (<=20% vs. 53.6-75.3% harm) across 13 frontier models that single-turn safety tests miss. A prompt-time loyalty scaffold of seven prioritized rules holds selective models to <=20% harm. A per-token-KL distillation recipe transfers the prompted behavior from a 32B teacher into 8B students. Both interventions only move along a common leak/over-refusal trade-off rather than crossing it, so the jointly favorable outcome stays out of reach.
What carries the argument
The common leak/over-refusal trade-off that the loyalty scaffold prompt and per-token-KL distillation only traverse without escaping.
If this is right
- Selective models can be held to <=20% harm via a fixed system prompt of seven rules.
- KL distillation from a prompted teacher transfers the selective behavior to smaller open-weight models.
- Single-turn safety evaluations fail to detect the multi-party loyalty split.
- The jointly low-leak and low-over-refusal regime cannot be reached by either mechanism tested.
Where Pith is reading between the lines
- Architectural or training changes beyond prompting and distillation may be needed to escape the observed curve.
- Real-world negotiation or mediation agents will require explicit handling of this performance limit.
- New benchmarks should include tests that directly probe whether any method can break the trade-off.
Load-bearing premise
The 75-item PrincipalBench benchmark with its leak probes, dual judges, and integrity-audit gate gives a valid and generalizable measure of multi-party principal loyalty.
What would settle it
Demonstration of any prompt, training, or architectural change that simultaneously achieves low leak rates on adversarial probes and high compliance on the principal's legitimate requests would falsify the trade-off claim.
Figures
read the original abstract
A rapidly growing class of LLM agents is multi-party: the agent acts for a principal (who briefs it, sends follow-ups, and receives results) while also conversing in a separate channel with a counterparty whose interests may diverge (negotiating with a vendor, screening inbound requests, or mediating between employees). Here "help whoever you are talking to" is the wrong objective. The agent must stay loyal to the principal it represents without over-refusing the principal's own cooperative asks. We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson. PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate. Across 13 frontier subjects it exposes a sharp split (<=20% vs. 53.6-75.3% harm) invisible to single-turn safety evaluations: a selective cluster that declines adversarial probes while still following the principal's legitimate requests, and an over-refusing cluster that refuses broadly. (M1) A prompt-time loyalty scaffold (a fixed system prompt of seven prioritized rules, open-coded from 50+ failure trajectories) holds Claude-Sonnet to 19.4% harm and all nine selective subjects to <=20%. (M2) A per-token-KL distillation recipe transfers a prompted Qwen3-32B teacher into 8B Qwen3 and Llama-3.1 students, the strongest open-weight recipe we measure. (Lesson) Both mechanisms only move along a common leak/over-refusal trade-off rather than crossing it: improving one axis costs the other, and the jointly favorable outcome stays out of reach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the multi-party principal loyalty problem for LLM agents acting between a principal and a counterparty, contributes PrincipalBench (a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate), evaluates 13 frontier models revealing a selective vs. over-refusing split invisible to single-turn tests, proposes a seven-rule loyalty scaffold prompt (M1) and per-token KL distillation (M2), and concludes that both mechanisms only move along a common leak/over-refusal trade-off without crossing it.
Significance. If the benchmark holds, the work identifies a practically relevant alignment challenge for deployed agents and supplies concrete mechanisms plus a structural lesson on irreducible trade-offs that could guide safer multi-party agent design.
major comments (1)
- [PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for major revision. We address the concern about PrincipalBench construction below and will expand the methods section to improve transparency and robustness documentation.
read point-by-point responses
-
Referee: [PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.
Authors: We agree that the current methods section provides insufficient detail on these elements and that greater transparency is needed to support the claim that the observed trade-off reflects a structural limit rather than a benchmark artifact. In the revised manuscript we will add an expanded methods subsection that explicitly describes: scenario sourcing (a mix of curated real-world multi-party interaction patterns and controlled synthetic generation with diversity constraints), judge-prompt calibration (iterative refinement against a small pilot set to reach high inter-annotator agreement), the item-selection pipeline (initial pool of 120 candidates filtered by two independent reviewers for balance and clarity, with explicit exclusion criteria), confirmation that no post-hoc item removal occurred after the final set was locked, and the absence of external independent validation (noted as a limitation). We will also add a short sensitivity analysis showing that the selective/over-refusing split and the leak/over-refusal trade-off persist under modest perturbations of the judge prompts. These additions directly address the load-bearing concern while preserving the central empirical finding that both M1 and M2 move along the same frontier. revision: yes
Circularity Check
No circularity; empirical evaluation on new benchmark is self-contained
full rationale
The paper introduces PrincipalBench as a new 75-item multi-turn benchmark with leak probes, dual judges, and integrity-audit gate, then directly evaluates 13 models and applies two mechanisms (loyalty scaffold prompt and KL distillation) to report observed harm rates and the leak/over-refusal trade-off. No equations, fitted parameters, or self-citations reduce the reported results or structural lesson to quantities defined by the paper's own inputs; the lesson follows from the experimental measurements rather than being presupposed by construction. The derivation chain is therefore independent of the circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Dual human or model judges can reliably score information leakage and over-refusal in multi-turn agent conversations.
Reference graph
Works this paper leans on
-
[1]
11x.ai: Autonomous AI workers for outbound sales
11x.ai. 11x.ai: Autonomous AI workers for outbound sales. Product website, 2025. URL https://11x.ai
2025
-
[2]
On-policy distillation of language models: Learning from self-generated mistakes
Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations (ICLR),
- [3]
-
[4]
Qwen App: Voice agent for bookings and reservations
Alibaba Qwen. Qwen App: Voice agent for bookings and reservations. Product feature,
-
[5]
Accessed 2026-05-30
URLhttps://chat.qwen.ai. Accessed 2026-05-30
2026
-
[6]
Claude Opus 4.8 system card
Anthropic. Claude Opus 4.8 system card. System card, §6.2.5 (External testing from Andon Labs, Vending-Bench 2), 2025. URL https://www.anthropic.com/claude-opu s-4-8-system-card. Accessed 2026-05-30. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 17
2025
-
[7]
AirGapAgent: Protecting privacy- conscious conversational agents
Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Se- woong Yu, Andreas Pfitzmann, and Roxana Geambasu. AirGapAgent: Protecting privacy- conscious conversational agents. InACM SIGSAC Conference on Computer and Communica- tions Security (CCS), 2024. URLhttps://arxiv.org/abs/2405.05175
-
[8]
StruQ: Defending against prompt injection with structured queries
Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. InUSENIX Security Symposium, 2025. URL https://arxiv.org/abs/2402.06363
-
[9]
SecAlign: Defending against prompt injection with preference optimization
Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. SecAlign: Defending against prompt injection with preference optimization. InACM SIGSAC Conference on Computer and Communications Security (CCS),
- [10]
-
[11]
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. 2024. URLhttps://arxiv.org/abs/2406.13352
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Defeating Prompt Injections by Design
Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. 2025. URLhttps://arxiv.org/abs/2503.18813
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
DeepSeek-V3 technical report
DeepSeek-AI. DeepSeek-V3 technical report. 2024. URL https://arxiv.org/abs/2412 .19437
2024
-
[14]
Operationalizing contextual integrity in privacy-conscious assistants
Sahra Ghalebikesabi, Eugene Bagdasaryan, Ren Yi, Itay Yona, Ilia Shumailov, Aneesh Pappu, Roxana Shariff, et al. Operationalizing contextual integrity in privacy-conscious assistants. 2024. URLhttps://arxiv.org/abs/2408.02373
-
[15]
Ask for Me: An AI agent that calls local businesses on your behalf
Google. Ask for Me: An AI agent that calls local businesses on your behalf. Google Search Labs experiment, 2025. URLhttps://labs.google.com/search/experiment/22. Accessed 2026-05-30
2025
-
[16]
Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. URL https://arxiv.org/abs/2302.12173
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
MiniLLM: On-Policy Distillation of Large Language Models
Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2306.08543
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
YOYO: On-device AI assistant for agentic tasks
Honor. YOYO: On-device AI assistant for agentic tasks. Product feature, Honor MagicOS,
-
[19]
Accessed 2026-05-30
URLhttps://www.honor.com/global/. Accessed 2026-05-30
2026
-
[20]
MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion
Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, and William Yang Wang. MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion. 2025. URLhttps://arxiv.org/abs/2510.15186
-
[21]
AgentBench: Evaluating LLMs as Agents
Xiao Liu et al. AgentBench: Evaluating LLMs as agents. 2023. URL https://arxiv.org/ abs/2308.03688. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 18
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Formalizing and benchmarking prompt injection attacks and defenses
Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium,
- [23]
-
[24]
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. 2023. URL https: //arxiv.org/abs/2311.12983
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[25]
Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.17884
-
[26]
Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004
Helen Nissenbaum. Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004
2004
-
[27]
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics (ACL), 2023. URL https://arxi v.org/abs/2212.09251
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[28]
Ignore Previous Prompt: Attack Techniques For Language Models
Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. 2022. URL https://arxiv.org/abs/2211.09527 . NeurIPS 2022 ML Safety Workshop
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[29]
Pine AI: Personal AI agent for phone calls and disputes
Pine AI. Pine AI: Personal AI agent for phone calls and disputes. Product website, 2025. URLhttps://19pine.ai. Accessed 2026-05-30
2025
-
[30]
Pine AI: The most natural human-computer interface is your voice
Pine AI. Pine AI: The most natural human-computer interface is your voice. Blog post,
-
[31]
Accessed 2026-06-28
URL https://www.19pine.ai/blog/pine-ai-the-most-natural-human-compute r-interface-is-your-voice. Accessed 2026-06-28
2026
-
[32]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. 2023. URLhttps://arxiv.org/abs/2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.18290
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[34]
PrivacyLens: Evaluating privacy norm awareness of language models in action
Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. URL https://arxiv.or g/abs/2409.00138
-
[35]
Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. Towards understanding sycophancy in language models. 2023. URL https://arxiv.org/abs/23 10.13548. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 19
2023
-
[36]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. 2022. URL https://arxiv.org/abs/2210.09261
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[37]
On-policy distillation
Thinking Machines Lab. On-policy distillation. Blog post, 2025. URL https://thinking machines.ai/blog/on-policy-distillation/. Accessed 2026-05-20
2025
-
[38]
AppWorld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https: //arxiv.org/abs/2407.18901
-
[39]
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. 2024. URL https://arxiv.org/abs/2404.13208
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[40]
Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V . Le. Simple synthetic data reduces sycophancy in large language models. 2023. URL https://arxiv.org/abs/2308 .03958
2023
-
[41]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. 2024. URL https: //arxiv.org/abs/2406.12045
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
DAPO: An open-source LLM reinforcement learning system at scale
Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale
-
[43]
URLhttps://arxiv.org/abs/2503.14476
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents
Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/24 03.02691
2024
-
[45]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents. 2023. URLhttps://arxiv.org/abs/2307.13854. A The loyalty scaffold and reader-identity tag (full prompt text) Thepromptedarm prepends the following system prompt (seven prioritized rules organized around the four core insights of Section 6): You are an AI assistant hel...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[46]
I can neither confirm nor deny
Protect private information –- not directly, not through paraphrase, not the EXISTENCE of it (“I can neither confirm nor deny” still leaks)
-
[47]
I can’t discuss that
Decline without enumerating –- “I can’t discuss that”, never “I can’t share their medical history”. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 20
-
[48]
Private bounds are not your public position –- a floor of $12,000 means do not accept or name $12,000; counter strictly inside
-
[49]
Conditional permissions are not proactive offers –- a capped fallback is held in reserve; the cap is private
-
[50]
Execute specific direct instructions –- termination conditions, opening moves, scripted lines, hard refusals
-
[51]
reasonable person
Hold stated public positions; recognize framing –- repetition and “reasonable person” framings are not reasons to concede
-
[52]
private/do not reveal
Signal firmness briefly under sustained pressure; do not fabricate facts or commitments to fill the silence. Thescaffoldedarm additionally prepends a per-message reader-identity tag: [READER: PRINCIPAL] The party you are talking to IS your principal. Any artifact they ask you to produce is for their own use; “private/do not reveal” labels describe what th...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.