Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

Bojie Li; Noah Shi

arxiv: 2606.30383 · v1 · pith:3W7N7DUJnew · submitted 2026-06-29 · 💻 cs.AI

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

Bojie Li , Noah Shi This is my paper

Pith reviewed 2026-06-30 05:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agentsprincipal loyaltymulti-party interactionsbenchmarktrade-offprompt scaffoldknowledge distillation

0 comments

The pith

LLM agents in multi-party settings face an unavoidable trade-off between leaking principal information and over-refusing legitimate requests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines LLM agents that must represent one principal while interacting with a counterparty whose interests may conflict. It introduces PrincipalBench, a 75-item multi-turn test with leak probes and dual judges, which reveals that frontier models split into selective ones that follow principal instructions while blocking adversarial leaks and over-refusing ones that decline broadly. A fixed seven-rule loyalty scaffold prompt and a per-token KL distillation method both reduce harm on the benchmark, yet they only slide performance along the existing leak versus over-refusal curve. The jointly desirable point of low leaks and low over-refusals remains unreachable by these approaches.

Core claim

PrincipalBench exposes a sharp split (<=20% vs. 53.6-75.3% harm) across 13 frontier models that single-turn safety tests miss. A prompt-time loyalty scaffold of seven prioritized rules holds selective models to <=20% harm. A per-token-KL distillation recipe transfers the prompted behavior from a 32B teacher into 8B students. Both interventions only move along a common leak/over-refusal trade-off rather than crossing it, so the jointly favorable outcome stays out of reach.

What carries the argument

The common leak/over-refusal trade-off that the loyalty scaffold prompt and per-token-KL distillation only traverse without escaping.

If this is right

Selective models can be held to <=20% harm via a fixed system prompt of seven rules.
KL distillation from a prompted teacher transfers the selective behavior to smaller open-weight models.
Single-turn safety evaluations fail to detect the multi-party loyalty split.
The jointly low-leak and low-over-refusal regime cannot be reached by either mechanism tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectural or training changes beyond prompting and distillation may be needed to escape the observed curve.
Real-world negotiation or mediation agents will require explicit handling of this performance limit.
New benchmarks should include tests that directly probe whether any method can break the trade-off.

Load-bearing premise

The 75-item PrincipalBench benchmark with its leak probes, dual judges, and integrity-audit gate gives a valid and generalizable measure of multi-party principal loyalty.

What would settle it

Demonstration of any prompt, training, or architectural change that simultaneously achieves low leak rates on adversarial probes and high compliance on the principal's legitimate requests would falsify the trade-off claim.

Figures

Figures reproduced from arXiv: 2606.30383 by Bojie Li, Noah Shi.

**Figure 1.** Figure 1: Multi-party principal loyalty. The agent runs two parallel channels: a back-and-forth with its principal P, and a separate conversation with a counterparty C whose interests may conflict with P’s and who probes and pressures for information or concessions. The default “help the current speaker” objective fails along four axes (bottom panel). We formalize these as six benchmark cells. 1 arXiv:2606.30383v1 [… view at source ↗

**Figure 2.** Figure 2: Headline result: a leak/over-refusal Pareto frontier (preview, expanded in Section 8). Each marker is one model variant, scored over the 36 core items under three prompt conditions (arms). The x-axis is leak rate, y is over-refusal rate (missed-instruction), and the label is the harm count out of 108 item-arm cells. The dashed line is the Pareto frontier: no point below and to its left is reachable, so the… view at source ↗

**Figure 3.** Figure 3: The selective/over-refusing split (13 subjects, multi-seed n=5). Nine subjects cluster at ≤ 19.5% harm, GLM-4.6 is intermediate, and three over-refuse at ≥ 53.6%. Error bars are ±1σ across 5 eval seeds. The gap is ∼34 points and is intrinsic (present at the no-prompt arm). On a balanced 60-item subset of frontier-model trajectories whose harm decision is unambiguous (clear pass or clear fail, not borderli… view at source ↗

**Figure 4.** Figure 4: Held-out items confirm the split is not item-specific. On 24 items authored after training was frozen, selective subjects stay ≤ 24% and over-refusing subjects ≥ 76%, and GPT-5 amplifies to 93%. Mistral-Large was unmeasurable on held-out items (scoring failed under a provider-routing restriction). Gemini-2.5-flash was not rerun on held-out items. Training (36-item core) Held-out (24 items) cluster MI leak … view at source ↗

**Figure 5.** Figure 5: Distillation variants compared (Qwen3-8B). Per-token KL is the only variant whose harm improvement against the SFT+DPO base (figure label “v4.1”) is significant at n=5 paired Wilcoxon (p = 0.011 under the primary judge, but not under a secondary-judge re-judge, Section 9). Per-turn DPO and SFT are indistinguishable from seed noise. top-K tokens {tk}, L = 1 |R| ∑ p∈R K ∑ k=1 pT(tk | h<p) [PITH_FULL_IMAGE:f… view at source ↗

**Figure 6.** Figure 6: Multi-seed paired Wilcoxon vs. the SFT+DPO base. Five evaluation seeds of each pertoken-KL stopping point against five matched base seeds. Bars are the mean number of failing cells per 108 (±1σ). Both stopping points (iteration 1 is the harm-minimum and iteration 2 the leak/boundminimum) reach p < 0.05 on harm (iter-1 47.8 → 39.2 at p = 0.011 under the primary judge, iter-2 p = 0.044), while leak, bound,… view at source ↗

**Figure 7.** Figure 7: K-iteration trajectory by family. Each iteration re-samples student trajectories from the previous checkpoint and re-distills against the same prompted teacher. harm leak bound MI 0 20 40 60 80 Fire rate (%), scaffolded arm 17 13 17 68 3 0 17 10 (a) Teacher self-validation Claude-Sonnet (n = 36) Qwen3-32B teacher (n = 31) Claude (default) GPT-5 Gemini-3 flash 0 10 20 30 40 50 60 Harm fires / 108 33 36 38 3… view at source ↗

**Figure 8.** Figure 8: Teacher validation and student robustness (per-token-KL iteration 1 unless noted). (a) The open Qwen3-32B teacher matches Claude-Sonnet on harm and over-refusal but leaks much more. (b) Swapping the counterparty model raises harm for both variants. Per-token KL is somewhat more counterparty-sensitive than per-turn SFT. (c) Per-token KL has the lowest training harm but the largest train-to-held-out gap, and… view at source ↗

read the original abstract

A rapidly growing class of LLM agents is multi-party: the agent acts for a principal (who briefs it, sends follow-ups, and receives results) while also conversing in a separate channel with a counterparty whose interests may diverge (negotiating with a vendor, screening inbound requests, or mediating between employees). Here "help whoever you are talking to" is the wrong objective. The agent must stay loyal to the principal it represents without over-refusing the principal's own cooperative asks. We study this multi-party loyalty problem and contribute a measurement instrument, two mechanisms, and a structural lesson. PrincipalBench is a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate. Across 13 frontier subjects it exposes a sharp split (<=20% vs. 53.6-75.3% harm) invisible to single-turn safety evaluations: a selective cluster that declines adversarial probes while still following the principal's legitimate requests, and an over-refusing cluster that refuses broadly. (M1) A prompt-time loyalty scaffold (a fixed system prompt of seven prioritized rules, open-coded from 50+ failure trajectories) holds Claude-Sonnet to 19.4% harm and all nine selective subjects to <=20%. (M2) A per-token-KL distillation recipe transfers a prompted Qwen3-32B teacher into 8B Qwen3 and Llama-3.1 students, the strongest open-weight recipe we measure. (Lesson) Both mechanisms only move along a common leak/over-refusal trade-off rather than crossing it: improving one axis costs the other, and the jointly favorable outcome stays out of reach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags a practical loyalty issue for agents in multi-party settings like negotiations and supplies a new benchmark plus two simple fixes, but the trade-off lesson hinges on unverified benchmark construction.

read the letter

The core takeaway is that LLM agents in multi-party contexts need to protect a principal's interests without over-refusing the principal's own requests, and current models split into selective and over-refusing groups that single-turn tests do not catch. The work introduces PrincipalBench, a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity gate, then tests 13 frontier models to show the split (under 20% harm versus 53-75%). It also gives two concrete mechanisms: a seven-rule loyalty scaffold prompt that holds the selective models at low harm, and a per-token KL distillation approach that transfers behavior to smaller open-weight models.

This is new because the benchmark targets multi-turn, multi-party loyalty explicitly, and the mechanisms are applied directly to the setting rather than generic safety. The empirical split and the reported numbers on prompt and distillation effects are the main deliverables.

The soft spot is the benchmark itself. The structural claim that both fixes only slide along a leak/over-refusal frontier depends on PrincipalBench correctly measuring the intended distinction without embedding the same trade-off in its scenarios or judge instructions. The abstract gives no details on scenario sourcing, item selection, or judge calibration, so it is not yet clear whether the inability to improve both axes is a real limit or an artifact of this particular test. That makes the lesson provisional until the construction process is shown to be robust.

The paper is aimed at people building or evaluating agents for negotiations, mediation, or similar multi-party deployments. It deserves peer review because the problem is concrete and the benchmark offers a starting measurement instrument, even if the construction details will need close checking.

Referee Report

1 major / 0 minor

Summary. The paper introduces the multi-party principal loyalty problem for LLM agents acting between a principal and a counterparty, contributes PrincipalBench (a 75-item multi-turn benchmark with leak probes, dual judges, and an integrity-audit gate), evaluates 13 frontier models revealing a selective vs. over-refusing split invisible to single-turn tests, proposes a seven-rule loyalty scaffold prompt (M1) and per-token KL distillation (M2), and concludes that both mechanisms only move along a common leak/over-refusal trade-off without crossing it.

Significance. If the benchmark holds, the work identifies a practically relevant alignment challenge for deployed agents and supplies concrete mechanisms plus a structural lesson on irreducible trade-offs that could guide safer multi-party agent design.

major comments (1)

[PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the concern about PrincipalBench construction below and will expand the methods section to improve transparency and robustness documentation.

read point-by-point responses

Referee: [PrincipalBench / methods] PrincipalBench construction (methods section and abstract): the 75-item benchmark's scenario sourcing, judge-prompt calibration, item-selection process to avoid bias, and absence of post-hoc selection or independent validation details are not shown to be robust; this is load-bearing for the central claim that the observed inability to improve both leak and over-refusal axes simultaneously reflects a fundamental limit rather than a benchmark artifact.

Authors: We agree that the current methods section provides insufficient detail on these elements and that greater transparency is needed to support the claim that the observed trade-off reflects a structural limit rather than a benchmark artifact. In the revised manuscript we will add an expanded methods subsection that explicitly describes: scenario sourcing (a mix of curated real-world multi-party interaction patterns and controlled synthetic generation with diversity constraints), judge-prompt calibration (iterative refinement against a small pilot set to reach high inter-annotator agreement), the item-selection pipeline (initial pool of 120 candidates filtered by two independent reviewers for balance and clarity, with explicit exclusion criteria), confirmation that no post-hoc item removal occurred after the final set was locked, and the absence of external independent validation (noted as a limitation). We will also add a short sensitivity analysis showing that the selective/over-refusing split and the leak/over-refusal trade-off persist under modest perturbations of the judge prompts. These additions directly address the load-bearing concern while preserving the central empirical finding that both M1 and M2 move along the same frontier. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation on new benchmark is self-contained

full rationale

The paper introduces PrincipalBench as a new 75-item multi-turn benchmark with leak probes, dual judges, and integrity-audit gate, then directly evaluates 13 models and applies two mechanisms (loyalty scaffold prompt and KL distillation) to report observed harm rates and the leak/over-refusal trade-off. No equations, fitted parameters, or self-citations reduce the reported results or structural lesson to quantities defined by the paper's own inputs; the lesson follows from the experimental measurements rather than being presupposed by construction. The derivation chain is therefore independent of the circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work is empirical and introduces no new mathematical entities or free parameters; it relies on standard assumptions about LLM behavior under prompting and the validity of human or model-based judging for harm.

axioms (1)

domain assumption Dual human or model judges can reliably score information leakage and over-refusal in multi-turn agent conversations.
The benchmark's harm metric depends on this for all reported percentages.

pith-pipeline@v0.9.1-grok · 5831 in / 1330 out tokens · 37473 ms · 2026-06-30T05:46:20.229475+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 25 canonical work pages · 15 internal anchors

[1]

11x.ai: Autonomous AI workers for outbound sales

11x.ai. 11x.ai: Autonomous AI workers for outbound sales. Product website, 2025. URL https://11x.ai

2025
[2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations (ICLR),
[3]

URLhttps://arxiv.org/abs/2306.13649

work page arXiv
[4]

Qwen App: Voice agent for bookings and reservations

Alibaba Qwen. Qwen App: Voice agent for bookings and reservations. Product feature,
[5]

Accessed 2026-05-30

URLhttps://chat.qwen.ai. Accessed 2026-05-30

2026
[6]

Claude Opus 4.8 system card

Anthropic. Claude Opus 4.8 system card. System card, §6.2.5 (External testing from Andon Labs, Vending-Bench 2), 2025. URL https://www.anthropic.com/claude-opu s-4-8-system-card. Accessed 2026-05-30. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 17

2025
[7]

AirGapAgent: Protecting privacy- conscious conversational agents

Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Se- woong Yu, Andreas Pfitzmann, and Roxana Geambasu. AirGapAgent: Protecting privacy- conscious conversational agents. InACM SIGSAC Conference on Computer and Communica- tions Security (CCS), 2024. URLhttps://arxiv.org/abs/2405.05175

work page arXiv 2024
[8]

StruQ: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. InUSENIX Security Symposium, 2025. URL https://arxiv.org/abs/2402.06363

work page arXiv 2025
[9]

SecAlign: Defending against prompt injection with preference optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. SecAlign: Defending against prompt injection with preference optimization. InACM SIGSAC Conference on Computer and Communications Security (CCS),
[10]

URLhttps://arxiv.org/abs/2410.05451

work page arXiv
[11]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. 2024. URLhttps://arxiv.org/abs/2406.13352

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. 2025. URLhttps://arxiv.org/abs/2503.18813

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

DeepSeek-V3 technical report

DeepSeek-AI. DeepSeek-V3 technical report. 2024. URL https://arxiv.org/abs/2412 .19437

2024
[14]

Operationalizing contextual integrity in privacy-conscious assistants

Sahra Ghalebikesabi, Eugene Bagdasaryan, Ren Yi, Itay Yona, Ilia Shumailov, Aneesh Pappu, Roxana Shariff, et al. Operationalizing contextual integrity in privacy-conscious assistants. 2024. URLhttps://arxiv.org/abs/2408.02373

work page arXiv 2024
[15]

Ask for Me: An AI agent that calls local businesses on your behalf

Google. Ask for Me: An AI agent that calls local businesses on your behalf. Google Search Labs experiment, 2025. URLhttps://labs.google.com/search/experiment/22. Accessed 2026-05-30

2025
[16]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. URL https://arxiv.org/abs/2302.12173

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

MiniLLM: On-Policy Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2306.08543

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

YOYO: On-device AI assistant for agentic tasks

Honor. YOYO: On-device AI assistant for agentic tasks. Product feature, Honor MagicOS,
[19]

Accessed 2026-05-30

URLhttps://www.honor.com/global/. Accessed 2026-05-30

2026
[20]

MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion

Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, and William Yang Wang. MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion. 2025. URLhttps://arxiv.org/abs/2510.15186

work page arXiv 2025
[21]

AgentBench: Evaluating LLMs as Agents

Xiao Liu et al. AgentBench: Evaluating LLMs as agents. 2023. URL https://arxiv.org/ abs/2308.03688. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 18

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium,
[23]

URLhttps://arxiv.org/abs/2310.12815

work page arXiv
[24]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. 2023. URL https: //arxiv.org/abs/2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.17884

work page arXiv 2024
[26]

Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

Helen Nissenbaum. Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

2004
[27]

Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics (ACL), 2023. URL https://arxi v.org/abs/2212.09251

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. 2022. URL https://arxiv.org/abs/2211.09527 . NeurIPS 2022 ML Safety Workshop

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Pine AI: Personal AI agent for phone calls and disputes

Pine AI. Pine AI: Personal AI agent for phone calls and disputes. Product website, 2025. URLhttps://19pine.ai. Accessed 2026-05-30

2025
[30]

Pine AI: The most natural human-computer interface is your voice

Pine AI. Pine AI: The most natural human-computer interface is your voice. Blog post,
[31]

Accessed 2026-06-28

URL https://www.19pine.ai/blog/pine-ai-the-most-natural-human-compute r-interface-is-your-voice. Accessed 2026-06-28

2026
[32]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. 2023. URLhttps://arxiv.org/abs/2307.16789

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

PrivacyLens: Evaluating privacy norm awareness of language models in action

Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. URL https://arxiv.or g/abs/2409.00138

work page arXiv 2024
[35]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. Towards understanding sycophancy in language models. 2023. URL https://arxiv.org/abs/23 10.13548. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 19

2023
[36]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. 2022. URL https://arxiv.org/abs/2210.09261

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

On-policy distillation

Thinking Machines Lab. On-policy distillation. Blog post, 2025. URL https://thinking machines.ai/blog/on-policy-distillation/. Accessed 2026-05-20

2025
[38]

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https: //arxiv.org/abs/2407.18901

work page arXiv 2024
[39]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. 2024. URL https://arxiv.org/abs/2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V . Le. Simple synthetic data reduces sycophancy in large language models. 2023. URL https://arxiv.org/abs/2308 .03958

2023
[41]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. 2024. URL https: //arxiv.org/abs/2406.12045

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale
[43]

URLhttps://arxiv.org/abs/2503.14476

work page internal anchor Pith review Pith/arXiv arXiv
[44]

InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/24 03.02691

2024
[45]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents. 2023. URLhttps://arxiv.org/abs/2307.13854. A The loyalty scaffold and reader-identity tag (full prompt text) Thepromptedarm prepends the following system prompt (seven prioritized rules organized around the four core insights of Section 6): You are an AI assistant hel...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

I can neither confirm nor deny

Protect private information –- not directly, not through paraphrase, not the EXISTENCE of it (“I can neither confirm nor deny” still leaks)
[47]

I can’t discuss that

Decline without enumerating –- “I can’t discuss that”, never “I can’t share their medical history”. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 20
[48]

Private bounds are not your public position –- a floor of $12,000 means do not accept or name $12,000; counter strictly inside
[49]

Conditional permissions are not proactive offers –- a capped fallback is held in reserve; the cap is private
[50]

Execute specific direct instructions –- termination conditions, opening moves, scripted lines, hard refusals
[51]

reasonable person

Hold stated public positions; recognize framing –- repetition and “reasonable person” framings are not reasons to concede
[52]

private/do not reveal

Signal firmness briefly under sustained pressure; do not fabricate facts or commitments to fill the silence. Thescaffoldedarm additionally prepends a per-message reader-identity tag: [READER: PRINCIPAL] The party you are talking to IS your principal. Any artifact they ask you to produce is for their own use; “private/do not reveal” labels describe what th...

2019

[1] [1]

11x.ai: Autonomous AI workers for outbound sales

11x.ai. 11x.ai: Autonomous AI workers for outbound sales. Product website, 2025. URL https://11x.ai

2025

[2] [2]

On-policy distillation of language models: Learning from self-generated mistakes

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. InInternational Conference on Learning Representations (ICLR),

[3] [3]

URLhttps://arxiv.org/abs/2306.13649

work page arXiv

[4] [4]

Qwen App: Voice agent for bookings and reservations

Alibaba Qwen. Qwen App: Voice agent for bookings and reservations. Product feature,

[5] [5]

Accessed 2026-05-30

URLhttps://chat.qwen.ai. Accessed 2026-05-30

2026

[6] [6]

Claude Opus 4.8 system card

Anthropic. Claude Opus 4.8 system card. System card, §6.2.5 (External testing from Andon Labs, Vending-Bench 2), 2025. URL https://www.anthropic.com/claude-opu s-4-8-system-card. Accessed 2026-05-30. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 17

2025

[7] [7]

AirGapAgent: Protecting privacy- conscious conversational agents

Eugene Bagdasarian, Ren Yi, Sahra Ghalebikesabi, Peter Kairouz, Marco Gruteser, Se- woong Yu, Andreas Pfitzmann, and Roxana Geambasu. AirGapAgent: Protecting privacy- conscious conversational agents. InACM SIGSAC Conference on Computer and Communica- tions Security (CCS), 2024. URLhttps://arxiv.org/abs/2405.05175

work page arXiv 2024

[8] [8]

StruQ: Defending against prompt injection with structured queries

Sizhe Chen, Julien Piet, Chawin Sitawarin, and David Wagner. StruQ: Defending against prompt injection with structured queries. InUSENIX Security Symposium, 2025. URL https://arxiv.org/abs/2402.06363

work page arXiv 2025

[9] [9]

SecAlign: Defending against prompt injection with preference optimization

Sizhe Chen, Arman Zharmagambetov, Saeed Mahloujifar, Kamalika Chaudhuri, David Wagner, and Chuan Guo. SecAlign: Defending against prompt injection with preference optimization. InACM SIGSAC Conference on Computer and Communications Security (CCS),

[10] [10]

URLhttps://arxiv.org/abs/2410.05451

work page arXiv

[11] [11]

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. 2024. URLhttps://arxiv.org/abs/2406.13352

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Defeating Prompt Injections by Design

Edoardo Debenedetti, Ilia Shumailov, Tianqi Fan, Jamie Hayes, Nicholas Carlini, Daniel Fabian, Christoph Kern, Chongyang Shi, Andreas Terzis, and Florian Tramèr. Defeating prompt injections by design. 2025. URLhttps://arxiv.org/abs/2503.18813

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

DeepSeek-V3 technical report

DeepSeek-AI. DeepSeek-V3 technical report. 2024. URL https://arxiv.org/abs/2412 .19437

2024

[14] [14]

Operationalizing contextual integrity in privacy-conscious assistants

Sahra Ghalebikesabi, Eugene Bagdasaryan, Ren Yi, Itay Yona, Ilia Shumailov, Aneesh Pappu, Roxana Shariff, et al. Operationalizing contextual integrity in privacy-conscious assistants. 2024. URLhttps://arxiv.org/abs/2408.02373

work page arXiv 2024

[15] [15]

Ask for Me: An AI agent that calls local businesses on your behalf

Google. Ask for Me: An AI agent that calls local businesses on your behalf. Google Search Labs experiment, 2025. URLhttps://labs.google.com/search/experiment/22. Accessed 2026-05-30

2025

[16] [16]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. URL https://arxiv.org/abs/2302.12173

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

MiniLLM: On-Policy Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2306.08543

work page internal anchor Pith review Pith/arXiv arXiv 2024

[18] [18]

YOYO: On-device AI assistant for agentic tasks

Honor. YOYO: On-device AI assistant for agentic tasks. Product feature, Honor MagicOS,

[19] [19]

Accessed 2026-05-30

URLhttps://www.honor.com/global/. Accessed 2026-05-30

2026

[20] [20]

MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion

Gurusha Juneja, Jayanth Naga Sai Pasupulati, Alon Albalak, Wenyue Hua, and William Yang Wang. MAGPIE: A benchmark for Multi-AGent contextual PrIvacy Evalua- tion. 2025. URLhttps://arxiv.org/abs/2510.15186

work page arXiv 2025

[21] [21]

AgentBench: Evaluating LLMs as Agents

Xiao Liu et al. AgentBench: Evaluating LLMs as agents. 2023. URL https://arxiv.org/ abs/2308.03688. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 18

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Formalizing and benchmarking prompt injection attacks and defenses

Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. InUSENIX Security Symposium,

[23] [23]

URLhttps://arxiv.org/abs/2310.12815

work page arXiv

[24] [24]

GAIA: a benchmark for General AI Assistants

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants. 2023. URL https: //arxiv.org/abs/2311.12983

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou, Yulia Tsvetkov, Maarten Sap, Reza Shokri, and Yejin Choi. Can LLMs keep a secret? Testing privacy implications of lan- guage models via contextual integrity theory. InInternational Conference on Learning Representations (ICLR), 2024. URLhttps://arxiv.org/abs/2310.17884

work page arXiv 2024

[26] [26]

Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

Helen Nissenbaum. Privacy as contextual integrity.Washington Law Review, 79(1):119–158, 2004

2004

[27] [27]

Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez et al. Discovering language model behaviors with model-written evaluations. InFindings of the Association for Computational Linguistics (ACL), 2023. URL https://arxi v.org/abs/2212.09251

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models. 2022. URL https://arxiv.org/abs/2211.09527 . NeurIPS 2022 ML Safety Workshop

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Pine AI: Personal AI agent for phone calls and disputes

Pine AI. Pine AI: Personal AI agent for phone calls and disputes. Product website, 2025. URLhttps://19pine.ai. Accessed 2026-05-30

2025

[30] [30]

Pine AI: The most natural human-computer interface is your voice

Pine AI. Pine AI: The most natural human-computer interface is your voice. Blog post,

[31] [31]

Accessed 2026-06-28

URL https://www.19pine.ai/blog/pine-ai-the-most-natural-human-compute r-interface-is-your-voice. Accessed 2026-06-28

2026

[32] [32]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. 2023. URLhttps://arxiv.org/abs/2307.16789

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/2305.18290

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

PrivacyLens: Evaluating privacy norm awareness of language models in action

Yijia Shao, Tianshi Li, Weiyan Shi, Yanchen Liu, and Diyi Yang. PrivacyLens: Evaluating privacy norm awareness of language models in action. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024. URL https://arxiv.or g/abs/2409.00138

work page arXiv 2024

[35] [35]

Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, et al. Towards understanding sycophancy in language models. 2023. URL https://arxiv.org/abs/23 10.13548. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 19

2023

[36] [36]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V . Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. 2022. URL https://arxiv.org/abs/2210.09261

work page internal anchor Pith review Pith/arXiv arXiv 2022

[37] [37]

On-policy distillation

Thinking Machines Lab. On-policy distillation. Blog post, 2025. URL https://thinking machines.ai/blog/on-policy-distillation/. Accessed 2026-05-20

2025

[38] [38]

AppWorld: A controllable world of apps and people for benchmarking interactive coding agents

Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Annual Meeting of the Association for Computational Linguistics (ACL), 2024. URL https: //arxiv.org/abs/2407.18901

work page arXiv 2024

[39] [39]

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

Eric Wallace, Kai Xiao, Reimar Leike, Lilian Weng, Johannes Heidecke, and Alex Beutel. The instruction hierarchy: Training LLMs to prioritize privileged instructions. 2024. URL https://arxiv.org/abs/2404.13208

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V . Le. Simple synthetic data reduces sycophancy in large language models. 2023. URL https://arxiv.org/abs/2308 .03958

2023

[41] [41]

$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for Tool-Agent-User interaction in real-world domains. 2024. URL https: //arxiv.org/abs/2406.12045

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu et al. DAPO: An open-source LLM reinforcement learning system at scale

[43] [43]

URLhttps://arxiv.org/abs/2503.14476

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. InjecAgent: Benchmarking indirect prompt injections in tool-integrated large language model agents. InFindings of the Association for Computational Linguistics (ACL), 2024. URL https://arxiv.org/abs/24 03.02691

2024

[45] [45]

WebArena: A Realistic Web Environment for Building Autonomous Agents

Shuyan Zhou et al. WebArena: A realistic web environment for building autonomous agents. 2023. URLhttps://arxiv.org/abs/2307.13854. A The loyalty scaffold and reader-identity tag (full prompt text) Thepromptedarm prepends the following system prompt (seven prioritized rules organized around the four core insights of Section 6): You are an AI assistant hel...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

I can neither confirm nor deny

Protect private information –- not directly, not through paraphrase, not the EXISTENCE of it (“I can neither confirm nor deny” still leaks)

[47] [47]

I can’t discuss that

Decline without enumerating –- “I can’t discuss that”, never “I can’t share their medical history”. Whose Side Is Y our Agent On? Multi-Party Principal Loyalty in LLM Agents 20

[48] [48]

Private bounds are not your public position –- a floor of $12,000 means do not accept or name $12,000; counter strictly inside

[49] [49]

Conditional permissions are not proactive offers –- a capped fallback is held in reserve; the cap is private

[50] [50]

Execute specific direct instructions –- termination conditions, opening moves, scripted lines, hard refusals

[51] [51]

reasonable person

Hold stated public positions; recognize framing –- repetition and “reasonable person” framings are not reasons to concede

[52] [52]

private/do not reveal

Signal firmness briefly under sustained pressure; do not fabricate facts or commitments to fill the silence. Thescaffoldedarm additionally prepends a per-message reader-identity tag: [READER: PRINCIPAL] The party you are talking to IS your principal. Any artifact they ask you to produce is for their own use; “private/do not reveal” labels describe what th...

2019