Privacy-R1: Privacy-Aware Multi-LLM Agent Collaboration via Reinforcement Learning

Ehsan Shareghi; Nigel Collier; Sanhanat Sivapiromrat; Yijiang River Dong; Zheng Hui

arxiv: 2510.16054 · v2 · submitted 2025-10-16 · 💻 cs.CR · cs.CL

Zheng Hui, Yijiang River Dong, Sanhanat Sivapiromrat, Ehsan Shareghi, Nigel Collier

Pith reviewed 2026-05-18 05:46 UTC · model grok-4.3

keywords privacy · reinforcement learning · LLM routing · PII protection · multi-model delegation · privacy-utility trade-off · sensitive data handling

The pith

Reinforcement learning trains an agent to route query chunks between local and remote models, shielding replaceable personal details while sending task-critical ones for better results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats the choice of sending sensitive user queries to powerful but exposed models versus safe but weaker local ones as a learning problem rather than a fixed rule set. Static rewriting approaches often strip too much or too little context and break the original meaning. By contrast, an RL agent learns through repeated trials which pieces of information can stay hidden locally and which must travel to achieve accurate answers. This matters for any setting where prompts mix routine personal facts with details essential to the task, such as medical queries. Experiments on a new high-density medical dataset show the learned policy reaches a stronger privacy-utility point than earlier methods.

Core claim

Privacy-R1 reformulates privacy-conscious delegation as a sequential decision process in which a reinforcement learning agent selects a routing action for each successive text chunk. The agent's reward credits task success and penalizes leakage of private information, allowing it to develop an implicit policy that keeps replaceable PII local while forwarding task-critical PII to a remote model. The paper reports that this yields a new state-of-the-art balance on the privacy-utility frontier.
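The decision process described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Chunk` and `RoutingEpisode` names, the per-chunk `gain` and `leak` numbers, the example sentences, and the value of `lam` are all hypothetical. The terminal reward uses the form quoted elsewhere on this page, R = TaskGain − λ·(PrivacyLeak)², under the added assumption that gain and leakage simply sum over the chunks sent to the remote model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

LOCAL, REMOTE = 0, 1  # binary routing action for one chunk


@dataclass
class Chunk:
    text: str
    gain: float  # assumed utility added if the remote model sees this chunk
    leak: float  # assumed privacy cost if the remote model sees this chunk


@dataclass
class RoutingEpisode:
    """One query, split into chunks and routed one chunk at a time."""
    chunks: List[Chunk]
    lam: float = 0.5  # assumed leakage weight (lambda)
    actions: List[int] = field(default_factory=list)

    def state(self) -> Tuple[int, Tuple[int, ...]]:
        # State: index of the next chunk plus the decisions made so far.
        return len(self.actions), tuple(self.actions)

    def done(self) -> bool:
        return len(self.actions) == len(self.chunks)

    def step(self, action: int) -> float:
        """Record one routing decision; the reward arrives only at the end."""
        if self.done():
            raise RuntimeError("episode already finished")
        if action not in (LOCAL, REMOTE):
            raise ValueError("action must be LOCAL or REMOTE")
        self.actions.append(action)
        return self.reward() if self.done() else 0.0

    def reward(self) -> float:
        sent = [c for c, a in zip(self.chunks, self.actions) if a == REMOTE]
        task_gain = sum(c.gain for c in sent)
        privacy_leak = sum(c.leak for c in sent)
        return task_gain - self.lam * privacy_leak ** 2


# Made-up example: keep the name local, send the medication and the question.
episode = RoutingEpisode([
    Chunk("My name is Jane Doe.", gain=0.1, leak=1.0),
    Chunk("I take warfarin daily.", gain=2.0, leak=1.0),
    Chunk("Can I also take ibuprofen?", gain=1.0, leak=0.0),
])
for a in (LOCAL, REMOTE, REMOTE):
    r = episode.step(a)
print(r)  # 2.5
```

Because the reward arrives only after the last chunk, the agent has to work out which individual routing decision earned or cost it, which is the credit-assignment question the referee report raises.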

What carries the argument

The reinforcement learning agent that outputs routing decisions for text chunks, trained end-to-end on a reward combining measured privacy leakage with downstream task performance.

If this is right

Adaptive chunk routing preserves linguistic coherence better than indiscriminate rewriting rules.
Task performance improves when only necessary sensitive content reaches the remote model.
The same training approach applies to other high-PII domains without hand-crafted PII taxonomies.
Explicit supervision on PII categories becomes unnecessary once the reward function is properly shaped.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could scale to choosing among more than two models by extending the action space of the same agent.
Policies trained on medical text might transfer to legal or financial queries if the reward structure stays consistent.
Real deployments could add human feedback loops that refine the reward when leakage or errors occur after deployment.

Load-bearing premise

The agent can learn which personal information is safe to keep local and which is essential to the task purely from a combined reward signal without any separate labels or supervision identifying the two kinds.
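Whether a scalar reward alone can separate the two kinds of PII is an empirical question that the paper has to answer on real text. The toy below only shows that the mechanism is possible in principle. It uses plain REINFORCE with a running-mean baseline, a stand-in chosen for brevity and not necessarily the paper's training algorithm. The three chunk types, their `GAIN` and `LEAK` values, and `LAM` are invented for illustration. The policy sees only a type index, never a "private" or "task-critical" label.

```python
import math
import random

# Hypothetical chunk types: 0 = plain text, 1 = replaceable PII, 2 = critical PII.
# The index stands in for whatever features a real agent would observe.
GAIN = [1.0, 0.1, 2.0]  # assumed utility if the chunk is sent to the remote model
LEAK = [0.0, 1.0, 1.0]  # assumed privacy cost if the chunk is sent remote
LAM = 0.5               # assumed leakage weight


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(actions) -> float:
    """R = TaskGain - LAM * PrivacyLeak**2 over the chunks sent remote."""
    gain = sum(g for g, a in zip(GAIN, actions) if a)
    leak = sum(l for l, a in zip(LEAK, actions) if a)
    return gain - LAM * leak ** 2


def train(episodes: int = 3000, lr: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]  # one logit per chunk type; P(remote) = sigmoid(logit)
    baseline = 0.0           # running mean reward, reduces gradient variance
    for _ in range(episodes):
        probs = [sigmoid(t) for t in theta]
        actions = [1 if rng.random() < p else 0 for p in probs]
        r = reward(actions)
        advantage = r - baseline
        # REINFORCE: d/d(logit) of log Bernoulli(a; p) equals (a - p).
        for i in range(len(theta)):
            theta[i] += lr * advantage * (actions[i] - probs[i])
        baseline += 0.05 * (r - baseline)
    return [sigmoid(t) for t in theta]


p_plain, p_replaceable, p_critical = train()
print(f"P(remote): plain={p_plain:.2f} "
      f"replaceable={p_replaceable:.2f} critical={p_critical:.2f}")
```

In this toy the learned probability of sending the critical type remote climbs toward 1 while the replaceable type falls toward 0, driven only by the episode-level reward. Real text is much harder: the agent must infer the chunk kind from content, and gain and leakage are measured by downstream models, not read from a table.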

What would settle it

Measure routing accuracy on held-out queries that contain both replaceable identifiers and task-critical medical facts; if the learned policy sends critical facts to the local model at rates no better than static baselines, the central claim fails.
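The test proposed above reduces to a per-category routing rate. The sketch below is one hedged way to compute it. The function names, the label vocabulary ("replaceable", "critical", "other"), and the example decisions are hypothetical, and the labels are used only for scoring, never shown to the policy.

```python
from typing import Dict, Sequence

LOCAL, REMOTE = "local", "remote"


def routing_rates(labels: Sequence[str], routes: Sequence[str]) -> Dict[str, float]:
    """Fraction of each labelled chunk kind that was routed to the remote model.

    labels: evaluation-only annotations, one per chunk.
    routes: the policy's decision for the same chunks, LOCAL or REMOTE.
    """
    if len(labels) != len(routes):
        raise ValueError("labels and routes must align chunk by chunk")
    totals: Dict[str, int] = {}
    remote: Dict[str, int] = {}
    for kind, route in zip(labels, routes):
        totals[kind] = totals.get(kind, 0) + 1
        remote[kind] = remote.get(kind, 0) + (route == REMOTE)
    return {kind: remote[kind] / totals[kind] for kind in totals}


def critical_kept_local(labels: Sequence[str], routes: Sequence[str]) -> float:
    """Rate at which task-critical chunks stayed local (lower is better)."""
    return 1.0 - routing_rates(labels, routes)["critical"]


# Made-up held-out chunks and decisions, for illustration only.
labels = ["replaceable", "critical", "other", "critical", "replaceable", "critical"]
learned = [LOCAL, REMOTE, REMOTE, REMOTE, LOCAL, LOCAL]
static = [LOCAL, LOCAL, REMOTE, LOCAL, LOCAL, LOCAL]  # redacts every PII chunk

learned_rate = critical_kept_local(labels, learned)
static_rate = critical_kept_local(labels, static)
print(learned_rate < static_rate)  # True
```

The central claim would be in trouble if, on real held-out data, `learned_rate` were not clearly below `static_rate`.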

Original abstract

When users submit queries to Large Language Models (LLMs), their prompts can often contain sensitive data, forcing a difficult choice: send the query to a powerful proprietary LLM provider to achieve state-of-the-art performance and risk data exposure, or rely on smaller, local models that guarantee data privacy but often degrade task performance. Prior approaches have relied on static pipelines that use LLM rewriting, which shatters linguistic coherence and indiscriminately removes privacy-sensitive information, including task-critical content. We reformulate this challenge (Privacy-Conscious Delegation) as a sequential decision-making problem and introduce a novel reinforcement learning (RL) framework called Privacy-R1 to solve it. Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) (which it shields locally) and task-critical PII (which it strategically sends to the remote model for maximal utility). To validate our approach in complex scenarios, we also introduce a new medical dataset with high PII density. Our framework achieves a new state-of-the-art on the privacy-utility frontier, demonstrating the necessity of learned, adaptive policies for deploying LLMs in sensitive environments. The dataset can be found at: https://github.com/zackhuiiiii/Privacy-R1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Privacy-R1, a reinforcement learning framework that reformulates privacy-conscious delegation as a sequential decision-making problem. An agent is trained to dynamically route text chunks between local and remote LLMs, using a composite reward that penalizes leakage while rewarding task success. The approach claims to implicitly distinguish replaceable PII (routed locally) from task-critical PII (routed remotely) without explicit supervision. A new high-PII medical dataset is introduced, and the framework is reported to achieve state-of-the-art results on the privacy-utility frontier, arguing for the necessity of learned adaptive policies over static rewriting pipelines.

Significance. If the empirical results, policy analysis, and ablations hold, the work would be significant for secure multi-LLM deployments in sensitive domains such as healthcare. It offers a principled RL formulation that preserves linguistic coherence and task-critical content while mitigating privacy risks, outperforming prior static approaches. The new medical dataset provides a useful benchmark resource for high-PII scenarios.

major comments (3)

[Abstract] Abstract: the central SOTA claim on the privacy-utility frontier is stated without any quantitative metrics, baseline comparisons, or result tables. This absence prevents assessment of whether the reported performance actually supports the necessity of learned adaptive policies.
[Methods] Methods/RL formulation: no explicit definition of the reward function (weighting between leakage penalty and task success), state representation over text chunks, action space, or training procedure (e.g., algorithm, hyperparameters) is provided. These elements are load-bearing for verifying whether the policy can solve the credit-assignment problem for implicit PII distinction.
[Experiments] Experiments: the manuscript reports no policy inspection, reward-component ablations, or oracle-routing comparisons. Without these, it remains unclear whether routing decisions are driven by task relevance rather than superficial cues, directly undermining the claim that the RL agent implicitly learns the replaceable vs. task-critical PII distinction.

minor comments (2)

[Abstract] The GitHub link for the dataset is given but should include explicit documentation on data collection, PII annotation process, and train/test splits to support reproducibility.
[Methods] Notation for the sequential decision process (states, actions, rewards) should be formalized with equations early in the methods section for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough and constructive feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.

Point-by-point responses

Referee: [Abstract] Abstract: the central SOTA claim on the privacy-utility frontier is stated without any quantitative metrics, baseline comparisons, or result tables. This absence prevents assessment of whether the reported performance actually supports the necessity of learned adaptive policies.

Authors: We agree that the abstract would be strengthened by including quantitative support for the SOTA claim. In the revised manuscript we will add specific metrics (e.g., privacy leakage rates, task accuracy scores, and frontier comparisons versus baselines) to the abstract so readers can directly evaluate the performance gains of the learned policy.
revision: yes

Referee: [Methods] Methods/RL formulation: no explicit definition of the reward function (weighting between leakage penalty and task success), state representation over text chunks, action space, or training procedure (e.g., algorithm, hyperparameters) is provided. These elements are load-bearing for verifying whether the policy can solve the credit-assignment problem for implicit PII distinction.

Authors: We acknowledge that the current Methods section gives a high-level overview rather than fully explicit definitions. We will expand this section to provide the precise reward function (including weighting coefficients between leakage penalty and task success), the state representation for each text chunk, the binary action space, and the full training procedure with algorithm and hyperparameters. These additions will enable verification of credit assignment for implicit PII distinction.
revision: yes

Referee: [Experiments] Experiments: the manuscript reports no policy inspection, reward-component ablations, or oracle-routing comparisons. Without these, it remains unclear whether routing decisions are driven by task relevance rather than superficial cues, directly undermining the claim that the RL agent implicitly learns the replaceable vs. task-critical PII distinction.

Authors: We agree that the current experimental section lacks these supporting analyses. In the revision we will add policy inspection (example decisions and visualizations), reward-component ablations, and oracle-routing comparisons. These results will directly test whether routing is driven by task relevance rather than superficial cues, thereby strengthening the claim of implicit PII distinction.
revision: yes
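An oracle-routing comparison of the kind discussed in this exchange could look like the sketch below. It is an assumption-laden illustration: per-chunk gain and leakage are treated as known and additive, which holds only in a toy setting, the reward form is the one quoted elsewhere on this page, and every function name is hypothetical. The oracle enumerates all routings, so it is practical only for queries with a handful of chunks.

```python
from itertools import product
from typing import Sequence, Tuple


def episode_reward(gains: Sequence[float], leaks: Sequence[float],
                   actions: Sequence[int], lam: float) -> float:
    """R = TaskGain - lam * PrivacyLeak**2 over chunks with action 1 (remote)."""
    gain = sum(g for g, a in zip(gains, actions) if a)
    leak = sum(l for l, a in zip(leaks, actions) if a)
    return gain - lam * leak ** 2


def oracle_routing(gains: Sequence[float], leaks: Sequence[float],
                   lam: float) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively search all 2**n routings and return the best one."""
    best = max(product((0, 1), repeat=len(gains)),
               key=lambda acts: episode_reward(gains, leaks, acts, lam))
    return best, episode_reward(gains, leaks, best, lam)


def oracle_gap(gains: Sequence[float], leaks: Sequence[float],
               policy_actions: Sequence[int], lam: float) -> float:
    """How much reward a policy's routing leaves on the table (0 is optimal)."""
    _, best_r = oracle_routing(gains, leaks, lam)
    return best_r - episode_reward(gains, leaks, policy_actions, lam)


# Made-up chunks: plain text, replaceable PII, task-critical PII.
gains, leaks, lam = [1.0, 0.1, 2.0], [0.0, 1.0, 1.0], 0.5
best, best_r = oracle_routing(gains, leaks, lam)
print(best, best_r)  # (1, 0, 1) 2.5
gap_send_all = oracle_gap(gains, leaks, (1, 1, 1), lam)
```

A gap near zero on held-out queries would suggest the learned policy routes about as well as an informed oracle. A large gap concentrated on task-critical chunks would point to routing driven by superficial cues.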

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper reformulates Privacy-Conscious Delegation as a sequential decision-making problem and introduces Privacy-R1 as an RL framework that trains an agent to route text chunks via a composite reward. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the abstract or description. The central claims rest on empirical SOTA results and a new dataset rather than any mathematical derivation that reduces to its own inputs by construction. The implicit PII distinction via reward is an untested modeling assumption, not a circular reduction. This is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The abstract supplies limited technical detail; the ledger therefore records only the explicitly stated reformulation and the implied reward-tuning parameters typical of RL methods.

free parameters (1)

privacy-utility reward weights
The RL training must balance a privacy-leakage penalty against task-performance reward; these weights are free parameters chosen or fitted during policy optimization.
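A short worked step shows how such a weight shapes routing. It takes the reward form quoted elsewhere on this page, R = TaskGain − λ·(PrivacyLeak)², and adds the assumption (not stated in the source) that gain and leakage are additive over chunks, with hypothetical per-chunk values g_i and ℓ_i and with L denoting the leakage already incurred by earlier remote chunks.

```latex
\begin{aligned}
R &= \text{TaskGain} - \lambda\,(\text{PrivacyLeak})^{2},\\
\Delta R_i &= g_i - \lambda\left[(L + \ell_i)^{2} - L^{2}\right]
            = g_i - \lambda\,\ell_i\,(2L + \ell_i),\\
\text{send chunk } i \text{ remote} &\iff g_i > \lambda\,\ell_i\,(2L + \ell_i).
\end{aligned}
```

Under these assumptions a larger λ raises the bar every chunk must clear, and the quadratic penalty makes each additional leaked chunk costlier than the last, so only chunks with high task value relative to their leakage are worth sending.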

axioms (1)

domain assumption: The privacy-utility trade-off can be modeled as a sequential decision process over text chunks.
The abstract explicitly states that the challenge is reformulated as a sequential decision-making problem solved by RL.

pith-pipeline@v0.9.0 · 5786 in / 1339 out tokens · 53392 ms · 2026-05-18T05:46:16.103592+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

Our framework trains an agent to dynamically route text chunks, learning a policy that optimally balances the trade-off between privacy leakage and task performance. It implicitly distinguishes between replaceable Personally Identifiable Information (PII) ... R=TaskGain−λ·(PrivacyLeak)²

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems
cs.CR 2026-04 conditional novelty 8.0 partial

SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using ...

FINER-SQL: Boosting Small Language Models for Text-to-SQL
cs.DB 2026-05 unverdicted novelty 6.0

FINER-SQL boosts 3B-parameter small language models to 67.73% and 85% execution accuracy on BIRD and Spider benchmarks via dense memory and atomic rewards in group relative policy optimization, matching larger LLMs at...