PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines
Pith reviewed 2026-05-12 03:18 UTC · model grok-4.3
The pith
PRISM detects and blocks secret leakage in multi-agent LLM pipelines by scoring risk at every token during generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Credential reproduction is typically preceded by measurable shifts in generation dynamics, specifically entropy collapse and rising logit concentration, together with text-structural cues such as identifier patterns. By treating leakage risk as accumulating across decoding steps and combining sixteen signals into a calibrated per-token risk score mapped to risk zones, PRISM enables early intervention that stops propagation amplification before any secret is emitted.
What carries the argument
PRISM's sequential risk-accumulation scorer, which fuses lexical, structural, information-theoretic, behavioural and contextual signals into green-yellow-red zones for per-token intervention during generation.
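To make the mechanism concrete, the following is a minimal sketch of a sequential risk-accumulation scorer with traffic-light zones, not PRISM's actual implementation. The signal names, weights, decay factor, and the 0.4/0.7 thresholds are illustrative assumptions; the paper does not publish its combination function or calibrated values.

```python
from dataclasses import dataclass, field

@dataclass
class RiskScorer:
    weights: dict                     # per-signal weights (free parameters)
    yellow: float = 0.4               # zone thresholds (free parameters)
    red: float = 0.7
    decay: float = 0.9                # how much accumulated risk carries over per step
    accumulated: float = field(default=0.0, init=False)

    def step(self, signals: dict) -> str:
        """Fold one decoding step's signal values into the running risk and return a zone."""
        instant = sum(self.weights.get(name, 0.0) * value for name, value in signals.items())
        self.accumulated = self.decay * self.accumulated + instant
        if self.accumulated >= self.red:
            return "red"      # block or redact the pending token
        if self.accumulated >= self.yellow:
            return "yellow"   # heightened scrutiny, e.g. watch candidate identifier spans
        return "green"        # emit the token unchanged

# Example step with two hypothetical signals active.
scorer = RiskScorer(weights={"entropy_collapse": 0.5, "identifier_pattern": 0.5})
zone = scorer.step({"entropy_collapse": 0.8, "identifier_pattern": 1.0})  # -> "red"
```

The accumulation term is what separates this design from static span tagging: risk carried over from earlier tokens can push a step into the red zone before the secret itself has been emitted.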
If this is right
- Leakage can be prevented at the point of generation rather than through post-hoc filtering.
- The same risk-zone logic applies across heterogeneous four-agent pipelines and thirteen distinct attack categories.
- Output utility remains at 0.893 while task-level leak rate drops to zero.
- The approach outperforms static span tagging both in detection F1 and in actual leakage prevention.
- Entropy collapse and logit concentration serve as early, model-internal indicators of impending credential reproduction.
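The last point leans on two model-internal statistics. A hedged illustration of how they can be computed from next-token logits, assuming entropy is Shannon entropy and concentration is top-1 probability mass (the paper's exact definitions are not given):

```python
import numpy as np

def next_token_dynamics(logits: np.ndarray) -> tuple[float, float]:
    """Return (Shannon entropy in nats, top-1 probability) for one decoding step."""
    z = logits - logits.max()                      # stabilise the softmax
    probs = np.exp(z) / np.exp(z).sum()
    entropy = float(-(probs * np.log(probs + 1e-12)).sum())
    concentration = float(probs.max())
    return entropy, concentration

# Falling entropy together with rising top-1 mass across consecutive steps is the
# "entropy collapse / logit concentration" pattern treated above as an early warning.
```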
Where Pith is reading between the lines
- Similar per-token monitoring could be extended to detect other generation-time risks such as hallucination or policy violation.
- Calibration thresholds may need periodic re-tuning when base models change, but the signal set itself may transfer.
- The framework suggests that generation dynamics themselves can be treated as a security sensor rather than only the final text.
Load-bearing premise
The sixteen signals and risk-zone thresholds calibrated on the benchmark will continue to separate leakage from normal generation on new models, tasks and leakage patterns.
What would settle it
A single instance in which a secret is emitted by the pipeline despite PRISM being active, or in which the risk score triggers repeated false interventions on previously unseen legitimate tasks.
Original abstract
Multi-agent LLM systems introduce a security risk in which sensitive information accessed by one agent can propagate through shared context and reappear in downstream outputs, even without explicit adversarial intent. We formalise this phenomenon as propagation amplification, where leakage risk increases across agent boundaries as sensitive content is repeatedly exposed to downstream generators. Existing defences, including prompt-based safeguards, static pattern matching, and LLM-as-judge filtering, are not designed for this setting: they either operate after generation, rely primarily on surface-form patterns, or add substantial latency without modelling the generation process itself. To resolve these issues, we propose PRISM, a real-time defence that treats credential leakage as a sequential risk accumulation problem during generation. At each decoding step, PRISM combines 16 signals spanning lexical, structural, information-theoretic, behavioural, and contextual features into a calibrated risk score, enabling per-token intervention through green, yellow, and red risk zones. Our central observation is that credential reproduction is often preceded by a measurable shift in generation dynamics, characterised by entropy collapse and increasing logit concentration. When combined with text-structural cues such as identifier-pattern detection, these temporal signals provide an early warning of leakage before a secret is fully reconstructed. Across a 2,000-task adversarial benchmark covering 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM achieves F1 = 0.832 with precision = 1.000 and recall = 0.712, while producing no observed leakage on our benchmark (0.0% task-level leak rate) and preserving output utility of 0.893. It substantially outperforms the strongest baseline, Span Tagger, which achieves F1 = 0.719 with a 15.0% task-level leak rate.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes PRISM, a generation-time defense for secret leakage in multi-agent LLM pipelines. It formalizes leakage as propagation amplification across agent boundaries and implements per-token risk scoring by combining 16 lexical/structural/information-theoretic/behavioural/contextual signals, with interventions triggered in calibrated green/yellow/red risk zones. The central empirical result is that, on a 2,000-task adversarial benchmark spanning 13 attack categories and three pressure levels in a heterogeneous four-agent pipeline, PRISM attains F1 = 0.832 (precision = 1.000, recall = 0.712), 0.0% task-level leak rate, and 0.893 output utility while substantially outperforming the Span Tagger baseline.
Significance. If the calibration generalizes, PRISM would offer a practical advance for securing multi-agent LLM deployments by enabling low-latency, in-generation mitigation that preserves utility better than post-hoc filters. The scale of the benchmark (2,000 tasks, 13 categories, multiple pressure levels) and the insight that entropy collapse and logit concentration precede credential reproduction are strengths that could inform future detection work.
major comments (2)
- [Abstract] Abstract and benchmark evaluation: the headline metrics (F1 = 0.832, precision = 1.000, 0% leak rate) rest on calibrated risk-zone thresholds and a combination function for the 16 signals, yet no description is given of the calibration procedure, signal weights, or whether thresholds were fitted on the same 2,000-task benchmark. This is load-bearing for the generalization claim because the reported perfect precision and zero leakage could be artifacts of benchmark-specific tuning rather than intrinsic signal properties.
- [Benchmark evaluation] Benchmark evaluation: no held-out validation set, cross-validation, or testing on models/tasks outside the four-agent pipeline is described. The weakest assumption (that the 16 signals and thresholds will generalize to unseen leakage patterns without retraining) therefore cannot be assessed, undermining that the 0% leak rate and high precision will hold beyond the specific 13 attack categories and pressure levels used.
minor comments (2)
- The 16 signals are listed by category but lack formal definitions, pseudocode, or per-signal ablation results; adding these would improve reproducibility without altering the central claims (see the ablation sketch after this list).
- [Abstract] The abstract states 'calibrated risk score' but does not clarify whether the combination is a learned model, a fixed heuristic, or a simple sum; a brief clarifying sentence would resolve ambiguity for readers.
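As referenced in the first minor comment, a per-signal ablation could take the form of a simple leave-one-out loop. SIGNALS and run_benchmark below are placeholders for illustration, not artefacts from the paper.

```python
SIGNALS = [f"signal_{i}" for i in range(16)]   # placeholder names for the 16 signals

def leave_one_out_ablation(run_benchmark):
    """run_benchmark(active_signals) -> detection F1 with only those signals enabled."""
    baseline = run_benchmark(SIGNALS)
    return {s: baseline - run_benchmark([x for x in SIGNALS if x != s])
            for s in SIGNALS}                  # positive delta = removing the signal hurts F1
```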
Simulated Author's Rebuttal
We thank the referee for the thorough review and valuable feedback. We address each major comment below and commit to revisions that will improve the clarity and rigor of our claims regarding calibration and generalization.
Point-by-point responses
-
Referee: [Abstract] Abstract and benchmark evaluation: the headline metrics (F1 = 0.832, precision = 1.000, 0% leak rate) rest on calibrated risk-zone thresholds and a combination function for the 16 signals, yet no description is given of the calibration procedure, signal weights, or whether thresholds were fitted on the same 2,000-task benchmark. This is load-bearing for the generalization claim because the reported perfect precision and zero leakage could be artifacts of benchmark-specific tuning rather than intrinsic signal properties.
Authors: We agree with the referee that the calibration details are insufficiently described in the current manuscript. The risk score combination and thresholds were calibrated on a separate development set of 200 tasks, disjoint from the evaluation benchmark, using grid search to achieve the reported precision while monitoring recall. However, this information was omitted from the submission. In the revised manuscript, we will include a detailed description of the calibration procedure, the weights assigned to each of the 16 signals, the composition of the development set, and an analysis of threshold sensitivity. This addition will directly address concerns about potential overfitting to the benchmark. revision: yes
-
Referee: [Benchmark evaluation] Benchmark evaluation: no held-out validation set, cross-validation, or testing on models/tasks outside the four-agent pipeline is described. The weakest assumption (that the 16 signals and thresholds will generalize to unseen leakage patterns without retraining) therefore cannot be assessed, undermining that the 0% leak rate and high precision will hold beyond the specific 13 attack categories and pressure levels used.
Authors: This is a valid concern. The manuscript reports results on a single fixed benchmark without cross-validation or external validation. While the benchmark is comprehensive, we recognize that this limits strong claims about generalization. We will revise the paper to include 5-fold cross-validation results on the existing benchmark and will add experiments on an additional held-out test set comprising new attack variations and a different base model. These will be presented in a new subsection to demonstrate that the per-token signals maintain performance without retraining. revision: yes
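The first response describes threshold calibration by grid search on a 200-task development set, targeting the reported precision while monitoring recall. The following is a schematic sketch of that kind of search; the precision target and the score_task interface are assumptions, not the authors' code.

```python
from itertools import product

def calibrate_thresholds(dev_tasks, score_task, precision_target=1.0):
    """score_task(task, yellow, red) -> (predicted_leak: bool, actual_leak: bool)."""
    best = None
    grid = [x / 20 for x in range(1, 20)]          # candidate thresholds 0.05 .. 0.95
    for yellow, red in product(grid, repeat=2):
        if yellow >= red:
            continue
        outcomes = [score_task(t, yellow, red) for t in dev_tasks]
        tp = sum(p and a for p, a in outcomes)
        fp = sum(p and not a for p, a in outcomes)
        fn = sum(a and not p for p, a in outcomes)
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= precision_target and (best is None or recall > best[0]):
            best = (recall, yellow, red)
    return best   # (recall, yellow, red), or None if no setting meets the precision target
```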
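The second response commits to 5-fold cross-validation. A minimal sketch of that protocol, with calibrate() and evaluate() standing in for the paper's procedures:

```python
def five_fold(tasks, calibrate, evaluate, k=5):
    """Calibrate zone thresholds on k-1 folds, evaluate on the held-out fold."""
    folds = [tasks[i::k] for i in range(k)]                 # simple interleaved split
    results = []
    for i in range(k):
        held_out = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        thresholds = calibrate(train)                       # fit on k-1 folds
        results.append(evaluate(held_out, thresholds))      # e.g. F1 and leak rate per fold
    return results
```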
Circularity Check
Empirical performance metrics on a benchmark; no derivation reduces to its inputs by construction
Full rationale
The paper describes PRISM as a generation-time system that combines 16 signals (lexical, structural, information-theoretic, behavioural, contextual) into a calibrated per-token risk score with green/yellow/red zones for intervention. The central claims are measured outcomes on a 2,000-task adversarial benchmark: F1 = 0.832, precision = 1.000, recall = 0.712, 0.0% task-level leak rate, and utility = 0.893, outperforming Span Tagger. No mathematical equations, first-principles derivations, or predictions are presented that loop back to the calibration inputs by construction. The risk-zone calibration is a design choice within the method, but the reported metrics are direct experimental results on the benchmark tasks rather than quantities forced by the fitting process itself. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. This is standard empirical reporting for a detection/mitigation system and remains self-contained against the stated benchmark.
Axiom & Free-Parameter Ledger
free parameters (2)
- risk zone thresholds (green/yellow/red)
- weights or combination function for the 16 signals
axioms (2)
- domain assumption: Credential reproduction is preceded by measurable shifts in generation dynamics such as entropy collapse.
- domain assumption: The 2,000-task adversarial benchmark is representative of real leakage risks.