arxiv: 2606.04141 · v1 · pith:MSWZQPZAnew · submitted 2026-06-02 · 💻 cs.CR · cs.AI

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

Pith reviewed 2026-06-28 09:14 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords LLM agents · credential exfiltration · prompt injection · activation probes · honeytokens · multi-turn detection · information flow

The pith

Activation features separate credential-seeking prompts from benign ones with high accuracy before any output tokens are emitted.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how LLM agents risk leaking credentials when sensitive data shares a context window with untrusted retrieved content, enabling indirect prompt injection attacks. It explores three defenses: activation probes to spot credential access prior to output, honeytokens calibrated via split conformal prediction, and cumulative leakage accounting across conversation turns. Controlled experiments on open-weight models demonstrate that activation features distinguish the two prompt types reliably, even under held-out encoding transformations, while the multi-turn method catches attacks that per-turn checks miss. The work positions these internal and temporal signals as supplements to text-level output filters.

Core claim

In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss.

What carries the argument

Activation probes that monitor internal model states to detect credential access prior to token emission, combined with cumulative leakage budget tracking across turns.
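
The probing idea can be illustrated with a minimal sketch. It assumes synthetic Gaussian vectors stand in for hidden-state features read from the prompt before generation; this page does not state the paper's layers, feature construction, or classifier, so `train_probe`, the dimensions, and the class shift are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(X, y, lr=0.1, steps=500):
    """Fit a linear (logistic-regression) probe with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y          # gradient of the log-loss w.r.t. the logit
        w -= lr * (X.T @ err) / len(y)
        b -= lr * err.mean()
    return w, b

rng = np.random.default_rng(0)
dim, n_per_class = 16, 200

# ASSUMPTION: synthetic stand-ins for activations. Credential-seeking prompts
# are modelled as benign activations shifted along one direction.
shift = np.zeros(dim)
shift[0] = 3.0
benign = rng.normal(size=(n_per_class, dim))
seeking = rng.normal(size=(n_per_class, dim)) + shift

X = np.vstack([benign, seeking])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Hold out 100 examples for evaluation.
order = rng.permutation(len(y))
train_idx, test_idx = order[:300], order[300:]
mean = X[train_idx].mean(axis=0)              # centre features using training data only

w, b = train_probe(X[train_idx] - mean, y[train_idx])
flagged = sigmoid((X[test_idx] - mean) @ w + b) > 0.5
accuracy = float((flagged == y[test_idx]).mean())
print(f"held-out probe accuracy on synthetic data: {accuracy:.2f}")
```

Because the probe scores features computed from the prompt alone, a monitor of this kind could raise a flag before any output token is emitted. High accuracy on the synthetic data here says nothing about real activations.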

If this is right

  • Pre-output activation monitoring can identify exfiltration attempts before any tokens are generated.
  • Cumulative leakage tracking across turns catches multi-turn attacks that single-turn detectors overlook.
  • Calibrated honeytoken detection provides a practical signal that complements activation-based methods.
  • Credential-exfiltration defenses benefit from combining internal monitoring, canary calibration, and temporal accounting rather than output filters alone.
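
The second bullet can be made concrete with a small sketch of cumulative accounting. The per-turn leakage values, the per-turn threshold, and the budget are hypothetical numbers chosen for illustration, and `LeakageBudget` is an invented name, not the paper's estimator.

```python
from dataclasses import dataclass, field

@dataclass
class LeakageBudget:
    """Running total of estimated leakage (in bits) across conversation turns."""
    budget_bits: float
    spent_bits: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, turn_bits: float) -> bool:
        """Add one turn's estimate; return True once the budget is exceeded."""
        self.spent_bits += max(0.0, turn_bits)
        self.history.append(self.spent_bits)
        return self.spent_bits > self.budget_bits

# ASSUMPTION: illustrative per-turn leakage estimates, in bits.
trace = [0.5, 0.75, 0.5, 1.0, 0.75]
PER_TURN_THRESHOLD = 2.0   # hypothetical per-turn detector threshold

per_turn_alerts = [bits > PER_TURN_THRESHOLD for bits in trace]

tracker = LeakageBudget(budget_bits=3.0)
cumulative_alerts = [tracker.observe(bits) for bits in trace]
first_block_turn = (
    cumulative_alerts.index(True) + 1 if any(cumulative_alerts) else None
)

print("per-turn alerts:", per_turn_alerts)     # no single turn crosses 2.0 bits
print("cumulative totals:", tracker.history)   # [0.5, 1.25, 1.75, 2.75, 3.5]
print("first blocked turn:", first_block_turn) # 5
```

In this trace every turn stays under the per-turn threshold, yet the running total crosses the 3.0-bit budget on turn 5. That is the kind of slow leak a stateless detector would miss.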

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may generalize to detecting other forms of sensitive data leakage if similar activation signatures appear for different data types.
  • Without white-box access, the method would require approximations or surrogate models to apply to closed-source LLMs.
  • Extending the synthetic multi-turn benchmark to larger, more varied datasets could reveal whether cumulative accounting scales beyond the small in-house suite.

Load-bearing premise

That activation patterns observed in controlled open-weight model experiments will reliably indicate credential access in diverse real-world prompt-injection scenarios without high false-positive rates or the need for white-box access.

What would settle it

Testing the activation probe on a new set of real-world multi-turn conversations involving actual prompt injections and checking whether separation accuracy stays high without a sharp rise in false positives.

Figures

Figures reproduced from arXiv: 2606.04141 by Kargi Chauhan, Pratibha Revankar.

Figure 1
Figure 1: AIS prototype architecture. DP-HONEY injects calibrated honeytokens; CIFT monitors activation features before output; text-level canary detection provides a deterministic backstop; NIMBUS tracks an estimated cumulative leakage score across turns.
… the attacker can read hidden monitor state. Out of scope. We do not address a malicious model provider, compromised runtime, exfiltration through external side…
Figure 2
Figure 2: CIFT layer analysis on Qwen-7B using readout-position features. Late-layer activation deviations are more predictive than early-layer deviations. We interpret this as evidence of a useful credential-access signal, not as complete mechanistic identification.
Figure 3
Figure 3: Detection F1 across held-out evasion strategies. Text-level detectors degrade under several encodings; CIFT remains stable in this controlled setting because it measures internal features before output rendering. [Plot: "Encoding robustness: text-level detectors vs.…"; x-axis: Verbatim, Base64, Hex, ROT13, Leet, Paraphrase, Reverse, Per-turn fragment; y-axis: Detection F1, 0.0 to 1.0.]
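
A short sketch shows why a text-level detector can degrade under the encodings named in Figure 3. The canary string and the exact-substring detector are hypothetical; real output filters may decode or normalise text before matching, which this example does not model.

```python
import base64
import codecs

# ASSUMPTION: an obviously fake canary value used only for illustration.
CANARY = "AKIAFAKEEXAMPLE12345"

def text_level_detector(output: str) -> bool:
    """Flag the output only if the canary appears verbatim."""
    return CANARY in output

# A subset of the evasion strategies listed in Figure 3.
encodings = {
    "verbatim": lambda s: s,
    "base64": lambda s: base64.b64encode(s.encode()).decode(),
    "hex": lambda s: s.encode().hex(),
    "rot13": lambda s: codecs.encode(s, "rot_13"),
    "reverse": lambda s: s[::-1],
}

results = {
    name: text_level_detector(f"Here you go: {encode(CANARY)}")
    for name, encode in encodings.items()
}

for name, detected in results.items():
    print(f"{name:8s} detected={detected}")
```

Only the verbatim case is caught. Each transformation changes the surface string while carrying the same secret, which is the gap a pre-output internal signal is meant to cover.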
Figure 4
Figure 4: NIMBUS on a representative synthetic trace. Per-turn leakage scores remain small, while cumulative accounting eventually triggers intervention.
Figure 5
Figure 5: NIMBUS budget sensitivity. Smaller budgets improve detection but may increase false blocks; larger budgets preserve utility but miss short or low-rate leaks. [Plot: "NIMBUS budget sensitivity"; x-axis: Budget B (bits), log scale; y-axis: Rate, 0.0 to 1.0; series: Detection rate, False-block rate, Utility preserv.; marker: Recommended.]
6. Limitations. CIFT requires white-box activation access, which excludes many API-served models. Cross-model transfer is untested. The current NIMBUS estimator is a learned lower-bound signal, not a certified upper bound on leakage, and its InfoNCE c…
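
The budget trade-off described in the Figure 5 caption can be sketched with a toy sweep. All traces and budget values below are invented for illustration and are not the paper's data; `blocked` and `sweep` are hypothetical helpers.

```python
from itertools import accumulate

# ASSUMPTION: hypothetical per-turn leakage estimates (bits) for toy traces.
attack_traces = [
    [0.5, 0.75, 0.5, 1.0, 0.75],
    [1.0, 1.0, 1.5],
    [0.25, 0.25, 0.5, 0.5, 0.5, 0.5],
]
benign_traces = [
    [0.0, 0.25, 0.0],
    [0.5, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.0, 0.5, 1.0],
]

def blocked(trace, budget):
    """True if the running total exceeds the budget at any turn."""
    return any(total > budget for total in accumulate(trace))

def sweep(budgets):
    """Map each budget to (detection rate, false-block rate)."""
    rates = {}
    for budget in budgets:
        detection = sum(blocked(t, budget) for t in attack_traces) / len(attack_traces)
        false_block = sum(blocked(t, budget) for t in benign_traces) / len(benign_traces)
        rates[budget] = (detection, false_block)
    return rates

rates = sweep([1.0, 2.0, 4.0])
for budget, (detection, false_block) in rates.items():
    print(f"B={budget}: detection={detection:.2f}, false-block={false_block:.2f}")
```

On these toy traces the smallest budget catches every attack but also blocks one benign conversation, the middle budget separates the two sets, and the largest budget misses every attack. This mirrors the qualitative shape the caption describes.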

LLM agents often place sensitive credentials in the same context window as untrusted retrieved content, creating a direct path for indirect prompt injection to induce credential exfiltration. We study this failure mode through three complementary defenses. First, we ask whether activation probes can detect credential access before output tokens are emitted. Second, we construct honeytokens from format-specific character models and calibrate detection with split conformal prediction. Third, we treat multi-turn exfiltration as a cumulative information-flow problem and track an estimated leakage budget across conversation turns. In controlled experiments on open-weight models, activation features separate benign and credential-seeking prompts with high accuracy, including under held-out encoding transformations. In a small synthetic multi-turn suite, cumulative accounting detects attacks that per-turn detectors miss. These results are preliminary: the multi-turn benchmark is in-house and small, the activation method requires white-box access, and the information estimator provides a practical signal rather than a formal upper bound. Still, the results suggest that credential-exfiltration defenses should combine pre-output monitoring, calibrated canary detection, and temporal leakage accounting rather than relying only on text-level output filters.
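
To illustrate the second defense, here is a minimal sketch of drawing a honeytoken from a format-specific character model. It assumes a simple per-position frequency model fitted to a handful of made-up example strings. The abstract does not specify the paper's character models, so `fit_position_model`, `sample_token`, and the `demokey_` format are hypothetical.

```python
import random
from collections import Counter

def fit_position_model(examples):
    """Count which characters occur at each position of a fixed-length format."""
    length = len(examples[0])
    if any(len(e) != length for e in examples):
        raise ValueError("all examples must share one length")
    return [Counter(e[i] for e in examples) for i in range(length)]

def sample_token(model, rng):
    """Sample one character per position, weighted by observed frequency."""
    chars = []
    for counts in model:
        symbols = list(counts.keys())
        weights = list(counts.values())
        chars.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(chars)

# ASSUMPTION: fabricated example strings in an invented credential format.
examples = ["demokey_A1b2", "demokey_9Zx0", "demokey_Qw3r", "demokey_7Yu8"]

model = fit_position_model(examples)
honeytoken = sample_token(model, random.Random(0))
print(honeytoken)
```

Positions that never vary in the examples (the prefix) are reproduced exactly, while variable positions are resampled. The decoy therefore matches the format without copying any single example by construction.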

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes three complementary defenses against credential exfiltration in LLM agents via indirect prompt injection: (1) activation probes to detect credential access before output tokens are generated, (2) format-specific honeytokens calibrated via split conformal prediction, and (3) cumulative leakage-budget tracking across multi-turn conversations. In controlled experiments on open-weight models, activation features are reported to separate benign and credential-seeking prompts with high accuracy (including under held-out encodings); a small synthetic multi-turn suite shows cumulative accounting catching attacks missed by per-turn detectors. The work explicitly labels its results as preliminary due to the in-house benchmark size, white-box requirement, and lack of formal bounds.

Significance. If the activation-separation and cumulative-accounting results generalize beyond the controlled open-weight setting and small synthetic benchmark, the work would provide a concrete path toward pre-output and temporal monitoring that complements text-level filters. The combination of internal-state probes with calibrated canaries and information-flow accounting is a distinctive framing that could influence agent-security toolkits, particularly if accompanied by reproducible code or larger-scale validation.

major comments (3)
  1. [Abstract] Abstract: the claim that 'activation features separate benign and credential-seeking prompts with high accuracy' is load-bearing for the central empirical contribution, yet the abstract (and the provided description) supplies no accuracy numbers, dataset sizes, error bars, or methodology details; this prevents assessment of whether the separation is practically useful or merely preliminary.
  2. [Abstract] Abstract and multi-turn section: the cumulative-accounting result is presented as detecting attacks missed by per-turn detectors, but the benchmark is explicitly 'small' and 'in-house'; without size, composition, or baseline comparisons, it is unclear whether this supports the broader claim that temporal leakage accounting is a necessary complement.
  3. [Abstract] Abstract: the activation method is stated to require white-box access and the multi-turn results are labeled preliminary; these caveats are appropriate but indicate that the load-bearing generalization claim (real-world prompt-injection scenarios without high false positives) rests on untested assumptions about distribution shift and the access model.
minor comments (1)
  1. [Abstract] The abstract mentions 'split conformal prediction' and 'estimated leakage budget' without defining the calibration procedure or the precise estimator; adding a short methods paragraph would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires additional quantitative detail and clearer scoping to allow proper assessment of the claims. We address each major comment below and will revise the manuscript accordingly.

  1. Referee: [Abstract] Abstract: the claim that 'activation features separate benign and credential-seeking prompts with high accuracy' is load-bearing for the central empirical contribution, yet the abstract (and the provided description) supplies no accuracy numbers, dataset sizes, error bars, or methodology details; this prevents assessment of whether the separation is practically useful or merely preliminary.

    Authors: We agree the abstract should be self-contained on this point. The main text (experiments section) reports the separation results including accuracy, dataset sizes, and held-out encoding tests, but these specifics are not summarized in the abstract. We will revise the abstract to include the key quantitative figures, dataset scale, and brief methodology note while retaining the preliminary framing. revision: yes

  2. Referee: [Abstract] Abstract and multi-turn section: the cumulative-accounting result is presented as detecting attacks missed by per-turn detectors, but the benchmark is explicitly 'small' and 'in-house'; without size, composition, or baseline comparisons, it is unclear whether this supports the broader claim that temporal leakage accounting is a necessary complement.

    Authors: We acknowledge the benchmark limitations are already flagged but agree more detail is needed for evaluation. We will expand the abstract and multi-turn section to state the exact suite size and composition, and add explicit per-turn baseline comparisons showing which attacks were caught only by cumulative tracking. revision: yes

  3. Referee: [Abstract] Abstract: the activation method is stated to require white-box access and the multi-turn results are labeled preliminary; these caveats are appropriate but indicate that the load-bearing generalization claim (real-world prompt-injection scenarios without high false positives) rests on untested assumptions about distribution shift and the access model.

    Authors: The manuscript already qualifies the results as preliminary and notes the white-box requirement precisely to avoid overclaiming generalization. We do not assert real-world performance or robustness to distribution shift; the contribution is the controlled-setting evidence and the combined defense framing. We will add a sentence in the abstract explicitly noting that broader generalization remains untested. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical detection methods rest on controlled experiments, not self-referential definitions or fitted predictions.

The paper describes three empirical defenses—activation probes for pre-output detection, honeytoken calibration via split conformal prediction, and cumulative leakage accounting—evaluated on open-weight models with held-out encodings and a small synthetic multi-turn suite. No equations, derivations, or claims reduce a 'prediction' to a fitted parameter by construction, nor do any load-bearing steps rely on self-citations or uniqueness theorems imported from the authors' prior work. The abstract and methods explicitly frame results as preliminary experimental observations with stated limitations (white-box access, small benchmark), keeping the central claims independent of any circular reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review provides insufficient detail to enumerate specific free parameters, axioms, or invented entities; conformal prediction calibration is mentioned but no explicit fitted values or background assumptions are stated.

free parameters (1)
  • conformal prediction calibration parameters
    Used to set detection thresholds for honeytoken-based canary detection.
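
A minimal sketch of how a split conformal threshold could be set, assuming a scalar detection score and a held-out calibration set of benign scores. The score distribution, the miscoverage level `alpha`, and the function name are hypothetical; the page does not describe the paper's calibration procedure.

```python
import math
import numpy as np

def conformal_threshold(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Flagging scores above this value keeps the false-positive rate on
    exchangeable benign inputs at or below alpha (a marginal guarantee).
    """
    scores = np.sort(np.asarray(calibration_scores, dtype=float))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")   # too few calibration points for this alpha
    return float(scores[k - 1])

rng = np.random.default_rng(1)

# ASSUMPTION: benign detection scores drawn from a standard normal.
calibration = rng.normal(0.0, 1.0, size=999)
threshold = conformal_threshold(calibration, alpha=0.05)

# Check the empirical false-positive rate on fresh benign scores.
fresh_benign = rng.normal(0.0, 1.0, size=5000)
false_positive_rate = float((fresh_benign > threshold).mean())

print(f"threshold={threshold:.3f}, empirical FPR={false_positive_rate:.3f}")
```

The only fitted quantity is the threshold itself, which depends on the calibration set and the chosen `alpha`. With very small calibration sets the rule returns an infinite threshold rather than a guarantee it cannot support.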

pith-pipeline@v0.9.1-grok · 5733 in / 1363 out tokens · 36731 ms · 2026-06-28T09:14:17.120994+00:00 · methodology

