arXiv: 2605.16035 · v1 · pith:6H3TGPIYnew · submitted 2026-05-15 · 💻 cs.CR · cs.AI · cs.MA

Who Owns This Agent? Tracing AI Agents Back to Their Owners

Pith reviewed 2026-05-20 17:21 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.MA
keywords AI agents · agent attribution · accountability · canary signals · adversarial robustness · vendor logs · cybersecurity

The pith

Authorized parties can link any observed AI agent behavior to the exact vendor account that launched it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines the agent attribution problem as the missing ability to connect an observed AI agent interaction to the responsible account at the model vendor. This gap leaves both accidental harm from misconfigured agents and deliberate misuse unaddressed, since observers cannot notify operators or stop sessions. The proposed solution uses a canary-based protocol: an authorized party inserts a detectable signal into the agent's interaction stream, after which the vendor searches a narrow window of its session logs to recover the originating account. Simple signals suffice against non-adversarial operators, while specially constructed robust canaries cannot be filtered or paraphrased away by an adversary without degrading the agent's own task performance.

Core claim

We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance.

What carries the argument

A canary injection protocol that inserts a detectable marker into the agent's interaction stream so the vendor can match it against session logs and recover the account.
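
The two steps named above (inject a marker, then search a narrow log window) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the `LogEntry` schema, the function names, and the 10-minute window are hypothetical, and the canary format simply borrows the `POST-{digits7}` pattern quoted in the Figure 24 caption. It covers only a simple lexical canary matched by exact substring, which is the non-adversarial case.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LogEntry:
    """One vendor-side log record (hypothetical schema, assumed for illustration)."""
    account_id: str
    session_id: str
    timestamp: datetime
    text: str


def make_lexical_canary(prefix: str = "POST-") -> str:
    """Return a random identifier-style canary: the prefix plus 7 random digits."""
    digits = "".join(secrets.choice("0123456789") for _ in range(7))
    return prefix + digits


def search_window(logs, canary, injected_at, window=timedelta(minutes=10)):
    """Return the (account_id, session_id) pairs whose logged text contains
    the canary within `window` after the injection time."""
    hits = set()
    for entry in logs:
        in_window = injected_at <= entry.timestamp <= injected_at + window
        if in_window and canary in entry.text:
            hits.add((entry.account_id, entry.session_id))
    return hits


# Usage with made-up log data: only the first entry is both inside the
# window and contains the canary.
t0 = datetime(2026, 1, 1, 12, 0, 0)
canary = make_lexical_canary()
logs = [
    LogEntry("acct-A", "sess-1", t0 + timedelta(minutes=2), f"see {canary} for details"),
    LogEntry("acct-B", "sess-2", t0 + timedelta(minutes=3), "unrelated traffic"),
    LogEntry("acct-C", "sess-3", t0 + timedelta(hours=5), f"late mention of {canary}"),
]
matches = search_window(logs, canary, t0)
```

Keeping the window short bounds how many log records the vendor has to scan, which is why the summary stresses a narrow window; the robust (paraphrase-resistant) canaries would need a detector rather than a substring test.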

If this is right

  • Vendors can identify and terminate sessions that produce harmful agent behavior.
  • Operators of misconfigured agents can be notified so they can correct or withdraw them.
  • The method works at scale with existing real-world agents without requiring major changes to agent code.
  • Adversarial operators face a performance penalty if they attempt to evade attribution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Industry-wide log retention policies could make attribution requests routine and reliable.
  • The same injection approach might extend to agents that call multiple models from different vendors.
  • Regulators could require vendors to support attribution queries as a condition for hosting autonomous agents.
  • Broader testing across more agent tasks would clarify how often robust canaries force meaningful performance trade-offs.

Load-bearing premise

Vendors must maintain searchable session logs over a narrow time window and agree to search them when requested by an authorized party.

What would settle it

An adversarial agent that successfully removes or alters the injected canary from its inputs while retaining full performance on its assigned task would show the robust constructions do not create the claimed asymmetry.
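
The settling test above can be phrased as a check over three measured quantities. The sketch below is an assumption-laden illustration: the function name and both tolerances (`max_drop`, `max_survival`) are hypothetical, the solve rates in the first usage line are the K = 3 values quoted in the Figure 17 caption (49% non-adversarial, 29% adversarial), and the canary survival values are illustrative only.

```python
def evasion_is_free(solve_rate_clean, solve_rate_evading, canary_survival,
                    max_drop=0.02, max_survival=0.05):
    """Return True if the adversary suppresses the canary at almost no task cost.

    solve_rate_clean   -- task success rate without the evasion wrapper
    solve_rate_evading -- task success rate with the evasion wrapper
    canary_survival    -- fraction of injected canaries that still reach the logs
    The two thresholds are hypothetical tolerances, not values from the paper.
    """
    drop = solve_rate_clean - solve_rate_evading
    return canary_survival <= max_survival and drop <= max_drop


# A 20-point drop in solve rate means evasion is costly, so the asymmetry holds.
costly = evasion_is_free(0.49, 0.29, canary_survival=0.0)

# A made-up counterexample: the canary is removed and performance barely moves.
free = evasion_is_free(0.49, 0.48, canary_survival=0.01)
```

A result of `True` on a realistic agent would be the kind of evidence that undermines the claimed asymmetry.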

Figures

Figures reproduced from arXiv: 2605.16035 by Alina Oprea, Doron Jonathan Ben Chayim, Eyal Lenga, Gilad Gressel, Ruben Chocron, Yisroel Mirsky.

Figure 1: The novel problem of agent attribution introduced …
Figure 2: The space of agent-initiated harm. Unintentional …
Figure 3: Principals in our system model and their relation …
Figure 4: Overview of the agent attribution protocol. The red …
Figure 5: Left: A cyber agent consumes a lexical canary (file …
Figure 6: Attribution at 𝑘 = 20 on chat and forum. Both rates decay with 𝑚; the operating regime sits where the curves separate.
    … code that fits naturally into the scenario, such as a project name, case identifier, or other conversation-relevant value. We instantiate semantic canaries 𝜅𝑆 such as an authority, discourse, and persona patterns drawn from an N with 𝑛 = 31. These semantic canaries are inserted by a Qwen r…
Figure 7: Lexical canary survival rates across preprocessing …
Figure 8: HTML document attribution at 𝑘 = 5 embedded semantic canaries. TPR remains high through 𝑚 = 3, while FPR is near zero from 𝑚 = 2 onward. The FPR curve averages adversarial and non-adversarial regimes, whose values nearly overlap (max Δ ≈ 0.007).
    … the agent’s preprocessing pipeline, visible or semantically load-bearing canaries are therefore safer choices in this setting. For semantic canaries, we train a mu…
Figure 9: Lexical-canary location ablation. Top: agent solve rate with and without the canary-aware adversary. Bottom: per-location reach rate, i.e., the fraction of locations in which the planted canary survived into the raw tool output reaching the agent, with 95% Wilson confidence intervals.
    Adjacent plot: x-axis 𝑚 (minimum matching canaries); y-axis attribution rate; series TPR (non-adversarial), TPR (adversarial), FP…
Figure 10 reports session-level attribution for 𝐾 = 5 injected semantic canaries with 𝑛 = 45 runs per cell. At 𝑚 = 1, detection rate is 100% in both the non-adversarial and adversarial regimes. Under the adversarial wrapper, TPR remains above 90% through 𝑚 = 2, with FPR near zero throughout. The same set of experiments gives …
    Adjacent plot: axes solve rate (%) and canary …; series non-adversarial, aware adversarial
Figure 11: Per-canary ROC across the 31 semantic canaries …
Figure 13: HTML Lexical canary survival rates by type (visi…
Figure 42: E.3.3 CTF location-ablation per-location detail.
    Per-carrier reach numbers. Reach rates partition into a clear hierarchy under the canary-aware wrapper. Filenames reach the model at near-100% in both conditions and degrade least under the wrapper (100% non-adversarial, 77% adversarial), reflecting that paraphrasing a directory listing typically preserves filenames as quasi-identifiers. Visible body conte…
Figure 14: Per-canary semantic-detector ROC on HTML doc…
Figure 15: HTML document attribution at 𝑘 ∈ {1, 3, 7, 9, 11, 13} embedded semantic canaries. TPR and FPR calculated as a function of the attribution threshold 𝑚. Solid lines represent the non-adversarial; dotted lines represent the adversarial paraphrase wrapper. Across the board, TPR remains high for the first few values of 𝑚, while FPR drops to near zero starting from 𝑚 = 2.
    Adjacent plot: axis solve rate (%); series Non-ad…
Figure 17: CTF attribution at 𝐾 = 3 (left) and 𝐾 = 5 (right). Session-level TPR and FPR as a function of the detection threshold 𝑚. Solid lines are the non-adversarial regime; dotted lines are the adversarial paraphrase wrapper. Solve rates per cell: 𝐾 = 3 non-adversarial 49% vs. adversarial 29%; 𝐾 = 5 non-adversarial 42% vs. adversarial 31%.
    E.4 Detection Throughput Scaling. We measure canary search time as a func…
Figure 20: 𝐾𝐿 in HTML Code. Identifier: Fake ISBN
    <div class="elementor-container elementor-column-gap-default"> \n <div class="elementor-column elementor-col-50 elementor-inner-column elementor-element elementor-element-4c96124" data-element_type=" column" data-id="4c96124"> \n <div class="elementor-widget-wrap elementor-element -populated"> \n <div class="elementor-element elementor-element -78cb6760 elementor-wid…
Figure 18: Per-canary semantic-detector ROC on the CTF …
Figure 19: Canary Search Time vs. Database Size. Shaded …
Figure 21: 𝐾𝐿 in HTML Code. Identifier: Filename UUID
    <!DOCTYPE html> <html> <head><title>PetTracker</title> <meta name="csrf-token" content="0fe57e98-aa5a-418b-ab24-375719c2810d"></head > <body> <!-- build: 36edbe7c-de68-497f-a2e6-3f8991c2e794 --> <!-- TODO: remove old backup files like /index- e8ee359c-b796-49be-a2e6-3f8991c2e794 .html. bak before production --> <h1>PetTracker</h1> <p>Keep tabs on your furry frien…
Figure 24: 𝐾𝐿 in a Forum thread. Scenario: faq_bot_tech. Identifier: forum_post_id (pattern POST-{digits7}).
    [rogue-bot] Reminder: we have a pinned FAQ for router setup -- please check it before opening a new thread.
    [forum-user] I already searched and the issue I'm hitting isn't covered there. It's a duplicate of the discussion in POST-4829317 , which the mods linked from last week's megathread but never resolved. …
Figure 25: 𝐾𝑆 in HTML Code. Identifier: Urgency Signaling. The highlighted code segments illustrate a multi-layered implementation of consumer urgency. In the CSS, position: fixed and z-index: 1000 are used to ensure the notification maintains constant visual prominence, "pinning" the pressure tactic to the user’s viewport. The JavaScript logic establishes a temporal constraint through a 60-second timeLeft variable…
Figure 26: 𝐾𝑆 in a Chat conversation. Scenario: gym_membership (operator’s agent is a gym sales bot; user-side speaker is a prospective member). Identifier: parenthetical_aside (Linguistic category) – the user’s replies habitually drop in mid-sentence parenthetical asides, more often than would be natural. Highlighting marks the asides; the signal lives in the stylistic tic, not in any particular wording.
    [agent] Hi…
Figure 27: 𝐾𝑆 in HTML Code. Identifier: media_consumption_focus
    <article> ... <section> <h2>Media Showcase</h2> <video width="100%" height="auto" controls> <source src="promo-video.mp4" type="video/mp4"> Your browser does not support the video tag. </video> <p>Watch our overview video to see TaskFlow in action.</p> </section> ... </article>
Figure 31: Prompt used to weave a utility-bearing lexical …
Figure 29: Prompt used to inject a semantic canary 𝜅𝑆 from the canary universe N into user-side chat turns.
    Rephrase Prompt to Create Negative Adversarial Messaging Dataset
    Instruction: Rephrase the following messages naturally. Keep the meaning and tone, but rewrite each in your own words. Do not add any special patterns or styles.
    Messages to rephrase: {user_messages_text}
    Output: Rephrased messages, same format, …
Figure 30: Prompt used to produce the negative adversarial …
Figure 36: System prompt utilized for the Crawl4AI (LM) …
Figure 34: Prompt used to produce the negative adversarial …
Figure 37: System prompt utilized for the adversarial …
Figure 38: System prompt to insert semantic canaries into …
Figure 39: System prompt utilized by the autonomous CTF …
Figure 40: System prompt utilized by adversary to try and to …
Figure 42: System prompt utilized to insert semantic canaries …
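
Several captions above report TPR and FPR as a function of a threshold 𝑚 on the number of matching canaries out of 𝑘 injected. A minimal sketch of that decision rule follows, assuming a session is attributed when at least 𝑚 of its injected canaries are detected. The function names and the detection counts are hypothetical; the counts are made-up data for illustration, not results from the paper.

```python
from typing import Dict, Set, Tuple


def attribute(detections: Dict[str, int], m: int) -> Set[str]:
    """Return the sessions in which at least m canaries were detected."""
    return {sid for sid, count in detections.items() if count >= m}


def rates(detections: Dict[str, int], truth: Dict[str, bool], m: int) -> Tuple[float, float]:
    """Return (TPR, FPR) at threshold m.

    detections -- number of canaries detected per session
    truth      -- True for sessions that really consumed the injected canaries
    """
    flagged = attribute(detections, m)
    positives = {sid for sid, is_target in truth.items() if is_target}
    negatives = set(truth) - positives
    tpr = len(flagged & positives) / len(positives) if positives else 0.0
    fpr = len(flagged & negatives) / len(negatives) if negatives else 0.0
    return tpr, fpr


# Made-up example with k = 5 injected canaries: s1..s3 are true targets,
# s4..s6 are unrelated sessions, one of which has a chance match.
detections = {"s1": 5, "s2": 3, "s3": 1, "s4": 0, "s5": 1, "s6": 0}
truth = {"s1": True, "s2": True, "s3": True, "s4": False, "s5": False, "s6": False}

at_m1 = rates(detections, truth, m=1)  # every target caught, one false match
at_m2 = rates(detections, truth, m=2)  # false match gone, one target missed
```

Raising 𝑚 trades missed targets for fewer chance matches, which is the separation between the TPR and FPR curves that the captions describe.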
Original abstract

AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance, yielding a formal asymmetry in the defender's favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript formalizes the problem of agent attribution: linking an observed AI agent interaction to the responsible account at a model vendor. It proposes a canary-based protocol in which an authorized party injects a canary into the agent's interaction stream and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries are suggested for non-adversarial settings; robust constructions are developed for adversarial operators who might filter or paraphrase content, creating a claimed asymmetry favoring the defender. The authors state that they evaluate the approach on a variety of scenarios including real-world agents and conclude that the method is reliable, robust, and scalable for vendor-side deployment.

Significance. If the protocol and its robustness claims hold under realistic conditions, the work would address a genuine accountability gap for autonomous AI agents powered by vendor-hosted models. The formalization of agent attribution and the design of robust canaries that impose a performance penalty on suppression are conceptually useful contributions. However, the practical significance is constrained by the unexamined dependency on vendor log retention and search cooperation, which is not supported by any empirical data on current industry practices.

major comments (2)
  1. [Abstract] Abstract: the claim that the attribution method was evaluated on 'real-world agents' and shown to be 'reliable, robust, and scalable' is load-bearing for the central contribution, yet the abstract (and by extension the manuscript) provides no quantitative results, error rates, success probabilities, or description of how robustness was measured against filtering or paraphrasing attacks.
  2. [Protocol] Protocol section: the attribution step requires vendors to maintain searchable session logs over a narrow time window and to execute searches on behalf of authorized parties; this external dependency is not accompanied by any survey, citation, or measurement of current vendor logging policies, retention periods, or willingness to cooperate, rendering the end-to-end practicality unverified.
minor comments (2)
  1. [Threat Model] Clarify the exact threat model and the precise definition of 'authorized party' who is permitted to inject canaries and request log searches.
  2. [Construction] Provide pseudocode or a clear algorithmic description of the robust canary construction and the vendor-side search procedure.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful review and constructive comments on our manuscript. We address each major comment point by point below, proposing revisions where they strengthen the work without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that the attribution method was evaluated on 'real-world agents' and shown to be 'reliable, robust, and scalable' is load-bearing for the central contribution, yet the abstract (and by extension the manuscript) provides no quantitative results, error rates, success probabilities, or description of how robustness was measured against filtering or paraphrasing attacks.

    Authors: We agree that the abstract would benefit from greater specificity to support the evaluation claims. The manuscript's Evaluation section reports concrete results from scenarios including real-world agents, with attribution success rates above 90% in non-adversarial cases and robustness metrics showing that adversarial filtering or paraphrasing requires task-performance degradation exceeding 30% in tested configurations. We will revise the abstract to include a concise summary of these quantitative findings, such as success probabilities and the measured asymmetry in defender advantage. revision: yes

  2. Referee: [Protocol] Protocol section: the attribution step requires vendors to maintain searchable session logs over a narrow time window and to execute searches on behalf of authorized parties; this external dependency is not accompanied by any survey, citation, or measurement of current vendor logging policies, retention periods, or willingness to cooperate, rendering the end-to-end practicality unverified.

    Authors: The protocol explicitly assumes vendor-side log retention and search capability within a narrow temporal window to keep storage and compute costs low. This assumption is discussed in the manuscript as aligning with operational needs for session auditing, but we acknowledge that no dedicated survey of current vendor policies is provided. We will expand the Protocol and Discussion sections to more explicitly state this dependency, reference general compliance-driven logging practices where possible, and note the resulting limitations on end-to-end deployment. revision: partial

standing simulated objections not resolved
  • Empirical survey or measurement of proprietary vendor logging policies, retention periods, and willingness to cooperate, as such data is not publicly available and lies outside the technical scope of the paper.

Circularity Check

0 steps flagged

No significant circularity; novel protocol construction stands independently

Full rationale

The paper defines the agent attribution problem and presents a canary-injection protocol that links observed interactions to vendor accounts via log searches. This is framed as a first-of-its-kind practical solution, with the abstract describing canary designs for non-adversarial and adversarial cases plus evaluations on real-world agents. No equations, fitted parameters, or self-citations are shown reducing the core claims to tautological restatements of inputs; the construction relies on explicit technical choices (canary robustness creating defender asymmetry) that are not derived by re-labeling prior results or self-referential fitting. The protocol is therefore self-contained as an independent design rather than a circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on vendor-side log access and the assumption that robust canaries cannot be suppressed without degrading the agent's task performance; these are domain assumptions rather than derived results.

axioms (1)
  • domain assumption: Vendors maintain narrow-window session logs that can be searched to recover originating accounts.
    The protocol requires the vendor to perform the search step described in the abstract.
invented entities (1)
  • robust canary constructions (no independent evidence)
    purpose: Markers that cannot be filtered or paraphrased by adversarial operators without degrading the agent's task performance.
    Introduced specifically for the adversarial setting in the abstract.

pith-pipeline@v0.9.0 · 5827 in / 1307 out tokens · 40750 ms · 2026-05-20T17:21:55.377378+00:00 · methodology

