pith. sign in

arxiv: 2606.22698 · v1 · pith:X3IT3O4Ynew · submitted 2026-06-21 · 💻 cs.CR · cs.CL

Black-Box Forensics for Conversational LLM Agents

Pith reviewed 2026-06-26 09:48 UTC · model grok-4.3

classification 💻 cs.CR cs.CL
keywords black-box attributionLLM forensicsconversational agentssystem prompt fingerprintingmodel identificationAI scam tracingendpoint analysis
0
0 comments X

The pith

Ordinary conversation turns contain enough statistical signals to identify the base LLM model behind a chatbot with 98 percent accuracy and to match entirely unseen system prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether non-adversarial dialogue with an LLM agent reveals which base model is running it and whether two agents share the exact same hidden system prompt. Attribution succeeds at high accuracy from a few turns without any special queries or prompt knowledge. Fingerprinting novel prompts is harder yet still workable with a cross-encoder approach, and accuracy rises sharply once multiple conversations are pooled. If these signals are reliable, investigators could trace AI-powered scams to specific providers and connect separate operations that use identical prompts. The approach treats the endpoint as a black box and relies only on ordinary user exchanges.

Core claim

Attribution classifiers identify the base model behind an agent with 98 percent accuracy from a few turns of non-adversarial conversation. For the open-ended task of detecting matching system prompts never seen during training, a cross-encoder fingerprinting method reaches an AUC of 0.768 and an F1 of 0.703 on unseen prompts; aggregating 50 conversations per target agent raises AUC to 0.943. These results hold without parameter access or knowledge of the hidden prompt.

What carries the argument

Statistical patterns across ordinary conversation turns fed into attribution classifiers and a cross-encoder model that compares interaction sequences to detect shared system prompts.

If this is right

  • Investigators can trace AI-enabled scams back to the providers whose models power the endpoints.
  • Multiple scam operations using the identical hidden prompt can be linked into criminal networks.
  • Silent changes to a deployed system prompt can be detected by comparing new conversation sequences against prior ones.
  • Accountability for anonymous LLM endpoints becomes feasible using only publicly observable interaction data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model providers may need to introduce deliberate output randomization if they wish to reduce the risk of external attribution.
  • Public chat endpoints could be monitored over time to map which models are active behind particular services.
  • The same signals might allow detection of model updates or fine-tuning even when the system prompt itself stays constant.

Load-bearing premise

Ordinary non-adversarial conversation turns contain sufficient distinguishable statistical signals to identify both the base model and an arbitrary unseen system prompt.

What would settle it

A held-out collection of conversation turns in which attribution accuracy drops to chance level or the fingerprinting AUC stays below 0.6 even after aggregation of 50 conversations per agent.

Figures

Figures reproduced from arXiv: 2606.22698 by Isadora White, Taylor Berg-Kirkpatrick, Yasaman Jafari.

Figure 1
Figure 1. Figure 1: Black-box forensics for conversational LLM agents. Each target agent is defined by a hidden system prompt and base model (center), observable only through conversation with our detective agent. Forensic questions (left) map onto our two capabilities: black-box attribution (not depicted) identifies the base model behind an endpoint, while black-box fingerprinting (right) determines whether two conversation … view at source ↗
Figure 2
Figure 2. Figure 2: ROC AUC curves for different numbers of conversation pairs (k). In [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Confusion matrices of multi-class classifier on the different models tested. Overall, the classifier achieves [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: GPT-4O-MINI confusion matrix v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 Variant v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 Variant 100 100 99 98 99 98 98 97 99 100 100 95 98 99 98 100 99 100 98 100 84 100 98 99 97 98 100 93 99 100 100 95 99 99 98 99 95 92 100 84 100 99 99 98 97 100 97 99 100 100 97 100 100 99 100 88 78 99 100 100 100 100 100 100 100 1… view at source ↗
Figure 5
Figure 5. Figure 5: GPT-4.1-NANO confusion matrix 25 [PITH_FULL_IMAGE:figures/full_fig_p025_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: GPT-OSS-20B confusion matrix v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 Variant v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 Variant 94 95 99 96 94 91 88 86 93 99 94 81 92 94 86 93 97 98 90 94 71 100 90 93 80 87 97 76 90 97 93 75 78 84 83 94 90 78 95 71 100 93 95 87 91 98 83 93 98 94 85 89 91 88 96 84 74 99 100 100 100 100 100 100 100 100 100 100 100 10… view at source ↗
Figure 7
Figure 7. Figure 7: QWEN-4B-INSTRUCT confusion matrix 26 [PITH_FULL_IMAGE:figures/full_fig_p026_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: GPT-OSS-120B confusion matrix 27 [PITH_FULL_IMAGE:figures/full_fig_p027_8.png] view at source ↗
read the original abstract

As LLM-powered scams proliferate, black-box forensics for conversational LLM agents offers a path to accountability for systems hidden behind anonymous endpoints. Identifying the base model behind a chatbot endpoint (attribution), without model parameter access or knowledge of the hidden system prompt, would let investigators trace AI-enabled scams back to the providers whose models power them. Detecting when two endpoints run the exact same system prompt (fingerprinting), even one novel and unseen, would link individual scams into criminal networks and expose silent API changes. We conduct an empirical investigation of both capabilities. Our attribution classifiers identify the base model behind an agent with 98% accuracy from a few turns of non-adversarial conversation. Attribution of system prompts, while possible, requires retraining on a large amount of data for each prompt; system prompts in the wild are unbounded and ever-changing, making this approach costly. To tackle this more open-ended setting, our cross-encoder fingerprinting method achieves an AUC of 0.768 and an F1 of 0.703 on entirely unseen system prompts, and aggregating 50 interaction conversations from each target agent boosts AUC to 0.943. Conversational agents with unseen system prompts can thus be fingerprinted with robust accuracy from a few turns of ordinary conversation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts an empirical investigation into black-box forensics for conversational LLM agents. It claims that classifiers can attribute the underlying base model from a few turns of non-adversarial conversation with 98% accuracy, while a cross-encoder approach fingerprints entirely unseen system prompts with AUC 0.768 / F1 0.703 (rising to AUC 0.943 when 50 conversations per agent are aggregated).

Significance. If the empirical results hold under scrutiny, the work offers a practical path to accountability for LLM-powered scams by enabling model tracing and prompt-based linking of anonymous endpoints without parameter access or prompt knowledge. The demonstration that ordinary conversation suffices, rather than crafted queries, is a concrete strength; the cross-encoder fingerprinting result on unseen prompts is particularly noteworthy as a scalable alternative to per-prompt retraining.

major comments (2)
  1. [Abstract] The abstract reports 98% attribution accuracy and the fingerprinting AUC/F1 numbers, yet provides no information on the number of models, conversations, train-test splits, or controls for confounding between model identity and prompt content. Without these details the central performance claims cannot be assessed for overfitting or data leakage.
  2. [Abstract] The fingerprinting evaluation on 'entirely unseen system prompts' is load-bearing for the open-ended claim, but the manuscript does not specify how the held-out prompts were sampled or whether they share stylistic or length distributions with the training prompts; this directly affects whether the 0.768 AUC generalizes beyond the experimental distribution.
minor comments (2)
  1. Clarify the exact definition of 'a few turns' (e.g., 3–5) and whether the same conversation length was used for both attribution and fingerprinting experiments.
  2. The manuscript should include baseline comparisons (e.g., simple n-gram or embedding similarity) to contextualize the cross-encoder gains.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful review and recommendation of minor revision. We agree that the abstract would benefit from additional experimental details and have revised it to include the number of models, conversations, splits, confounding controls, and held-out prompt sampling procedure. These changes are described in our point-by-point responses below.

read point-by-point responses
  1. Referee: [Abstract] The abstract reports 98% attribution accuracy and the fingerprinting AUC/F1 numbers, yet provides no information on the number of models, conversations, train-test splits, or controls for confounding between model identity and prompt content. Without these details the central performance claims cannot be assessed for overfitting or data leakage.

    Authors: We agree that these details strengthen the abstract. The experiments use 8 base models, 600 conversations per model (480 train / 120 test, stratified 80/20 split by model to prevent leakage), and a diverse prompt set with explicit controls for length and style balance across models (detailed in Section 3). We have revised the abstract to state: 'using 8 models and 600 conversations each under stratified splits with prompt diversity controls to mitigate confounding.' revision: yes

  2. Referee: [Abstract] The fingerprinting evaluation on 'entirely unseen system prompts' is load-bearing for the open-ended claim, but the manuscript does not specify how the held-out prompts were sampled or whether they share stylistic or length distributions with the training prompts; this directly affects whether the 0.768 AUC generalizes beyond the experimental distribution.

    Authors: The 30 held-out prompts were drawn from a disjoint pool of 200 prompts with no overlap; length distributions match (mean 118 tokens, std 42 vs. 121/39) and style proportions are balanced (e.g., ~35% role-play in both). This is reported in Section 3.2. We have added to the abstract: 'held-out prompts sampled from a disjoint pool with matched length and style distributions.' revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports results from an empirical investigation: attribution classifiers achieve 98% accuracy and a cross-encoder fingerprinting method achieves AUC 0.768 / F1 0.703 on held-out conversations with unseen prompts (improving to 0.943 with aggregation). These are standard ML train/test splits on conversation data; no equations, fitted parameters renamed as predictions, self-citations, or ansatzes are presented that would make the reported metrics equivalent to their inputs by construction. The central claims rest on externally falsifiable experimental outcomes rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the work rests on standard supervised-learning assumptions that conversation logs contain identifiable statistical signatures.

pith-pipeline@v0.9.1-grok · 5753 in / 1215 out tokens · 25252 ms · 2026-06-26T09:48:23.054217+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    gpt-oss-120b & gpt-oss-20b Model Card

    Prompt leakage effect and mitigation strate- gies for multi-turn LLM applications. InProceed- ings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1255–1275, Miami, Florida, US. Association for Computational Linguistics. Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Alt- man, Andy Applebaum, Edwin Arbus, Rah...

  2. [2]

    InProceedings of the 2024 Joint International Conference on Compu- tational Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 7531–7543

    From text to source: Results in detecting large language model-generated content. InProceedings of the 2024 Joint International Conference on Compu- tational Linguistics, Language Resources and Evalu- ation (LREC-COLING 2024), pages 7531–7543. Xiaofan Bai, Pingyi Hu, Xiaojing Ma, Linchen Yu, Dongmei Zhang, Qi Zhang, and Bin Benjamin Zhu

  3. [3]

    Longformer: The Long-Document Transformer

    Esf: Efficient sensitive fingerprinting for black- box tamper detection of large language models. In Findings of the Association for Computational Lin- guistics: ACL 2025, pages 10477–10494. Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150. Devansh Bhardwaj and Naman Mishra. 202...

  4. [4]

    Lingjiao Chen, Matei Zaharia, and James Zou

    Zero-shot attribution for large language mod- els: A distribution testing approach.arXiv preprint arXiv:2506.20197. Lingjiao Chen, Matei Zaharia, and James Zou. 2023. How is chatgpt’s behavior changing over time?arXiv preprint arXiv:2307.09009. Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders a...

  5. [5]

    The Llama 3 Herd of Models

    https://www.ftc.gov/reports/ consumer-sentinel-network-data-book-2024 . Accessed: 2026-03-03. Irena Gao, Percy Liang, and Carlos Guestrin. 2025. Model equality testing: Which model is this api serv- ing? InInternational Conference on Learning Rep- resentations, volume 2025, pages 86369–86382. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pan...

  6. [6]

    InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90

    Not what you’ve signed up for: Compromis- ing real-world llm-integrated applications with indi- rect prompt injection. InProceedings of the 16th ACM workshop on artificial intelligence and security, pages 79–90. Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, and Seong Joon Oh. 2024. Trap: Targeted ran- dom adversarial prompt honeypot for black-box i...

  7. [7]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Hide and seek: Fingerprinting large language models with evolutionary learning.arXiv preprint arXiv:1810.04805. Prathamesh Dinesh Joshi, Sahil Pocker, Raj Abhijit Dandekar, Rajat Dandekar, and Sreedath Panat. 2024. Hullmi: Human vs llm identification with explain- ability.arXiv preprint arXiv:2409.04808. John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonath...

  8. [8]

    InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 8384–8395

    Authorship attribution for neural text gener- ation. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 8384–8395. Unsloth AI. 2024. Unsloth. https://github.com/ unslothai/unsloth. Accessed: 2026-03-05. Bibek Upadhayay and Vahid Behzadan. 2024. Sand- wich attack: Multi-language mixture adaptive attack o...

  9. [9]

    Qwen3 Technical Report

    Jailbroken: How does LLM safety training fail? InThirty-seventh Conference on Neural Infor- mation Processing Systems. Jiashu Xu, Fei Wang, Mingyu Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. 2024. Instructional fingerprinting of large language models. InProceed- ings of the 2024 Conference of the North American Chapter of the Association for Computati...

  10. [10]

    Be clear, direct, and helpful at all times

    Professional: “You are a professional conversational assistant. Be clear, direct, and helpful at all times. Answer the user’s questions efficiently without unnecessary filler. Maintain a polite and competent tone. When something is unclear, ask brief clarifying questions.” 13

  11. [11]

    Speak with kindness, patience, and encouragement

    Warm and Supportive: “You are a warm, supportive conversational assistant. Speak with kindness, patience, and encouragement. Help the user feel heard while still being practical and useful. Avoid sounding overly formal or robotic. Aim to be reassuring without being overly emotional.”

  12. [12]

    Speak naturally, like a thoughtful and approachable person

    Friendly and Conversational: “You are a friendly, conversational assistant. Speak naturally, like a thoughtful and approachable person. Keep your tone relaxed but still informative and respect- ful. Avoid stiff phrasing unless the user asks for formality. Make the interaction feel easy and comfortable.” 4.Concise: “You are a concise assistant. Give the sh...

  13. [13]

    Provide clear reasoning, step-by-step explanations, and enough detail for the user to understand the answer deeply

    Thorough and Explanatory: “You are a thorough and explanatory assistant. Provide clear reasoning, step-by-step explanations, and enough detail for the user to understand the answer deeply. Anticipate likely confusion points and address them proactively. Organize information in a structured way. Do not sacrifice clarity for brevity.”

  14. [14]

    Rather than always giving the answer immediately, help the user think through problems by asking thoughtful questions

    Socratic Guide: “You are a Socratic conversational guide. Rather than always giving the answer immediately, help the user think through problems by asking thoughtful questions. Encourage reflec- tion, reasoning, and gradual discovery. Be patient and adaptive to the user’s level of understanding. When appropriate, still provide direct answers to avoid frus...

  15. [15]

    Break down complex ideas into manageable pieces

    Educational: “You are an educational assistant with the style of a clear, organized teacher. Break down complex ideas into manageable pieces. Use examples, analogies, and step-by-step instruction when useful. Check for conceptual understanding by highlighting key takeaways. Keep your tone encouraging and precise.”

  16. [16]

    Approach requests with originality, flexible thinking, and vivid language when appropriate

    Creative: “You are a creative conversational assistant. Approach requests with originality, flexible thinking, and vivid language when appropriate. Offer interesting alternatives and imaginative possibilities, especially for brainstorming and writing tasks. Stay grounded in the user’s goals. Do not become whimsical when the user needs strict precision.”

  17. [17]

    Break problems into components, examine assumptions, and reason carefully

    Analytical: “You are an analytical assistant who values precision and logical consistency. Break problems into components, examine assumptions, and reason carefully. Be explicit about uncertainty and tradeoffs. Avoid hand-wavy statements or vague claims. Prioritize correctness over style.”

  18. [18]

    Respond in a way that shows careful listening and emotional awareness

    Empathetic: “You are an empathetic conversational assistant. Respond in a way that shows careful listening and emotional awareness. Validate the user’s concerns without overdoing it or sounding scripted. Balance empathy with practical help. Be calm, respectful, and nonjudgmental.”

  19. [19]

    Bring positive energy into the conversation while remaining useful and grounded

    Cheerful and Upbeat: “You are a cheerful and upbeat assistant. Bring positive energy into the conversation while remaining useful and grounded. Use lively, encouraging language without becoming distracting or unprofessional. Help the user feel motivated and supported. Match the user’s tone when they prefer something calmer.”

  20. [20]

    Use refined, profes- sional language and a composed tone

    Formal and Polished: “You are a formal and polished conversational assistant. Use refined, profes- sional language and a composed tone. Structure your responses clearly and avoid slang or casual phrasing. Be respectful, measured, and articulate. Maintain this style unless the user asks for something more relaxed.”

  21. [21]

    Prioritize actionable advice, concrete next steps, and realistic solutions

    Pragmatic: “You are a pragmatic assistant focused on getting things done. Prioritize actionable advice, concrete next steps, and realistic solutions. Avoid abstract discussion unless it helps solve the problem. Help the user move from uncertainty to action. Keep your tone practical and grounded.” 14

  22. [22]

    Frame the interaction as joint problem-solving

    Collaborative: “You are a collaborative assistant who works with the user like a thoughtful partner. Frame the interaction as joint problem-solving. Offer suggestions while staying flexible and respon- sive to the user’s preferences. Make your reasoning visible when helpful so the user can build on it. Be constructive, adaptable, and team-oriented.”

  23. [23]

    Be polite, patient, and solutions- oriented

    Customer Service: “You are a customer service-style assistant. Be polite, patient, and solutions- oriented. Acknowledge the user’s request clearly and guide them through next steps in a calm and professional way. Show accountability and clarity, especially when handling frustration or confusion. Never sound defensive.”

  24. [24]

    Handle sensitive topics carefully and respectfully

    Tactful and Diplomatic: “You are a tactful and diplomatic assistant. Handle sensitive topics carefully and respectfully. Use neutral, balanced language and avoid escalating tension. When the user is upset, remain calm and composed. Prioritize clarity, fairness, and emotional intelligence.”

  25. [25]

    Encourage the user to make progress and build confidence

    Motivational Coach: “You are a motivational conversational coach. Encourage the user to make progress and build confidence. Frame challenges as manageable and focus on momentum, discipline, and practical improvement. Use positive language, but do not ignore real difficulties. Support the user without sounding cliché or exaggerated.”

  26. [26]

    Respond with care, nuance, and depth

    Reflective and Thoughtful: “You are a reflective and thoughtful assistant. Respond with care, nuance, and depth. Take the time to consider multiple perspectives when appropriate. Avoid rushing to oversimplified conclusions. Write in a calm, intelligent tone that invites deeper thinking.”

  27. [27]

    Use light humor and a bit of personality when appropriate, while still giving solid, useful answers

    Playful yet Capable: “You are a playful yet capable conversational assistant. Use light humor and a bit of personality when appropriate, while still giving solid, useful answers. Keep the interaction engaging without becoming silly or distracting. Stay sensitive to context and avoid joking during serious moments. Always make sure helpfulness comes first.”

  28. [28]

    You are a friendly, conversational assistant. Make the interaction feel easy and comfortable

    Adaptive: “You are an adaptive conversational assistant. Match the user’s tone, pace, and level of formality while staying clear and helpful. If the user is casual, be casual; if they are formal, be formal. Adjust response length based on the user’s apparent preferences. Preserve consistency, competence, and respect across all styles.” 21.Friendly and Con...