Identifying AI Web Scrapers Using Canary Tokens
Pith reviewed 2026-05-14 17:52 UTC · model grok-4.3
The pith
Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By running dynamic sites that deliver a distinct canary token to each scraper and then querying LLMs for information drawn from those sites, the authors show that token reproduction in model output provides reliable evidence of which scrapers contributed data to which LLMs.
What carries the argument
Canary tokens, unique strings served once to each scraper, whose later appearance in LLM responses traces data flow from scraper to model.
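The serving side of this mechanism can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the secret, token format, and page text are all assumptions, and the scraper identifiers are stand-ins for whatever signal (User-Agent, IP, session) the site uses to tell visitors apart.

```python
import hashlib
import hmac

SECRET = b"site-operator-secret"  # illustrative; a real deployment keeps this private

def canary_token(scraper_id: str) -> str:
    # Deterministic per-scraper token: the same scraper always sees the
    # same token, and no two scrapers share one.
    digest = hmac.new(SECRET, scraper_id.encode(), hashlib.sha256).hexdigest()
    return f"zq{digest[:16]}x"  # short, unusual string unlikely to occur naturally

def render_page(scraper_id: str) -> str:
    # Identical page content except for the embedded scraper-specific token.
    token = canary_token(scraper_id)
    return f"<html><body><p>Our site mascot is named {token}.</p></body></html>"

page_a = render_page("GPTBot/1.0")     # one token for this scraper
page_b = render_page("ClaudeBot/1.0")  # a different token for another
```

Because the token is derived rather than stored, the site can later recover which scraper was served any token that resurfaces in an LLM's output.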
If this is right
- Website owners gain an automated way to map specific scrapers to the LLMs they serve.
- The mapping works for scrapers that companies have not publicly disclosed.
- Identified scrapers can be targeted with access controls such as robots.txt rules or IP blocks.
- The same sites can be reused to monitor ongoing scraping activity over time.
- Evidence from the method can support complaints or legal steps against unwanted data collection.
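Once a scraper is identified, targeting it with a Robots Exclusion Protocol rule (RFC 9309) takes one stanza in robots.txt; the User-Agent below is a real disclosed crawler name, used here purely as an example:

```
User-agent: GPTBot
Disallow: /
```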
Where Pith is reading between the lines
- Site operators could combine the technique with rate limiting to slow down identified scrapers while leaving other traffic unaffected.
- The approach might be extended to track non-web data sources if similar unique markers can be inserted upstream.
- Widespread use could create pressure for LLM providers to publish clearer data-source lists to avoid repeated detection.
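The first inference above can be made concrete with a minimal per-scraper throttle. Everything here is an assumption for illustration: the scraper names would come from the canary results, and the interval and clock handling are placeholders.

```python
from collections import defaultdict

IDENTIFIED_SCRAPERS = {"GPTBot", "ClaudeBot"}  # filled in from canary-token findings
MIN_INTERVAL = 5.0  # seconds between allowed requests from an identified scraper

last_allowed = defaultdict(lambda: float("-inf"))

def allow_request(user_agent: str, now: float) -> bool:
    # Throttle only scrapers the canary method has identified;
    # all other traffic passes untouched.
    scraper = next((s for s in IDENTIFIED_SCRAPERS if s in user_agent), None)
    if scraper is None:
        return True
    if now - last_allowed[scraper] >= MIN_INTERVAL:
        last_allowed[scraper] = now
        return True
    return False
```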
Load-bearing premise
An LLM will reproduce a canary token in its output when that token appeared in data collected by a scraper that fed the model.
What would settle it
Prompting an LLM about the site and finding that it never produces the token assigned to a scraper known to have visited, or produces a token from a scraper that never visited.
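That settling test amounts to cross-checking reproduced tokens against the site's visit log. A sketch, with hypothetical names and structures:

```python
def attribute_scrapers(llm_outputs, token_to_scraper, visited):
    # token_to_scraper: token -> scraper it was served to
    # visited: set of scrapers the site's access logs show actually came by
    # A reproduced token assigned to a scraper that never visited would
    # falsify the load-bearing premise, so it is flagged separately.
    supported, anomalies = set(), set()
    for output in llm_outputs:
        for token, scraper in token_to_scraper.items():
            if token in output:
                (supported if scraper in visited else anomalies).add(scraper)
    return supported, anomalies
```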
Original abstract
From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and automatically inferring LLM-related scrapers. We host dynamic websites that serve unique canary tokens to each visiting scraper, then prompt LLMs for information about our sites. If an LLM consistently generates outputs containing tokens unique to a scraper, it provides evidence of exposure to that scraper. Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies. Our approach provides a promising avenue for unprivileged third parties to infer which scrapers serve data to which LLMs, potentially enabling better control over unwanted scraping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a technique for identifying AI web scrapers by deploying dynamic websites that serve unique canary tokens to each visiting scraper. By prompting LLMs about the site content and checking for the presence of these tokens in the generated outputs, the authors aim to infer which scrapers have contributed data to specific LLMs. Experiments conducted across 22 production LLM systems are claimed to demonstrate reliable identification, including for undisclosed scrapers.
Significance. If the reproduction assumption holds and quantitative validation is added, this method could provide website operators with a scalable, automatic means to detect which scrapers feed data to particular LLMs, supporting better enforcement of access controls and addressing privacy and stability concerns around LLM training data.
Major comments (2)
- [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without any quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.
- [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.
Minor comments (2)
- [Figures] Figure captions and legends could be expanded to clarify how tokens are embedded and detected in sample outputs.
- [References] A few citations to prior work on web-scraping detection appear incomplete or use non-standard formatting.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to strengthen the empirical presentation and clarify methodological assumptions.
Point-by-point responses
- Referee: [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without any quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.
  Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support. The experiments section reports per-LLM results, but these are not summarized with aggregate rates or controls. In the revision we add a summary table of detection rates (85-100% across the 22 LLMs), false-positive rates from control prompts (<5%), and explicit ablation results for token placement and prompting variations. We also update the abstract to include these summary statistics. Revision: yes.
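The aggregate statistics the authors propose to add reduce to simple counting over repeated prompting trials. The sketch below uses made-up trial outcomes; the numbers are not the paper's data.

```python
def detection_rate(trials):
    # trials: (llm_output, expected_token) pairs from repeated prompting
    hits = sum(1 for output, token in trials if token in output)
    return hits / len(trials)

# Hypothetical outcomes for one scraper-LLM pair:
positive_trials = [("the mascot is zqabc", "zqabc")] * 9 + [("no token here", "zqabc")]
control_trials = [("unrelated answer", "zqabc")] * 20  # prompts about unseeded topics

print(detection_rate(positive_trials))  # 0.9
print(detection_rate(control_trials))   # 0.0 -> empirical false-positive rate
```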
- Referee: [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.
  Authors: We acknowledge that direct ground-truth verification of token survival is not possible without access to proprietary training corpora. Our current evidence relies on consistent reproduction across repeated, independent prompts to each LLM. We will add an explicit limitations subsection discussing this assumption, potential failure modes (e.g., aggressive deduplication), and results from ablation experiments on token length, position, and prompt phrasing that we performed but had not highlighted. Revision: partial.
- Unresolved after the rebuttal: the absence of independent ground-truth checks for whether tokens actually entered any LLM's training data, which would require access to the companies' proprietary datasets.
Circularity Check
No circularity: empirical method relies on external LLM behavior
Full rationale
The paper describes an experimental technique of serving unique canary tokens via dynamic websites to visiting scrapers, then prompting LLMs and checking for token reproduction in outputs. No equations, fitted parameters, or derivations are present. Claims rest on observed behavior across 22 external production systems rather than any self-referential fit or self-citation chain; by construction, the results cannot reduce to the method's own inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs will reproduce canary tokens in generated text when those tokens were present in scraped training or augmentation data.