pith. machine review for the scientific record.

arXiv: 2605.13706 · v1 · submitted 2026-05-13 · 💻 cs.CR · cs.AI · cs.CY · cs.NI

Recognition: unknown

Identifying AI Web Scrapers Using Canary Tokens

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:52 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.CY · cs.NI
keywords canary tokens · web scraping · LLM data sources · scraper identification · training data attribution · access control · AI ethics

The pith

Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a method for website owners to identify which web scrapers supply training or context data to particular LLMs. It works by serving a different secret string to each scraper that visits a controlled site, then asking the LLMs questions about the site content. When an LLM reliably repeats one of those strings, the match shows that the corresponding scraper contributed material to the model. Experiments across twenty-two production systems confirmed that the technique detects both known and undisclosed scrapers, giving site operators evidence they can use to restrict access.

Core claim

By running dynamic sites that deliver a distinct canary token to each scraper and then querying LLMs for information drawn from those sites, the authors show that token reproduction in model output provides reliable evidence of which scrapers contributed data to which LLMs.

What carries the argument

Canary tokens, unique strings served once to each scraper, whose later appearance in LLM responses traces data flow from scraper to model.
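The issuing side of this machinery can be sketched in a few lines. The sketch below is illustrative, not the authors' implementation: it keys scrapers by an identifier such as the User-Agent string, mints a fresh random token on a scraper's first visit, and keeps the reverse mapping needed for later attribution.

```python
import secrets

class CanaryIssuer:
    """Mint one unique canary token per scraper and remember the
    token -> scraper mapping for later attribution (illustrative
    sketch, not the paper's implementation)."""

    def __init__(self):
        self._by_scraper = {}  # scraper id (e.g. User-Agent) -> token
        self._by_token = {}    # token -> scraper id

    def token_for(self, scraper_id: str) -> str:
        # Reuse the same token on repeat visits so every page this
        # scraper fetches carries a single, consistent marker.
        if scraper_id not in self._by_scraper:
            token = "canary-" + secrets.token_hex(8)
            self._by_scraper[scraper_id] = token
            self._by_token[token] = scraper_id
        return self._by_scraper[scraper_id]

    def attribute(self, token: str):
        # Map a token seen in LLM output back to the scraper it was
        # served to; None means the token was never issued here.
        return self._by_token.get(token)

issuer = CanaryIssuer()
page = f"Site registry number: {issuer.token_for('GPTBot/1.0')}."
```

A dynamic site would interpolate `token_for(...)` into each page at request time, so two scrapers fetching the same URL receive different content.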

If this is right

  • Website owners gain an automated way to map specific scrapers to the LLMs they serve.
  • The mapping works for scrapers that companies have not publicly disclosed.
  • Identified scrapers can be targeted with access controls such as robots.txt rules or IP blocks.
  • The same sites can be reused to monitor ongoing scraping activity over time.
  • Evidence from the method can support complaints or legal steps against unwanted data collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Site operators could combine the technique with rate limiting to slow down identified scrapers while leaving other traffic unaffected.
  • The approach might be extended to track non-web data sources if similar unique markers can be inserted upstream.
  • Widespread use could create pressure for LLM providers to publish clearer data-source lists to avoid repeated detection.

Load-bearing premise

An LLM will reproduce a canary token in its output when that token appeared in data collected by a scraper that fed the model.

What would settle it

Prompting an LLM about the site and finding that it never produces the token assigned to a scraper known to have visited, or produces a token from a scraper that never visited.
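That settling test amounts to string matching over repeated trials. A minimal sketch, with an assumed consistency threshold that the paper may set differently:

```python
def match_tokens(responses, issued):
    """Count, per issued token, how many of the repeated LLM
    responses contain it verbatim."""
    return {tok: sum(tok in r for r in responses) for tok in issued}

def consistent(hits, trials, threshold=0.8):
    # Tokens reproduced in at least `threshold` of trials; 0.8 is an
    # assumed cutoff, not the paper's.
    return {tok for tok, n in hits.items() if n / trials >= threshold}

issued = {"canary-3f9c", "canary-a1b2"}
responses = [
    "The site's registry number is canary-3f9c.",
    "Its registry number appears to be canary-3f9c.",
    "I don't have information about that site.",
]
hits = match_tokens(responses, issued)  # canary-3f9c found in 2 of 3
```

Either disconfirming outcome is visible in this frame: the token of a scraper known to have visited never clears the threshold, or the token of a scraper that never visited does.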

Figures

Figures reproduced from arXiv: 2605.13706 by Caroline Zhang, Emily Wenger, Enze Liu, Steven Seiden, Taein Kim, Triss Ren.

Figure 1: High-level overview of how web data is sourced during real-time content retrieval by AI chatbots.
Figure 2: Our proposed pipeline for identifying web scrapers used by AI chatbots.
Figure 3: Measurement timeline for our study. Our measurements are divided into 3 distinct stages with varying website accessibility conditions; see Section 4 for details. Counts recovered from the accompanying table:

                     User-Agents   ASNs    Unique visitors
  Min across sites       313        154         313
  Max from sites         477        226         674
  Avg from sites         405.95     192.4       592.2
  All across sites      2765        549        4042
original abstract

From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and automatically inferring LLM-related scrapers. We host dynamic websites that serve unique canary tokens to each visiting scraper, then prompt LLMs for information about our sites. If an LLM consistently generates outputs containing tokens unique to a scraper, it provides evidence of exposure to that scraper. Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies. Our approach provides a promising avenue for unprivileged third parties to infer which scrapers serve data to which LLMs, potentially enabling better control over unwanted scraping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a technique for identifying AI web scrapers by deploying dynamic websites that serve unique canary tokens to each visiting scraper. By prompting LLMs about the site content and checking for the presence of these tokens in the generated outputs, the authors aim to infer which scrapers have contributed data to specific LLMs. Experiments conducted across 22 production LLM systems are claimed to demonstrate reliable identification, including for undisclosed scrapers.

Significance. If the reproduction assumption holds and quantitative validation is added, this method could provide website operators with a scalable, automatic means to detect which scrapers feed data to particular LLMs, supporting better enforcement of access controls and addressing privacy and stability concerns around LLM training data.

major comments (2)
  1. [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.
  2. [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.
minor comments (2)
  1. [Figures] Figure captions and legends could be expanded to clarify how tokens are embedded and detected in sample outputs.
  2. [References] A few citations to prior work on web-scraping detection appear incomplete or use non-standard formatting.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to strengthen the empirical presentation and clarify methodological assumptions.

point-by-point responses
  1. Referee: [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.

    Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support. The experiments section reports per-LLM results, but these are not summarized with aggregate rates or controls. In the revision we add a summary table of detection rates (85-100% across the 22 LLMs), false-positive rates from control prompts (<5%), and explicit ablation results for token placement and prompting variations. We also update the abstract to include these summary statistics. revision: yes

  2. Referee: [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.

    Authors: We acknowledge that direct ground-truth verification of token survival is not possible without access to proprietary training corpora. Our current evidence relies on consistent reproduction across repeated, independent prompts to each LLM. We will add an explicit limitations subsection discussing this assumption, potential failure modes (e.g., aggressive deduplication), and results from ablation experiments on token length, position, and prompt phrasing that we performed but had not highlighted. revision: partial

standing simulated objections not resolved
  • Absence of independent ground-truth checks for whether tokens actually entered any LLM's training data, which would require access to the companies' proprietary datasets.
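The aggregate statistics the rebuttal promises (detection rates, false-positive rates from control prompts) reduce to simple ratios plus a decision rule. A sketch with assumed thresholds and hypothetical counts, not figures from the paper:

```python
def detection_rate(hits: int, trials: int) -> float:
    # Fraction of prompts about the site whose answer contained the
    # expected token.
    return hits / trials

def false_positive_rate(spurious: int, control_trials: int) -> float:
    # Fraction of control prompts (about unrelated content) that still
    # surfaced an issued token.
    return spurious / control_trials

def linked(hits, trials, spurious, controls, min_det=0.8, max_fpr=0.05):
    # Flag a scraper -> LLM link only when reproduction is consistent
    # AND control prompts rarely produce tokens (thresholds assumed).
    return (detection_rate(hits, trials) >= min_det
            and false_positive_rate(spurious, controls) <= max_fpr)
```

Under this rule, a model that repeats a token in 17 of 20 site prompts but only 1 of 40 control prompts would be flagged as linked to that token's scraper.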

Circularity Check

0 steps flagged

No circularity: empirical method relies on external LLM behavior

full rationale

The paper describes an experimental technique of serving unique canary tokens via dynamic websites to visiting scrapers, then prompting LLMs and checking for token reproduction in outputs. No equations, fitted parameters, or derivations are present. Claims rest on observed behavior across 22 external production systems rather than any self-referential fit or self-citation chain. The method is validated against external systems rather than its own outputs, so its results are not reducible to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about LLM output behavior and treats canary tokens as an application of an existing concept rather than a new invented entity.

axioms (1)
  • domain assumption LLMs will reproduce canary tokens in generated text when those tokens were present in scraped training or augmentation data
    Invoked in the description of how token presence in LLM outputs provides evidence of scraper exposure.

pith-pipeline@v0.9.0 · 5574 in / 1122 out tokens · 22771 ms · 2026-05-14T17:52:31.506880+00:00 · methodology

