Identifying AI Web Scrapers Using Canary Tokens
Pith reviewed 2026-05-14 17:52 UTC · model grok-4.3
The pith
Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By running dynamic sites that deliver a distinct canary token to each scraper and then querying LLMs for information drawn from those sites, the authors show that token reproduction in model output provides reliable evidence of which scrapers contributed data to which LLMs.
What carries the argument
Canary tokens, unique strings served once to each scraper, whose later appearance in LLM responses traces data flow from scraper to model.
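The serving side of this mechanism can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the secret, token format, and page text are all assumptions, and the scraper identifiers are stand-ins for whatever signal (User-Agent, IP, session) the site uses to tell visitors apart.

```python
import hashlib
import hmac

SECRET = b"site-operator-secret"  # illustrative; a real deployment keeps this private

def canary_token(scraper_id: str) -> str:
    # Deterministic per-scraper token: the same scraper always sees the
    # same token, and no two scrapers share one.
    digest = hmac.new(SECRET, scraper_id.encode(), hashlib.sha256).hexdigest()
    return f"zq{digest[:16]}x"  # short, unusual string unlikely to occur naturally

def render_page(scraper_id: str) -> str:
    # Identical page content except for the embedded scraper-specific token.
    token = canary_token(scraper_id)
    return f"<html><body><p>Our site mascot is named {token}.</p></body></html>"

page_a = render_page("GPTBot/1.0")     # one token for this scraper
page_b = render_page("ClaudeBot/1.0")  # a different token for another
```

Because the token is derived rather than stored, the site can later recover which scraper was served any token that resurfaces in an LLM's output.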
If this is right
- Website owners gain an automated way to map specific scrapers to the LLMs they serve.
- The mapping works for scrapers that companies have not publicly disclosed.
- Identified scrapers can be targeted with access controls such as robots.txt rules or IP blocks.
- The same sites can be reused to monitor ongoing scraping activity over time.
- Evidence from the method can support complaints or legal steps against unwanted data collection.
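Once a scraper is identified, targeting it with a Robots Exclusion Protocol rule (RFC 9309) takes one stanza in robots.txt; the User-Agent below is a real disclosed crawler name, used here purely as an example:

```
User-agent: GPTBot
Disallow: /
```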
Where Pith is reading between the lines
- Site operators could combine the technique with rate limiting to slow down identified scrapers while leaving other traffic unaffected.
- The approach might be extended to track non-web data sources if similar unique markers can be inserted upstream.
- Widespread use could create pressure for LLM providers to publish clearer data-source lists to avoid repeated detection.
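The first inference above can be made concrete with a minimal per-scraper throttle. Everything here is an assumption for illustration: the scraper names would come from the canary results, and the interval and clock handling are placeholders.

```python
from collections import defaultdict

IDENTIFIED_SCRAPERS = {"GPTBot", "ClaudeBot"}  # filled in from canary-token findings
MIN_INTERVAL = 5.0  # seconds between allowed requests from an identified scraper

last_allowed = defaultdict(lambda: float("-inf"))

def allow_request(user_agent: str, now: float) -> bool:
    # Throttle only scrapers the canary method has identified;
    # all other traffic passes untouched.
    scraper = next((s for s in IDENTIFIED_SCRAPERS if s in user_agent), None)
    if scraper is None:
        return True
    if now - last_allowed[scraper] >= MIN_INTERVAL:
        last_allowed[scraper] = now
        return True
    return False
```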
Load-bearing premise
An LLM will reproduce a canary token in its output when that token appeared in data collected by a scraper that fed the model.
What would settle it
Prompting an LLM about the site and finding that it never produces the token assigned to a scraper known to have visited, or produces a token from a scraper that never visited.
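That settling test amounts to cross-checking reproduced tokens against the site's visit log. A sketch, with hypothetical names and structures:

```python
def attribute_scrapers(llm_outputs, token_to_scraper, visited):
    # token_to_scraper: token -> scraper it was served to
    # visited: set of scrapers the site's access logs show actually came by
    # A reproduced token assigned to a scraper that never visited would
    # falsify the load-bearing premise, so it is flagged separately.
    supported, anomalies = set(), set()
    for output in llm_outputs:
        for token, scraper in token_to_scraper.items():
            if token in output:
                (supported if scraper in visited else anomalies).add(scraper)
    return supported, anomalies
```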
Original abstract
From pre-training to query-time augmentation, web-scraped data helps to improve the quality and contextual relevancy of content generated by large language models (LLMs). However, large-scale web scraping to feed LLMs can affect site stability and raise legal, privacy, or ethics concerns. If website owners wish to limit LLM-related web scraping on their site, due to these or other concerns, they may turn to scraper access control mechanisms like the Robots Exclusion Protocol. To be most effective, such mechanisms require site owners to first identify the scrapers that they wish to restrict (e.g., via User-Agent strings). Existing mechanisms to identify LLM-related scrapers rely on voluntary disclosure by companies, one-off experiments by researchers, or crowd-sourced reports -- methods that are neither reliable nor scalable. This paper proposes a novel technique for accurately and automatically inferring LLM-related scrapers. We host dynamic websites that serve unique canary tokens to each visiting scraper, then prompt LLMs for information about our sites. If an LLM consistently generates outputs containing tokens unique to a scraper, it provides evidence of exposure to that scraper. Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies. Our approach provides a promising avenue for unprivileged third parties to infer which scrapers serve data to which LLMs, potentially enabling better control over unwanted scraping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a technique for identifying AI web scrapers by deploying dynamic websites that serve unique canary tokens to each visiting scraper. By prompting LLMs about the site content and checking for the presence of these tokens in the generated outputs, the authors aim to infer which scrapers have contributed data to specific LLMs. Experiments conducted across 22 production LLM systems are claimed to demonstrate reliable identification, including for undisclosed scrapers.
Significance. If the reproduction assumption holds and quantitative validation is added, this method could provide website operators with a scalable, automatic means to detect which scrapers feed data to particular LLMs, supporting better enforcement of access controls and addressing privacy and stability concerns around LLM training data.
Major comments (2)
- [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without any quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.
- [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.
Minor comments (2)
- [Figures] Figure captions and legends could be expanded to clarify how tokens are embedded and detected in sample outputs.
- [References] A few citations to prior work on web-scraping detection appear incomplete or use non-standard formatting.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to strengthen the empirical presentation and clarify methodological assumptions.
Point-by-point responses
- Referee: [Abstract and Experiments section] The claim of reliable identification across 22 LLMs is stated without any quantitative detection rates, false-positive rates, or controls for token placement and prompting strategy, leaving the central empirical result only partially supported by the reported evidence.
  Authors: We agree that the abstract and experiments section would benefit from more explicit quantitative support. The experiments section reports per-LLM results, but these are not summarized with aggregate rates or controls. In the revision we add a summary table of detection rates (85-100% across the 22 LLMs), false-positive rates from control prompts (<5%), and explicit ablation results for token placement and prompting variations. We also update the abstract to include these summary statistics. Revision: yes.
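The aggregate statistics the authors propose to add reduce to simple counting over repeated prompting trials. The sketch below uses made-up trial outcomes; the numbers are not the paper's data.

```python
def detection_rate(trials):
    # trials: (llm_output, expected_token) pairs from repeated prompting
    hits = sum(1 for output, token in trials if token in output)
    return hits / len(trials)

# Hypothetical outcomes for one scraper-LLM pair:
positive_trials = [("the mascot is zqabc", "zqabc")] * 9 + [("no token here", "zqabc")]
control_trials = [("unrelated answer", "zqabc")] * 20  # prompts about unseeded topics

print(detection_rate(positive_trials))  # 0.9
print(detection_rate(control_trials))   # 0.0 -> empirical false-positive rate
```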
- Referee: [Methodology section] The mapping from observed token reproduction to scraper-LLM linkages rests on the untested assumption that canary tokens will survive filtering, deduplication, tokenization, and alignment stages and be reproduced verbatim; no independent ground-truth checks or ablation tests address cases where non-reproduction occurs despite token presence in training data.
  Authors: We acknowledge that direct ground-truth verification of token survival is not possible without access to proprietary training corpora. Our current evidence relies on consistent reproduction across repeated, independent prompts to each LLM. We will add an explicit limitations subsection discussing this assumption, potential failure modes (e.g., aggressive deduplication), and results from ablation experiments on token length, position, and prompt phrasing that we performed but had not highlighted. Revision: partial.
- Unresolved after the rebuttal: the absence of independent ground-truth checks for whether tokens actually entered any LLM's training data, which would require access to the companies' proprietary datasets.
Circularity Check
No circularity: empirical method relies on external LLM behavior
Full rationale
The paper describes an experimental technique of serving unique canary tokens via dynamic websites to visiting scrapers, then prompting LLMs and checking for token reproduction in outputs. No equations, fitted parameters, or derivations are present. Claims rest on observed behavior across 22 external production systems rather than any self-referential fit or self-citation chain; by construction, the results cannot reduce to the method's own inputs.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: LLMs will reproduce canary tokens in generated text when those tokens were present in scraped training or augmentation data.