pith:GTX6VVQE
Identifying AI Web Scrapers Using Canary Tokens
Dynamic websites can issue unique canary tokens to visiting scrapers so that reproduction of a token in an LLM's output reveals which scraper supplied data to that model.
arxiv:2605.13706 v1 · 2026-05-13 · cs.CR · cs.AI · cs.CY · cs.NI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{GTX6VVQE2MDPUE5RGF7WHHSHUB}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Via experiments across 22 production LLM systems, we demonstrate that our approach can reliably identify which scrapers feed which LLM, including several that are not publicly known or disclosed by the companies.
The approach relies on an LLM reproducing a canary token in its generated output when the token was present in data collected by a scraper that fed the model, without the token being filtered or ignored during training or inference.
Unique canary tokens served to visiting scrapers can be recovered from LLM outputs to identify which scrapers feed data to which of 22 tested production LLMs.
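The claims above describe serving each visiting scraper its own token and later looking for that token in model output. A minimal sketch of that idea follows. It is not the paper's implementation: the HMAC-based token derivation, the `cnry-` token format, the secret key, and the function names `issue_token`, `render_page`, and `attribute` are all assumptions made for illustration.

```python
import hashlib
import hmac
import re

# Hypothetical secret held by the site operator (an assumption, not from the paper).
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical token format: a fixed prefix followed by 16 hex characters.
TOKEN_PREFIX = "cnry-"
TOKEN_PATTERN = re.compile(r"cnry-[0-9a-f]{16}")


def issue_token(scraper_id: str) -> str:
    """Derive a stable token that is unique to one scraper identity."""
    digest = hmac.new(SECRET_KEY, scraper_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return TOKEN_PREFIX + digest[:16]


def render_page(scraper_id: str) -> str:
    """Embed the visitor's token in the page body served to that visitor."""
    return f"<p>Reference code: {issue_token(scraper_id)}</p>"


def attribute(llm_output: str, known_scrapers: list[str]) -> list[str]:
    """Return the known scrapers whose tokens appear in an LLM's output."""
    issued = {issue_token(s): s for s in known_scrapers}
    found = TOKEN_PATTERN.findall(llm_output)
    return [issued[t] for t in found if t in issued]
```

Here `scraper_id` stands for whatever identifies a visitor (for example a user-agent string); a token recovered from a model's response is mapped back to the scraper that received it. Deriving tokens with a keyed hash avoids storing every issued token, at the cost of needing the list of candidate scraper identities at lookup time.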
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-18T02:44:16.805441Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/GTX6VVQE2MDPUE5RGF7WHHSHUB \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 34efead604d306fa13b1317f639e47a06ef9b4a556b1c67f2b516cc4946633c1
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "787bd3a73541827af7de0a594594d64bfa043c5d0c4872ea1adf3e2e39900eef",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CY",
      "cs.NI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CR",
    "submitted_at": "2026-05-13T15:53:57Z",
    "title_canon_sha256": "e4e558889f707ef54200c3ee5e57c4b4530b3047ce021a45b1bfd1b046a18cec"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13706",
    "kind": "arxiv",
    "version": 1
  }
}
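The same check can be run offline against the record shown above, without `curl` or `jq`. The sketch below mirrors the serialization used in the page's one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding, SHA-256). The function name `canonical_hash` is an assumption made for illustration, and the printed digest should be compared against the canonical hash listed on this page rather than taken as verified here.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Hash a record using the serialization from the page's verify command."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # compact form, no whitespace
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode()
    return hashlib.sha256(payload).hexdigest()


# The canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "787bd3a73541827af7de0a594594d64bfa043c5d0c4872ea1adf3e2e39900eef",
        "cross_cats_sorted": ["cs.AI", "cs.CY", "cs.NI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CR",
        "submitted_at": "2026-05-13T15:53:57Z",
        "title_canon_sha256": "e4e558889f707ef54200c3ee5e57c4b4530b3047ce021a45b1bfd1b046a18cec",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13706", "kind": "arxiv", "version": 1},
}

if __name__ == "__main__":
    print(canonical_hash(record))
```

Because keys are sorted before hashing, the digest does not depend on the indentation or key order of the JSON as displayed.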