pith:XNC5N7RT
GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
A public system enables semantic and visual searches over 10 million federal government PDFs at roughly $1,500 in preprocessing cost.
arxiv:2511.11010 v2 · 2025-11-14 · cs.IR · cs.DL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XNC5N7RT5EM2TACSV6F2WZJCEX}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
We introduce GovScape, a public search system that supports four primary forms of search over these 10 million PDFs: ... semantic text search and visual search against the PDFs across individual pages... total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute.
Assumption: the chosen embedding models and visual search components produce sufficiently accurate results for the intended use cases. The abstract reports no extensive user studies or quantitative evaluation of retrieval quality to support this.
GovScape delivers multimodal search over 10 million government PDFs using metadata, exact text, semantic embeddings, and visual page features at an estimated $1,500 preprocessing cost.
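The cost figures in the claims above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant names are hypothetical, and the inputs are the approximate values quoted in the abstract (about $1,500, about 47,000 pages per dollar, about 10 million PDFs), so the results are rough estimates rather than reported numbers.

```python
# Approximate figures quoted in the abstract (rounded by the authors).
TOTAL_COST_USD = 1_500        # estimated pre-processing compute cost
PAGES_PER_DOLLAR = 47_000     # reported throughput per dollar
NUM_PDFS = 10_000_000         # size of the PDF collection

# Implied page count: cost multiplied by pages processed per dollar.
total_pages = TOTAL_COST_USD * PAGES_PER_DOLLAR

# Derived estimates (not stated in the source, computed here for illustration).
avg_pages_per_pdf = total_pages / NUM_PDFS
cost_per_pdf_usd = TOTAL_COST_USD / NUM_PDFS

print(f"implied pages:        {total_pages:,}")
print(f"average pages per PDF: {avg_pages_per_pdf:.2f}")
print(f"cost per PDF (USD):    {cost_per_pdf_usd:.5f}")
```

The implied total of about 70.5 million pages is consistent with the "70 Million Pages" in the title, and works out to roughly seven pages per PDF.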
Receipt and verification
| First computed | 2026-05-18T03:09:33.253682Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XNC5N7RT5EM2TACSV6F2WZJCEX \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "f0ca5adb20be8b703074165071c0771a274bdeb6faaaa54436042e11cc476101",
"cross_cats_sorted": [
"cs.DL"
],
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"primary_cat": "cs.IR",
"submitted_at": "2025-11-14T06:54:48Z",
"title_canon_sha256": "b9e3ab69407080e883de8c39f3e1f4e7962c186fa6e00e25666ec936589fd6a1"
},
"schema_version": "1.0",
"source": {
"id": "2511.11010",
"kind": "arxiv",
"version": 2
}
}
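The same canonicalization used by the shell one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes, SHA-256) can be written as a small Python function. This is a minimal sketch: `canonical_sha256`, `record`, and `EXPECTED` are hypothetical names, and the record is copied by hand from this page. Whether the digest equals the published hash depends on the served record being identical to the one shown here, which this sketch does not verify.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode()  # str.encode() defaults to UTF-8
    return hashlib.sha256(canonical).hexdigest()


# Canonical record as displayed on this page (copied by hand; an assumption).
record = {
    "metadata": {
        "abstract_canon_sha256": "f0ca5adb20be8b703074165071c0771a274bdeb6faaaa54436042e11cc476101",
        "cross_cats_sorted": ["cs.DL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2025-11-14T06:54:48Z",
        "title_canon_sha256": "b9e3ab69407080e883de8c39f3e1f4e7962c186fa6e00e25666ec936589fd6a1",
    },
    "schema_version": "1.0",
    "source": {"id": "2511.11010", "kind": "arxiv", "version": 2},
}

# Published canonical hash from the "Canonical hash" entry on this page.
EXPECTED = "bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984"

digest = canonical_sha256(record)
print("computed:", digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which fields appear in the fetched JSON does not affect the digest; only the values and key names do.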