Pith Number

pith:XNC5N7RT

pith:2025:XNC5N7RT5EM2TACSV6F2WZJCEX
not attested · not anchored · not stored · refs resolved

GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Alison Yan, Benjamin Charles Germain Lee, Claire Gong, Kyle Deeds, Leslie Harka, Mark Phillips, Samuel J Klein, Shannon Zejiang Shen, Shreya Shaji, Trevor Owens, Ying-Hsiang Huang

A public system enables semantic and visual searches over 10 million federal government PDFs at roughly $1,500 in preprocessing cost.

arxiv:2511.11010 v2 · 2025-11-14 · cs.IR · cs.DL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{XNC5N7RT5EM2TACSV6F2WZJCEX}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
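To illustrate why a deterministic merge lets independent mirrors agree, here is a minimal sketch. It is not Pith's actual algorithm, which this page does not describe. The `Event` fields, the last-writer-wins rule, and the sort key are all hypothetical, and signature checking is omitted entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle format is not specified here.
    event_id: str
    timestamp: str  # ISO 8601 UTC, so string order matches time order
    field: str
    value: str


def merge(record: dict, events: list[Event]) -> dict:
    """Fold events into a copy of the record in one fixed total order."""
    state = dict(record)
    # De-duplicate, then sort by (timestamp, event_id) so that every mirror
    # applies the same events in the same order, however they arrived.
    for ev in sorted(set(events), key=lambda e: (e.timestamp, e.event_id)):
        state[ev.field] = ev.value
    return state


base = {"source": "arxiv:2511.11010"}
e1 = Event("a1", "2026-05-18T03:09:33Z", "author_claim", "open")
e2 = Event("b2", "2026-05-19T10:00:00Z", "author_claim", "claimed")

# Two mirrors receive the same events in different orders, one with a duplicate.
mirror_one = merge(base, [e1, e2])
mirror_two = merge(base, [e2, e1, e2])
print(mirror_one == mirror_two)
```

The point of the sketch is only that the result depends on the set of events, not on arrival order or duplication.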

Claims

C1 strongest claim

We introduce GovScape, a public search system that supports four primary forms of search over these 10 million PDFs: ... semantic text search and visual search against the PDFs across individual pages... total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute.
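The figures in this claim can be cross-checked against the page count in the title. The short calculation below uses only the quoted numbers. The average pages per PDF and the cost per million pages are derived here for illustration and are not stated in the source.

```python
# Figures quoted in the claim above.
compute_cost_usd = 1_500
pages_per_dollar = 47_000
pdf_count = 10_000_000

# Derived quantities (not reported by the source).
estimated_pages = compute_cost_usd * pages_per_dollar
pages_per_pdf = estimated_pages / pdf_count
cost_per_million_pages = 1_000_000 / pages_per_dollar

print(f"{estimated_pages:,} pages")                       # 70,500,000 pages
print(f"{pages_per_pdf:.2f} pages per PDF on average")    # 7.05
print(f"${cost_per_million_pages:.2f} per million pages") # $21.28
```

The product of about 70.5 million pages is consistent with the "70 Million Pages" in the title.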

C2 weakest assumption

That the chosen embedding models and visual search components produce sufficiently accurate results for the intended use cases; the abstract reports no extensive user studies or quantitative evaluation of retrieval quality.

C3 one line summary

GovScape delivers multimodal search over 10 million government PDFs using metadata, exact text, semantic embeddings, and visual page features at an estimated $1,500 preprocessing cost.

References

42 extracted · 42 resolved · 3 Pith anchors

[1] History in the age of abundance?: how the web is transforming historical research, 2019
[2] End of term web archive dataset: Longitudinal web archive of .gov and .mil domains, 2023
[3] ‘go fish’: Conceptualising the challenges of engaging national web archives for digital research, 2021
[4] Collection search
[5] Learning transferable visual models from natural language supervision, 2021

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-18T03:09:33.253682Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984

Aliases

arxiv: 2511.11010 · arxiv_version: 2511.11010v2 · doi: 10.48550/arxiv.2511.11010 · pith_short_12: XNC5N7RT5EM2 · pith_short_16: XNC5N7RT5EM2TACS · pith_short_8: XNC5N7RT
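The identifiers on this page appear to be related in a simple way. The short aliases are prefixes of the full Pith Number, and the full number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. This is an observation from the values shown here, not a documented rule, so the sketch below treats it as an assumption.

```python
import base64

# Values copied from this page.
canonical_hash = "bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984"
pith_number = "XNC5N7RT5EM2TACSV6F2WZJCEX"

# Assumption: identifier = leading characters of base32(SHA-256 digest bytes).
encoded = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived = encoded[: len(pith_number)]

# The short aliases listed above are plain prefixes of the full number.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print(derived == pith_number)
print(aliases)
```

For the values on this page the comparison holds. Whether the same derivation applies to other records is not confirmed by the source.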
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/XNC5N7RT5EM2TACSV6F2WZJCEX \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "f0ca5adb20be8b703074165071c0771a274bdeb6faaaa54436042e11cc476101",
    "cross_cats_sorted": [
      "cs.DL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.IR",
    "submitted_at": "2025-11-14T06:54:48Z",
    "title_canon_sha256": "b9e3ab69407080e883de8c39f3e1f4e7962c186fa6e00e25666ec936589fd6a1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2511.11010",
    "kind": "arxiv",
    "version": 2
  }
}
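The check above can also be run offline against the record as printed here. The sketch below mirrors the one-liner's serialization (sorted keys, compact separators, UTF-8, no ASCII escaping). It assumes the JSON shown on this page is identical to what the API returns under `canonical_record`; the helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record as shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "f0ca5adb20be8b703074165071c0771a274bdeb6faaaa54436042e11cc476101",
        "cross_cats_sorted": ["cs.DL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.IR",
        "submitted_at": "2025-11-14T06:54:48Z",
        "title_canon_sha256": "b9e3ab69407080e883de8c39f3e1f4e7962c186fa6e00e25666ec936589fd6a1",
    },
    "schema_version": "1.0",
    "source": {"id": "2511.11010", "kind": "arxiv", "version": 2},
}


def canonical_sha256(obj) -> str:
    """Hash the compact, key-sorted JSON form, as the shell one-liner does."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


expected = "bb45d6fe33e919a98052af8bab652225c57060321173420488435b4b7661c984"
digest = canonical_sha256(record)
print(digest)
print("match" if digest == expected else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the order in which the JSON object's keys happen to be written.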