
arxiv: 2511.11010 · v2 · pith:XNC5N7RTnew · submitted 2025-11-14 · 💻 cs.IR · cs.DL

GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Pith reviewed 2026-05-17 22:50 UTC · model grok-4.3

classification 💻 cs.IR cs.DL
keywords government PDFs · multimodal search · semantic search · visual search · web archives · information retrieval · PDF processing

The pith

A public system enables semantic and visual searches over 10 million federal government PDFs at roughly $1,500 in preprocessing cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GovScape as a searchable public interface to the End of Term web archive's collection of federal PDFs. It adds semantic text search and visual element search on individual pages to the usual metadata filters and exact text lookup. This combination lets users issue queries such as locating redacted pages or pie charts without downloading files one by one. The reported preprocessing expense of about $1,500 for 70 million pages shows that the approach is cheap enough to apply at the scale of existing web archives. The authors also describe the open-source components and early steps toward expanding the same methods to more than 100 million PDFs.

Core claim

GovScape is a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 total pages) drawn from the 2020 End of Term crawl. It supports four search modes: metadata facet filters, exact text search, semantic text search, and visual search performed at the level of individual PDF pages. The system was built with an embedding pipeline whose entire preprocessing cost is estimated at $1,500, or 47,000 pages per dollar, and the authors have begun work to extend the same pipeline to the 100-million-PDF scale.
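The cost figures above can be checked with simple arithmetic. The sketch below recomputes pages per dollar from the reported totals (about 47,300, which the paper rounds to 47,000) and then extrapolates to 100 million PDFs. The extrapolation is an assumption made here for illustration, not a figure from the paper: it supposes that cost scales linearly with PDF count and that the page mix stays the same.

```python
# Figures quoted in the summary above.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
PREPROCESSING_COST_USD = 1_500  # reported as an approximate estimate

pages_per_dollar = TOTAL_PAGES / PREPROCESSING_COST_USD
pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS
cost_per_pdf = PREPROCESSING_COST_USD / TOTAL_PDFS

# Hypothetical extrapolation (not from the paper): assumes linear cost
# scaling and an unchanged average number of pages per PDF.
TARGET_PDFS = 100_000_000
projected_cost = cost_per_pdf * TARGET_PDFS

print(f"{pages_per_dollar:,.0f} pages per dollar")
print(f"{pages_per_pdf:.2f} pages per PDF on average")
print(f"${projected_cost:,.0f} projected for {TARGET_PDFS:,} PDFs")
```

Under these assumptions the 100-million-PDF target would cost on the order of $15,000 in preprocessing compute, which is consistent with the claim that the approach is cheap relative to archive scale.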

What carries the argument

The embedding pipeline that produces semantic text vectors and visual feature vectors for every page, allowing both meaning-based and appearance-based queries across the full collection.
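To make the mechanism concrete, here is a minimal sketch of embedding-based page retrieval combined with a metadata filter. It is not the authors' implementation: the three-dimensional vectors, page identifiers, and domain names are invented for illustration, and the linear scan stands in for whatever index the deployed system uses (a collection of 70 million pages would presumably need an approximate nearest-neighbor index).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, page_index, top_k=3, domain=None):
    """Rank pages by similarity to the query, optionally filtered by domain."""
    candidates = [p for p in page_index if domain is None or p["domain"] == domain]
    scored = [(cosine_similarity(query_vec, p["vector"]), p["page_id"]) for p in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return [page_id for _, page_id in scored[:top_k]]

# Hypothetical toy index: one entry per PDF page, each with a made-up embedding.
PAGE_INDEX = [
    {"page_id": "a.pdf#p1", "domain": "example.gov", "vector": [0.9, 0.1, 0.0]},
    {"page_id": "a.pdf#p2", "domain": "example.gov", "vector": [0.1, 0.9, 0.1]},
    {"page_id": "b.pdf#p1", "domain": "example.mil", "vector": [0.8, 0.2, 0.1]},
    {"page_id": "c.pdf#p4", "domain": "example.mil", "vector": [0.0, 0.2, 0.9]},
]

# In a real system the query vector would come from the same embedding
# model as the pages; here it is written by hand.
query = [1.0, 0.0, 0.0]
print(search(query, PAGE_INDEX, top_k=2))
print(search(query, PAGE_INDEX, top_k=1, domain="example.mil"))
```

The same ranking function serves both semantic text search and visual search; only the model that produces the vectors differs. Indexing per page rather than per document is what lets a query land on a single chart or redacted page inside a longer PDF.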

Load-bearing premise

The embedding models chosen for semantic text and visual search return results accurate enough for the intended government-document use cases, even though the paper reports no quantitative retrieval tests or user studies.

What would settle it

A side-by-side relevance test with human raters would settle it. If the pages returned for a visual query such as 'pie chart' are judged relevant at rates no better than random selection, the visual search component does not work as claimed.
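One way to frame the comparison against random selection is a one-sided binomial test: given the fraction of pages in the collection that match the query (the base rate), how likely is it that a random draw of K pages would contain at least as many relevant pages as the search returned? The sketch below uses invented numbers; the base rate and rater counts are hypothetical, and treating the top K results as independent draws is a simplifying assumption.

```python
from math import comb

def binomial_tail(successes, trials, base_rate):
    """P(X >= successes) for X ~ Binomial(trials, base_rate)."""
    return sum(
        comb(trials, k) * base_rate**k * (1 - base_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical rater outcome for the query "pie chart".
TOP_K = 20               # pages shown to raters
relevant_in_top_k = 12   # pages the raters judged relevant
base_rate = 0.02         # assumed share of pages in the collection with a pie chart

p_value = binomial_tail(relevant_in_top_k, TOP_K, base_rate)
print(f"observed precision: {relevant_in_top_k / TOP_K:.2f}")
print(f"chance of doing this well by random selection: {p_value:.3g}")
```

A very small tail probability would indicate that the visual search beats random selection; a value near 1 would support the failure scenario described above. Estimating the base rate itself requires labeling a random sample of pages.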

Figures

Figures reproduced from arXiv: 2511.11010 by Alison Yan, Benjamin Charles Germain Lee, Claire Gong, Kyle Deeds, Leslie Harka, Mark Phillips, Samuel J Klein, Shannon Zejiang Shen, Shreya Shaji, Trevor Owens, Ying-Hsiang Huang.

Figure 1: An overview of GovScape. Our public search system supports three types of search over 10,015,993 government ...
Figure 2: An overview of the GovScape pre-processing pipeline, showing how a single PDF in GovScape is parsed and semantified.
Figure 3: Examples of semantic text search (Figure 3a) and visual search (Figure 3b) in GovScape.
Figure 4: An overview of the GovScape architecture, showing how the constituent parts of the system interact with one another.
Figure 5: A screenshot showing the selected PDF view for detailed document inspection (in this case, the fourth page of a ...
Original abstract

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces GovScape, a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 pages) from the 2020 End of Term crawl. It supports metadata facet filters, exact text search, semantic text search, and visual search over individual pages (e.g., queries for 'redacted documents' or 'pie charts'), details the embedding pipeline and system architecture, releases an open-source codebase, and reports a total preprocessing compute cost of approximately $1,500.

Significance. If the search components function as described, this would represent a practical contribution to improving access and discoverability in large web archives of government documents. The reported low cost per page and plans for scaling to 100+ million PDFs highlight feasibility for broader adoption in digital preservation and information retrieval applications.

major comments (1)
  1. The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.
minor comments (2)
  1. Abstract: the filtering criterion of 'all renderable PDFs in the 2020 crawl that are 50 pages or under' would benefit from explicit details on how renderability and page count were determined to assess potential selection biases.
  2. System description: specify the exact embedding models (including versions and any fine-tuning) used for semantic and visual search to improve reproducibility.
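For readers unfamiliar with the metrics named in the major comment, the following sketch gives standard definitions of precision@K, recall@K, and average precision for a single query. Mean average precision (mAP) is the mean of average precision over a set of queries. The page identifiers and relevance labels are made up, and nothing here comes from the paper, which reports no such evaluation.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for pid in ranked_ids[:k] if pid in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(1 for pid in ranked_ids[:k] if pid in relevant_ids) / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@rank over the ranks at which a relevant item appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)

# Hypothetical query results and human relevance labels.
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p1", "p3", "p6"}  # p6 is relevant but was never retrieved

print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(average_precision(ranked, relevant))
```

Recall and average precision both require knowing every relevant page for a query, which is expensive to establish in a 70-million-page collection; precision@K only needs labels for the pages actually returned.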

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address the major concern below and will revise the manuscript to incorporate additional material that strengthens the presentation of the search components.

Point-by-point responses
  1. Referee: The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.

    Authors: We agree that the absence of quantitative retrieval metrics represents a limitation in verifying the effectiveness of the semantic text and visual search components. The manuscript's primary contributions center on the end-to-end system architecture, the public deployment over 10 million PDFs, the open-source codebase, and the low preprocessing cost of approximately $1,500, which demonstrates practical feasibility for large-scale government archives. Nevertheless, to directly address the referee's concern, we will revise the paper by adding a new section on search component validation. This will include: (1) qualitative examples with actual query results for the highlighted use cases such as 'redacted documents' and 'pie charts'; (2) a description of the specific embedding models and visual feature extractors used, along with references to their established performance on related benchmarks in the literature; and (3) a small-scale quantitative evaluation on a sampled subset of pages (e.g., precision-oriented checks against manually labeled examples). We believe these additions will provide sufficient evidence of functionality while preserving the paper's focus on scalable system design rather than comprehensive IR benchmarking.

    Revision: yes
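A small-scale precision check of the kind proposed in item (3) yields an estimate with sampling error, so it helps to report an interval rather than a single number. The sketch below computes a Wilson score interval for a proportion. The sample size and label counts are hypothetical and the choice of the Wilson interval is a suggestion made here, not something the rebuttal specifies.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# Hypothetical outcome: 85 of 100 sampled results labeled relevant.
low, high = wilson_interval(85, 100)
print(f"precision estimate: 0.85, 95% interval: [{low:.3f}, {high:.3f}]")
```

With only 100 labeled results the interval spans roughly fourteen percentage points, which gives a sense of how large the labeled subset must be before differences between embedding models become distinguishable.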

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes the design and implementation of the GovScape search system for 10 million government PDFs, including metadata filters, exact text search, semantic embeddings, and visual search components, along with reported empirical preprocessing costs of approximately $1,500. No mathematical derivations, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or system description. Claims rest on the built artifact and measured compute metrics rather than any chain that reduces outputs to inputs by construction. The work is self-contained as an engineering report against external benchmarks of cost and scale.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is primarily an engineering system description rather than a theoretical derivation. No free parameters are fitted to produce a scientific claim; the reported $1,500 cost is an empirical measurement. No new axioms or invented entities are introduced.

pith-pipeline@v0.9.0 · 5672 in / 1123 out tokens · 29970 ms · 2026-05-17T22:50:52.307049+00:00 · methodology


