GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs
Pith reviewed 2026-05-17 22:50 UTC · model grok-4.3
The pith
A public system enables semantic and visual searches over 10 million federal government PDFs at roughly $1,500 in preprocessing cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GovScape is a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 total pages) drawn from the 2020 End of Term crawl. It supports four search modes: metadata facet filters, exact text search, semantic text search, and visual search performed at the level of individual PDF pages. The system was built with an embedding pipeline whose entire preprocessing cost is estimated at $1,500, or 47,000 pages per dollar, and the authors have begun work to extend the same pipeline to the 100-million-PDF scale.
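The cost figure can be checked with a few lines of arithmetic. The sketch below uses only the page count, PDF count, and approximate cost reported in the paper. The function `projected_cost` is a hypothetical helper, and its linear extrapolation to 100 million PDFs is an assumption made here for illustration, not an estimate from the paper.

```python
# Figures as reported in the paper (the cost is an approximate estimate).
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
ESTIMATED_COST_USD = 1_500


def pages_per_dollar(pages: int, cost_usd: float) -> float:
    """Throughput of the preprocessing pipeline in pages per dollar."""
    return pages / cost_usd


def projected_cost(target_pdfs: int) -> float:
    """Naive linear extrapolation (assumption, not from the paper):
    pages per PDF and cost per page stay the same at larger scale."""
    pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS
    cost_per_page = ESTIMATED_COST_USD / TOTAL_PAGES
    return target_pdfs * pages_per_pdf * cost_per_page


rate = pages_per_dollar(TOTAL_PAGES, ESTIMATED_COST_USD)
print(f"{rate:,.0f} pages per dollar")
print(f"${projected_cost(100_000_000):,.0f} for 100M PDFs (linear extrapolation)")
```

The extrapolation likely understates the real cost if the larger collection drops the 50-page limit, since average pages per PDF would then rise.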
What carries the argument
The embedding pipeline that produces semantic text vectors and visual feature vectors for every page, allowing both meaning-based and appearance-based queries across the full collection.
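To make the mechanism concrete, here is a minimal sketch of embedding-based page retrieval. It is not the GovScape implementation: the vectors are hand-written three-dimensional toys rather than model outputs, the page identifiers are hypothetical, and the search is brute force rather than an approximate nearest-neighbor index. It shows only the core idea that a query and every page live in the same vector space and pages are ranked by cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query_vec, page_vecs, k=2):
    """Brute-force search: return the k page ids most similar to the query."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(v))), page_id)
        for page_id, v in page_vecs.items()
    ]
    scored.sort(reverse=True)  # highest cosine similarity first
    return [page_id for _, page_id in scored[:k]]


# Hypothetical page embeddings keyed by "<pdf>#<page>" (illustrative only).
toy_pages = {
    "report.pdf#p1": [0.9, 0.1, 0.0],
    "report.pdf#p2": [0.1, 0.9, 0.1],
    "memo.pdf#p1": [0.0, 0.2, 0.9],
}
toy_query = [0.2, 1.0, 0.0]
print(top_k(toy_query, toy_pages, k=2))
```

At 70 million pages a linear scan like this would be slow, which is why systems of this kind typically rely on a vector index instead.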
Load-bearing premise
The embedding models chosen for semantic text and visual search return results accurate enough for the intended government-document use cases, even though the paper reports no quantitative retrieval tests or user studies.
What would settle it
A side-by-side relevance test in which human raters judge the pages returned for a visual query such as 'pie chart'. If the returned pages prove relevant at rates no better than random selection from the collection, the visual search component does not work as claimed.
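One way to run such a comparison against random selection is a one-sided binomial test. The sketch below is an illustration under stated assumptions: the base rate (the fraction of all pages that would be relevant to the query) is a hypothetical input that would have to be estimated from a random sample, and the function names and the 0.05 threshold are choices made here, not taken from the paper.

```python
from math import comb


def binomial_tail(k: int, hits: int, base_rate: float) -> float:
    """P(X >= hits) for X ~ Binomial(k, base_rate): the chance that k randomly
    chosen pages would contain at least `hits` relevant ones."""
    return sum(
        comb(k, i) * base_rate**i * (1 - base_rate) ** (k - i)
        for i in range(hits, k + 1)
    )


def beats_random(k: int, hits: int, base_rate: float, alpha: float = 0.05) -> bool:
    """True when the observed hit count is unlikely under random selection."""
    return binomial_tail(k, hits, base_rate) < alpha


# Hypothetical outcome: raters mark 12 of the top 20 results as pie charts,
# and an assumed 2% of all pages contain a pie chart.
print(beats_random(k=20, hits=12, base_rate=0.02))
```

A result of `False` for a query would be the kind of evidence described above: the ranked results are indistinguishable from a random draw.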
Original abstract
Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GovScape, a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 pages) from the 2020 End of Term crawl. It supports metadata facet filters, exact text search, semantic text search, and visual search over individual pages (e.g., queries for 'redacted documents' or 'pie charts'), details the embedding pipeline and system architecture, releases an open-source codebase, and reports a total preprocessing compute cost of approximately $1,500.
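The two non-embedding search modes can be sketched in a few lines. Everything below is hypothetical: the `Page` record, its field names, the example URLs, and the choice of case-insensitive substring matching are assumptions for illustration and do not describe GovScape's actual schema or matching rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Page:
    pdf_url: str
    domain: str
    crawl_date: date
    page_number: int
    text: str


def facet_filter(pages, domain=None, crawled_on_or_after=None):
    """Keep pages matching every metadata facet that was supplied."""
    kept = []
    for page in pages:
        if domain is not None and page.domain != domain:
            continue
        if crawled_on_or_after is not None and page.crawl_date < crawled_on_or_after:
            continue
        kept.append(page)
    return kept


def exact_text_search(pages, phrase):
    """Case-insensitive substring match against the extracted page text."""
    needle = phrase.lower()
    return [page for page in pages if needle in page.text.lower()]


# Toy records with placeholder URLs.
toy_corpus = [
    Page("https://a.example.gov/x.pdf", "a.example.gov", date(2020, 11, 3), 1, "Annual Budget Summary"),
    Page("https://a.example.gov/x.pdf", "a.example.gov", date(2020, 11, 3), 2, "Appendix"),
    Page("https://b.example.gov/y.pdf", "b.example.gov", date(2021, 1, 5), 1, "Budget request"),
]
matches = exact_text_search(facet_filter(toy_corpus, domain="a.example.gov"), "budget")
print([(p.pdf_url, p.page_number) for p in matches])
```

Applying the facet filter first narrows the candidate set before any text or embedding comparison runs.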
Significance. If the search components function as described, this would represent a practical contribution to improving access and discoverability in large web archives of government documents. The reported low cost per page and plans for scaling to 100+ million PDFs highlight feasibility for broader adoption in digital preservation and information retrieval applications.
major comments (1)
- The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.
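For reference, the metrics named in this comment are straightforward to compute once relevance judgments exist. The sketch below uses the standard definitions of precision@K, recall@K, and (mean) average precision. The page identifiers and judgments are made up for illustration and do not come from the paper.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision values taken at each rank where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """mAP over a list of (ranked_results, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Hypothetical judged query: pages "a" and "c" are relevant.
ranked_pages = ["a", "b", "c", "d"]
relevant_pages = {"a", "c"}
print(precision_at_k(ranked_pages, relevant_pages, 2))
print(recall_at_k(ranked_pages, relevant_pages, 2))
print(average_precision(ranked_pages, relevant_pages))
```

Recall and mAP require knowing every relevant page for a query, which is rarely feasible at this collection size, so a precision-oriented evaluation over the top results is the more practical starting point.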
minor comments (2)
- Abstract: the filtering criterion of 'all renderable PDFs in the 2020 crawl that are 50 pages or under' would benefit from explicit details on how renderability and page count were determined to assess potential selection biases.
- System description: specify the exact embedding models (including versions and any fine-tuning) used for semantic and visual search to improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments. We address the major concern below and will revise the manuscript to incorporate additional material that strengthens the presentation of the search components.
- Referee: The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.
Authors: We agree that the absence of quantitative retrieval metrics represents a limitation in verifying the effectiveness of the semantic text and visual search components. The manuscript's primary contributions center on the end-to-end system architecture, the public deployment over 10 million PDFs, the open-source codebase, and the low preprocessing cost of approximately $1,500, which demonstrates practical feasibility for large-scale government archives. Nevertheless, to directly address the referee's concern, we will revise the paper by adding a new section on search component validation. This will include: (1) qualitative examples with actual query results for the highlighted use cases such as 'redacted documents' and 'pie charts'; (2) a description of the specific embedding models and visual feature extractors used, along with references to their established performance on related benchmarks in the literature; and (3) a small-scale quantitative evaluation on a sampled subset of pages (e.g., precision-oriented checks against manually labeled examples). We believe these additions will provide sufficient evidence of functionality while preserving the paper's focus on scalable system design rather than comprehensive IR benchmarking.
Revision: yes
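A small-scale check of the kind proposed in item (3) yields a precision estimate with real uncertainty, and a confidence interval makes that explicit. The sketch below computes a Wilson score interval for a proportion. The sample numbers are hypothetical, and choosing the Wilson interval is a suggestion made here, not something the rebuttal specifies.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical labeling result: 42 of 50 sampled results judged relevant.
low, high = wilson_interval(hits=42, n=50)
print(f"precision 0.84, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

With only a few dozen labeled results per query the interval stays wide, which is worth reporting alongside any headline precision figure.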
Circularity Check
No significant circularity in derivation chain
The paper describes the design and implementation of the GovScape search system for 10 million government PDFs, including metadata filters, exact text search, semantic embeddings, and visual search components, along with reported empirical preprocessing costs of approximately $1,500. No mathematical derivations, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or system description. Claims rest on the built artifact and measured compute metrics rather than any chain that reduces outputs to inputs by construction. The work is self-contained as an engineering report against external benchmarks of cost and scale.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We introduce GovScape, a public search system that supports four primary forms of search... semantic text search and visual search... using BAAI/bge-base-en-v1.5 and openai/clip-vit-base-patch32 with Faiss
- IndisputableMonolith/Foundation/AlphaCoordinateFixation · J_uniquely_calibrated_via_higher_derivative · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: total estimated compute cost... $1,500, equivalent to 47,000 PDF pages per dollar
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.