GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs

Alison Yan; Benjamin Charles Germain Lee; Claire Gong; Kyle Deeds; Leslie Harka; Mark Phillips; Samuel J Klein; Shannon Zejiang Shen; Shreya Shaji; Trevor Owens; Ying-Hsiang Huang

arxiv: 2511.11010 · v2 · pith:XNC5N7RTnew · submitted 2025-11-14 · 💻 cs.IR · cs.DL

Pith reviewed 2026-05-17 22:50 UTC · model grok-4.3

keywords: government PDFs, multimodal search, semantic search, visual search, web archives, information retrieval, PDF processing

The pith

A public system enables semantic and visual searches over 10 million federal government PDFs at roughly $1,500 in preprocessing cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GovScape as a searchable public interface to the End of Term web archive's collection of federal PDFs. It adds semantic text search and visual element search on individual pages to the usual metadata filters and exact text lookup. This combination lets users issue queries such as locating redacted pages or pie charts without downloading files one by one. The reported preprocessing expense of about $1,500 for 70 million pages shows that the approach is cheap enough to apply at the scale of existing web archives. The authors also describe the open-source components and early steps toward expanding the same methods to more than 100 million PDFs.

Core claim

GovScape is a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 total pages) drawn from the 2020 End of Term crawl. It supports four search modes: metadata facet filters, exact text search, semantic text search, and visual search performed at the level of individual PDF pages. The system was built with an embedding pipeline whose entire preprocessing cost is estimated at $1,500, or 47,000 pages per dollar, and the authors have begun work to extend the same pipeline to the 100-million-PDF scale.
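The cost figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this summary; the projection to 100 million PDFs is a naive illustration that assumes cost scales linearly with page count and that a larger crawl has the same average pages per PDF. Neither assumption is stated in the paper.

```python
# Figures quoted in the summary above.
TOTAL_PDFS = 10_015_993
TOTAL_PAGES = 70_958_487
ESTIMATED_COST_USD = 1_500  # reported as an approximate figure

pages_per_dollar = TOTAL_PAGES / ESTIMATED_COST_USD
pages_per_pdf = TOTAL_PAGES / TOTAL_PDFS

# Hypothetical extrapolation (an assumption, not a claim from the paper):
# cost grows linearly with page count, and the average pages per PDF
# stays the same in a larger collection.
TARGET_PDFS = 100_000_000
projected_cost_usd = TARGET_PDFS * pages_per_pdf / pages_per_dollar

print(f"pages per dollar: {pages_per_dollar:,.0f}")
print(f"average pages per PDF: {pages_per_pdf:.2f}")
print(f"naive projected cost at 100M PDFs: ${projected_cost_usd:,.0f}")
```

The ratio comes out slightly above the rounded 47,000 pages per dollar that the paper reports, which is consistent with the cost itself being an estimate.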

What carries the argument

The embedding pipeline that produces semantic text vectors and visual feature vectors for every page, allowing both meaning-based and appearance-based queries across the full collection.
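To make the idea of embedding-based page search concrete, here is a minimal sketch of nearest-neighbor lookup by cosine similarity. It is not the GovScape implementation: the three-dimensional vectors, the page identifiers, and the `top_k` helper are all hypothetical stand-ins, and a brute-force scan replaces the approximate index a real system would need at 70 million pages.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_k(query: Vector, index: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Return the k page ids whose unit vectors are most similar to the query."""
    q = normalize(query)
    scored = [
        (page_id, sum(a * b for a, b in zip(q, vec)))
        for page_id, vec in index.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for per-page embeddings (hypothetical data).
page_index = {
    "doc-a/page-1": normalize([0.9, 0.1, 0.0]),
    "doc-a/page-2": normalize([0.1, 0.9, 0.1]),
    "doc-b/page-4": normalize([0.8, 0.2, 0.1]),
}

# In a real pipeline the query vector would come from the same model
# that embedded the pages; here it is made up.
query_vector = [1.0, 0.0, 0.0]
results = top_k(query_vector, page_index, k=2)
for page_id, score in results:
    print(f"{page_id}: {score:.3f}")
```

Because every page gets its own vector, results point at individual pages rather than whole documents, which is what allows a query to land on the single page that contains a chart or a redaction.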

Load-bearing premise

The embedding models chosen for semantic text and visual search return results accurate enough for the intended government-document use cases, even though the paper reports no quantitative retrieval tests or user studies.

What would settle it

A side-by-side relevance test would settle it: if human raters judge that the pages returned for a visual query such as 'pie chart' are relevant at rates no better than random selection, the visual search component does not work as claimed.
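One way to phrase such a test, sketched under assumptions: treat each of the top-K results as a draw and ask how likely it would be to see at least the observed number of relevant pages if results were picked at random. The base rate, K, and the rater counts below are hypothetical numbers chosen for illustration, and the one-sided binomial tail is one reasonable choice of test, not something the paper specifies.

```python
from math import comb


def binomial_tail(successes: int, trials: int, base_rate: float) -> float:
    """P(X >= successes) for X ~ Binomial(trials, base_rate)."""
    return sum(
        comb(trials, i) * base_rate**i * (1.0 - base_rate) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Hypothetical inputs: share of all pages that contain a pie chart,
# number of results shown to raters, and how many they marked relevant.
BASE_RATE = 0.02
TOP_K = 20
relevant_in_top_k = 12

observed_precision = relevant_in_top_k / TOP_K
p_value = binomial_tail(relevant_in_top_k, TOP_K, BASE_RATE)

print(f"observed precision@{TOP_K}: {observed_precision:.2f}")
print(f"chance of doing this well at random: {p_value:.2e}")
```

A large tail probability would mean the search is indistinguishable from random selection for that query; a very small one only shows it beats chance, not that it is good enough for users.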

Figures

Figures reproduced from arXiv: 2511.11010 by Alison Yan, Benjamin Charles Germain Lee, Claire Gong, Kyle Deeds, Leslie Harka, Mark Phillips, Samuel J Klein, Shannon Zejiang Shen, Shreya Shaji, Trevor Owens, Ying-Hsiang Huang.

**Figure 1.** An overview of GovScape. Our public search system supports three types of search over 10,015,993 government ...

**Figure 2.** An overview of the GovScape pre-processing pipeline, showing how a single PDF in GovScape is parsed and semantified.

**Figure 3.** Examples of semantic text search (Figure 3a) and visual search (Figure 3b) in GovScape.

**Figure 4.** An overview of the GovScape architecture, showing how the constituent parts of the system interact with one another.

**Figure 5.** A screenshot showing the selected PDF view for detailed document inspection (in this case, the fourth page of a ...

Original abstract

Efforts over the past three decades have produced web archives containing billions of webpage snapshots and petabytes of data. The End of Term Web Archive alone contains, among other file types, millions of PDFs produced by the federal government. While preservation with web archives has been successful, significant challenges for access and discoverability remain. For example, current affordances for browsing the End of Term PDFs are limited to downloading and browsing individual PDFs, as well as performing basic keyword search across them. In this paper, we introduce GovScape, a public search system that supports multimodal searches across 10,015,993 federal government PDFs from the 2020 End of Term crawl (70,958,487 total PDF pages) - to our knowledge, all renderable PDFs in the 2020 crawl that are 50 pages or under. GovScape supports four primary forms of search over these 10 million PDFs: in addition to providing (1) filter conditions over metadata facets including domain and crawl date and (2) exact text search against the PDF text, we provide (3) semantic text search and (4) visual search against the PDFs across individual pages, enabling users to structure queries such as "redacted documents" or "pie charts." We detail the constituent components of GovScape, including the search affordances, embedding pipeline, system architecture, and open source codebase. Significantly, the total estimated compute cost for GovScape's pre-processing pipeline for 10 million PDFs was approximately $1,500, equivalent to 47,000 PDF pages per dollar spent on compute, demonstrating the potential for immediate scalability. Accordingly, we outline steps that we have already begun pursuing toward multimodal search at the 100+ million PDF scale. GovScape can be found at https://www.govscape.net.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GovScape shows a real deployed system for semantic and visual search over 10 million government PDFs at low reported cost, but leaves search accuracy unmeasured.

The main point is that this paper describes a public, working search system called GovScape that adds semantic text search and visual search to 10 million federal government PDFs from the 2020 End of Term archive, covering about 71 million pages. The authors report completing the full preprocessing pipeline for roughly $1,500, which works out to about 47,000 pages per dollar. That concrete cost figure and the fact that the system is live at govscape.net are the parts that stand out as useful data points for anyone scaling archive search tools. The authors also open-source the codebase and outline plans for pushing to 100 million PDFs next.

The paper walks through the architecture, metadata filters, exact text search, and how embeddings are applied for the semantic and visual layers, with examples like querying for redacted documents or pie charts on individual pages. This is a straightforward systems description of taking an existing large collection and layering on multimodal capabilities without reinventing the underlying models. The engineering choices around efficiency and the measured spend give the scalability claim some grounding that many papers lack.

The soft spot is the complete absence of any retrieval evaluation. There are no precision, recall, or mAP numbers, no baseline comparisons, and no checks on whether the visual search reliably finds charts or the semantic search surfaces relevant redacted material. The paper treats the off-the-shelf components as sufficient for the stated use cases but provides no evidence on actual performance. For a systems paper focused on deployment and cost this is not unusual, but it does mean readers have to assume the search works well enough in practice.

This work is aimed at people in digital archives, government transparency projects, and applied information retrieval who need practical examples of handling large PDF collections. A reader interested in cost-effective ways to improve discoverability in web archives would get value from the pipeline details and the reported numbers.

I would send it to peer review. The implementation and cost measurements are concrete enough to merit referee time, even if the review process would likely ask for some basic validation metrics in revision.

Referee Report

1 major / 2 minor

Summary. The paper introduces GovScape, a public multimodal search system for 10,015,993 federal government PDFs (70,958,487 pages) from the 2020 End of Term crawl. It supports metadata facet filters, exact text search, semantic text search, and visual search over individual pages (e.g., queries for 'redacted documents' or 'pie charts'), details the embedding pipeline and system architecture, releases an open-source codebase, and reports a total preprocessing compute cost of approximately $1,500.

Significance. If the search components function as described, this would represent a practical contribution to improving access and discoverability in large web archives of government documents. The reported low cost per page and plans for scaling to 100+ million PDFs highlight feasibility for broader adoption in digital preservation and information retrieval applications.

major comments (1)

The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.
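For readers unfamiliar with the metrics named in this comment, the sketch below shows how precision@K, recall@K, and average precision are commonly computed for a single query. The ranked list and the set of relevant page ids are hypothetical; mAP would be the mean of `average_precision` over a set of queries with human relevance labels.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for page_id in ranked[:k] if page_id in relevant) / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant pages that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant set must not be empty")
    return sum(1 for page_id in ranked[:k] if page_id in relevant) / len(relevant)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at each rank where a relevant page appears.

    Relevant pages that are never retrieved contribute zero.
    """
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = 0
    total = 0.0
    for rank, page_id in enumerate(ranked, start=1):
        if page_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


# Hypothetical ranking for one query and hand-labeled relevant pages.
ranked_pages = ["p1", "p2", "p3", "p4", "p5"]
relevant_pages = {"p1", "p3", "p6"}  # "p6" is relevant but never retrieved

print(precision_at_k(ranked_pages, relevant_pages, k=3))
print(recall_at_k(ranked_pages, relevant_pages, k=3))
print(average_precision(ranked_pages, relevant_pages))
```

Recall is the hardest of the three to measure on a 70-million-page collection, since it requires knowing every relevant page; a precision-oriented check on labeled samples is the cheaper starting point.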

minor comments (2)

Abstract: the filtering criterion of 'all renderable PDFs in the 2020 crawl that are 50 pages or under' would benefit from explicit details on how renderability and page count were determined to assess potential selection biases.
System description: specify the exact embedding models (including versions and any fine-tuning) used for semantic and visual search to improve reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address the major concern below and will revise the manuscript to incorporate additional material that strengthens the presentation of the search components.

Referee: The manuscript provides no quantitative retrieval metrics (e.g., precision@K, recall, mAP), baseline comparisons, error analysis, or user studies for the semantic text search or visual search components. This is load-bearing for the central claim of functional multimodal search, as the effectiveness of the chosen embedding models and visual components for the stated use cases remains unverified.

Authors: We agree that the absence of quantitative retrieval metrics represents a limitation in verifying the effectiveness of the semantic text and visual search components. The manuscript's primary contributions center on the end-to-end system architecture, the public deployment over 10 million PDFs, the open-source codebase, and the low preprocessing cost of approximately $1,500, which demonstrates practical feasibility for large-scale government archives. Nevertheless, to directly address the referee's concern, we will revise the paper by adding a new section on search component validation. This will include: (1) qualitative examples with actual query results for the highlighted use cases such as 'redacted documents' and 'pie charts'; (2) a description of the specific embedding models and visual feature extractors used, along with references to their established performance on related benchmarks in the literature; and (3) a small-scale quantitative evaluation on a sampled subset of pages (e.g., precision-oriented checks against manually labeled examples). We believe these additions will provide sufficient evidence of functionality while preserving the paper's focus on scalable system design rather than comprehensive IR benchmarking.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes the design and implementation of the GovScape search system for 10 million government PDFs, including metadata filters, exact text search, semantic embeddings, and visual search components, along with reported empirical preprocessing costs of approximately $1,500. No mathematical derivations, fitted parameters renamed as predictions, self-definitional equations, or load-bearing self-citations appear in the abstract or system description. Claims rest on the built artifact and measured compute metrics rather than any chain that reduces outputs to inputs by construction. The work is self-contained as an engineering report against external benchmarks of cost and scale.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is primarily an engineering system description rather than a theoretical derivation. No free parameters are fitted to produce a scientific claim; the reported $1,500 cost is an empirical measurement. No new axioms or invented entities are introduced.

pith-pipeline@v0.9.0 · 5672 in / 1123 out tokens · 29970 ms · 2026-05-17T22:50:52.307049+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "We introduce GovScape, a public search system that supports four primary forms of search... semantic text search and visual search... using BAAI/bge-base-en-v1.5 and openai/clip-vit-base-patch32 with Faiss"

IndisputableMonolith/Foundation/AlphaCoordinateFixation · J_uniquely_calibrated_via_higher_derivative

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "total estimated compute cost... $1,500, equivalent to 47,000 PDF pages per dollar"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

History in the age of abundance? : how the web is transforming historical research,

I. Milligan, “History in the age of abundance? : how the web is transforming historical research,” Montreal, 2019

work page 2019

[2] [2]

End of term web archive dataset: Longitudinal web archive of .gov and .mil domains,

M. E. Phillips, K. K. Phillips, and S. Alam, “End of term web archive dataset: Longitudinal web archive of .gov and .mil domains,” in2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL), 2023, pp. 98–101

work page 2023

[3] [3]

‘go fish’: Conceptualising the challenges of engaging national web archives for digital research,

J. Ogden and E. Maemura, “‘go fish’: Conceptualising the challenges of engaging national web archives for digital research,”International journal of digital humanities, vol. 2, no. 1-3, pp. 43–63, 2021

work page 2021

[4] [4]

Collection search

I. Archive, “Collection search.” [Online]. Available: https://web.archive. org/collection-search

work page

[5] [5]

Learning transferable visual models from natural language supervi- sion,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervi- sion,” inProceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, ser. Proceedings of Machine...

work page 2021

[6] [6]

Integrating visual and textual inputs for searching large-scale map collections with clip,

J. Mahowald and B. C. G. Lee, “Integrating visual and textual inputs for searching large-scale map collections with clip,” 2024. [Online]. Available: https://arxiv.org/abs/2410.01190

work page arXiv 2024

[7] [7]

A fully-searchable multimodal dataset of the illustrated london news, 1842–1890,

T. Smits, B. Warner, P. Fyfe, and B. C. G. Lee, “A fully-searchable multimodal dataset of the illustrated london news, 1842–1890,” Journal of Open Humanities Data, 2025. [Online]. Available: https://doi.org/10.5334/johd.284

work page doi:10.5334/johd.284 2025

[8] [8]

A multimodal turn in digital humanities. using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections,

T. Smits and M. Wevers, “A multimodal turn in digital humanities. using contrastive machine learning models to explore, enrich, and analyze digital visual historical collections,”Digital Scholarship in the Humanities, vol. 38, no. 3, pp. 1267–1280, 03 2023. [Online]. Available: https://doi.org/10.1093/llc/fqad008 [9]Towards multimodal computational humani...

work page doi:10.1093/llc/fqad008 2023

[9] [9]

Blind dates: Examining the expression of temporality in historical photographs,

A. Barancov ´a, M. Wevers, and N. van Noord, “Blind dates: Examining the expression of temporality in historical photographs,” 2023

work page 2023

[10] [10]

Grappling with the scale of born-digital government publications: Toward pipelines for processing and searching millions of pdfs,

B. C. G. Lee and T. Owens, “Grappling with the scale of born-digital government publications: Toward pipelines for processing and searching millions of pdfs,”International Journal of Digital Humanities, vol. 3, pp. 91 – 114, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:257159777

work page 2021

[11] [12]

Averting the digital dark age : How archivists, librarians, and technologists built the web a memory,

I. Milligan, “Averting the digital dark age : How archivists, librarians, and technologists built the web a memory,” Baltimore, 2024

work page 2024

[12] [13]

Web archive search as research: Methodological and theoretical implications,

A. Ben-David and H. Huurdeman, “Web archive search as research: Methodological and theoretical implications,”Alexandria, vol. 25, no. 1-2, pp. 93–111, 2014. [Online]. Available: https://doi.org/10.7227/ ALX.0022

work page 2014

[13] [14]

Cargnelutti, K

M. Cargnelutti, K. Mukk, and C. Stanton, February

work page

[14] [15]

[Online]. Available: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/

work page 2024

[15] [16]

The archives unleashed project: Technology, process, and community to improve scholarly access to web archives,

N. Ruest, J. Lin, I. Milligan, and S. Fritz, “The archives unleashed project: Technology, process, and community to improve scholarly access to web archives,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, ser. JCDL ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 157–166. [Online]. Available: https://do...

work page doi:10.1145/3383583.3398513 2020

[16] [17]

The archives unleashed notebook: madlibs for jumpstarting scholarly exploration of web archives,

R. Deschamps, N. Ruest, J. Lin, S. Fritz, and I. Milligan, “The archives unleashed notebook: madlibs for jumpstarting scholarly exploration of web archives,” in 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). Piscataway, NJ, USA: IEEE Press, 2019, pp. 337–338

work page 2019

[17] [18]

Fostering community engagement through datathon events: The archives unleashed experience,

S. Fritz, I. Milligan, N. Ruest, and J. Lin, “Fostering community engagement through datathon events: The archives unleashed experience,” Digital humanities quarterly, vol. 15, no. 1, 2021

work page 2021

[18] [19]

jpl-safedocs/file-observatory: V1.6.1,

R. Stonebraker, M. Milano, and A. Mensikova, “jpl-safedocs/file-observatory: V1.6.1,” Jul. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.8132495

work page 2023

[19] [20]

Solrwayback

SolrWayback, “Solrwayback.” [Online]. Available: https://github.com/netarchivesuite/solrwayback

work page

[20] [21]

Paper Knowledge: Toward a Media History of Documents

L. Gitelman, Paper Knowledge: Toward a Media History of Documents, ser. Sign, storage, transmission. Duke University Press, 2014

work page 2014

[21] [22]

The Declassification Engine: What History Reveals About America’s Top Secrets

M. Connelly, The Declassification Engine: What History Reveals About America’s Top Secrets. Pantheon, 2023

work page 2023

[22] [23]

Diplomatic documents data for international relations: the freedom of information archive database,

M. J. Connelly, R. Hicks, R. Jervis, A. Spirling, and C. H. Suong, “Diplomatic documents data for international relations: the freedom of information archive database,” Conflict Management and Peace Science, vol. 38, no. 6, pp. 762–781, 2021. [Online]. Available: https://doi.org/10.1177/0738894220930326

work page doi:10.1177/0738894220930326 2021

[23] [24]

The data liberation project

D. L. Project, “The data liberation project.” [Online]. Available: https://www.data-liberation-project.org/

work page

[24] [25]

Using Artificial Intelligence to Identify State Secrets

R. R. Souza, F. C. Coelho, R. Shah, and M. Connelly, “Using artificial intelligence to identify state secrets,” 2016. [Online]. Available: https://arxiv.org/abs/1611.00356

work page arXiv 2016

[25] [26]

New evidence and new methods for analyzing the iranian revolution as an intelligence failure,

M. Connelly, R. Hicks, R. Jervis, and A. Spirling, “New evidence and new methods for analyzing the iranian revolution as an intelligence failure,” Intelligence and National Security, vol. 36, no. 6, pp. 781–806, 2021. [Online]. Available: https://doi.org/10.1080/02684527.2021.1946959

work page doi:10.1080/02684527.2021.1946959 2021

[26] [27]

Amazing military infographics,

P. Ford, “Amazing military infographics,” May

work page

[27] [28]

[Online]. Available: https://medium.com/message/amazing-military-infographics-1ba60bdc32e7

work page

[28] [29]

Powell.pps: Close & distant reading of primary sources in web archives,

T. Owens, B. C. G. Lee, and J. Estess, “Powell.pps: Close & distant reading of primary sources in web archives,” 2024

work page 2024

[29] [30]

Slide decks as government publications: exploring two decades of powerpoint files archived from us government websites

T. Owens and J. Estess, “Slide decks as government publications: exploring two decades of powerpoint files archived from us government websites,” Archival Science, vol. 23, pp. 223–246, 2023

work page 2023

[30] [31]

Moving the end of term web archive to the cloud to encourage research use and reuse,

M. Phillips and S. Alam, “Moving the end of term web archive to the cloud to encourage research use and reuse,” 2022 Web Archiving and Digital Libraries Virtual Workshop, 2022. [Online]. Available: https://digital.library.unt.edu/ark:/67531/metadc1998717/

work page 2022

[31] [32]

Improving access to web archives through innovative analysis of pdf content,

M. Phillips and K. Murray, “Improving access to web archives through innovative analysis of pdf content,” Archiving (IS & T’s Archiving Conference), vol. 10, no. 1, pp. 186–192, 2013. [Online]. Available: https://digital.library.unt.edu/ark:/67531/metadc155622/

work page 2013

[32] [33]

The Faiss library

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou, “The faiss library,” CoRR, vol. abs/2401.08281, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2401.08281

work page arXiv 2024

[33] [34]

Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters,

S. Gollapudi, N. Karia, V. Sivashankar, R. Krishnaswamy, N. Begwani, S. Raz, Y. Lin, Y. Zhang, N. Mahapatro, P. Srinivasan, A. Singh, and H. V. Simhadri, “Filtered-diskann: Graph algorithms for approximate nearest neighbor search with filters,” in Proceedings of the ACM Web Conference 2023, WWW 2023, Austin, TX, USA, 30 April 2023 - 4 May 2023, Y. Din...

work page doi:10.1145/3543507.3583552 2023

[34] [35]

The diskann library: Graph-based indices for fast, fresh and filtered vector search,

R. Krishnaswamy, M. D. Manohar, and H. V. Simhadri, “The diskann library: Graph-based indices for fast, fresh and filtered vector search,” IEEE Data Eng. Bull., vol. 48, no. 3, pp. 20–42, 2024. [Online]. Available: http://sites.computer.org/debull/A24sept/p20.pdf

work page 2024

[35] [36]

C-pack: Packaged resources to advance general chinese embedding,

S. Xiao, Z. Liu, P. Zhang, and N. Muennighoff, “C-pack: Packaged resources to advance general chinese embedding,” 2023

work page 2023

[36] [37]

MTEB: Massive Text Embedding Benchmark

N. Muennighoff, N. Tazi, L. Magne, and N. Reimers, “Mteb: Massive text embedding benchmark,” 2023. [Online]. Available: https://arxiv.org/abs/2210.07316

work page arXiv 2023

[37] [38]

arXiv preprint arXiv:2502.13595

K. Enevoldsen, I. Chung, I. Kerboua, M. Kardos, A. Mathur, D. Stap, J. Gala, W. Siblini, D. Krzemiński, G. I. Winata, S. Sturua, S. Utpala, M. Ciancone, M. Schaeffer, G. Sequeira, D. Misra, S. Dhakal, J. Rystrøm, R. Solomatin, Ömer Çağatan, A. Kundu, M. Bernstorff, S. Xiao, A. Sukhlecha, B. Pahwa, R. Poświata, K. K. GV, S. Ashraf, D. Auras, B. P...

work page arXiv 2025

[38] [39]

Digital collections explorer: An open-source, multimodal viewer for searching digital collections,

Y.-H. Huang and B. C. G. Lee, “Digital collections explorer: An open-source, multimodal viewer for searching digital collections,” 2025. [Online]. Available: https://arxiv.org/abs/2507.00961

work page arXiv 2025

[39] [40]

S. J. Subramanya, Devvrit, R. Kadekodi, R. Krishaswamy, and H. V. Simhadri, DiskANN: fast accurate billion-point nearest neighbor search on a single node. Red Hook, NY, USA: Curran Associates Inc., 2019

work page 2019

[40] [41]

olmocr: Unlocking trillions of tokens in pdfs with vision language models,

J. Poznanski, A. Rangapur, J. Borchardt, J. Dunkelberger, R. Huff, D. Lin, A. Rangapur, C. Wilhelm, K. Lo, and L. Soldaini, “olmocr: Unlocking trillions of tokens in pdfs with vision language models,”

work page

[41] [42]

olmocr: Unlocking trillions of tokens in pdfs with vision language models. arXiv preprint arXiv:2502.18443, 2025a

[Online]. Available: https://arxiv.org/abs/2502.18443

work page arXiv

[42] [43]

olmocr 2: Unit test rewards for document ocr,

J. Poznanski, L. Soldaini, and K. Lo, “olmocr 2: Unit test rewards for document ocr,” 2025. [Online]. Available: https://arxiv.org/abs/2510.19817

work page 2025