
arXiv: 2606.28344 · v1 · pith:J3TZXGCKnew · submitted 2026-06-01 · 💻 cs.IR · cs.AI · cs.CL · cs.CV · cs.LG

PIXELRAG: Web Screenshots Beat Text for Retrieval-Augmented Generation

Pith reviewed 2026-06-30 11:29 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL · cs.CV · cs.LG
keywords PixelRAG · retrieval-augmented generation · web screenshots · visual retrieval · multimodal RAG · Wikipedia screenshots · vision-language models

The pith

PixelRAG shows that retrieving and reading web pages as screenshots outperforms text extraction for retrieval-augmented generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that web retrieval-augmented generation can work directly on the visual form of pages instead of converting them to text. Current systems parse HTML into linear text and lose layout, formatting, and structure in the process. PixelRAG builds a visual index over 30 million Wikipedia screenshots, fine-tunes a visual embedding model on contrastive screenshot pairs, and passes the retrieved images straight to a vision-language model. The approach beats both no-retrieval and text RAG baselines on text-heavy tasks such as NQ and SimpleQA, as well as on multimodal and agentic benchmarks. If the result holds, it implies that the web's native visual representation can replace text pipelines for both higher accuracy and lower token cost through image compression.

Core claim

PixelRAG is a retrieval-augmented generation pipeline that represents websites as their native screenshot images rather than extracted text. It scales a visual embedding index to a full 30-million-image Wikipedia datastore, fine-tunes the embedding model on curated contrastive screenshot pairs, and supplies the retrieved pixels directly to a vision-language model. This end-to-end visual pipeline improves accuracy over text-based RAG by up to 18.1 percent on tasks ranging from open-domain QA to noisy news and agentic benchmarks, while also enabling up to 3x token reduction via lower-resolution compression.
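
The retrieval step described above can be illustrated with a minimal sketch. It assumes each screenshot tile and each query has already been mapped to a fixed-length embedding by some visual embedding model, and it ranks tiles by cosine similarity with a brute-force scan. The tile identifiers, the three-dimensional toy vectors, and the function names are hypothetical; the summary above does not specify the similarity function or the index structure PixelRAG actually uses, and a 30-million-image datastore would presumably rely on an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, tile_index, k=3):
    """Return the ids of the k tiles most similar to the query.

    tile_index maps a (hypothetical) tile id to its embedding.
    This is an exhaustive scan, suitable only for toy data.
    """
    scored = [(cosine(query_vec, vec), tile_id) for tile_id, vec in tile_index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tile_id for _, tile_id in scored[:k]]


# Toy embeddings standing in for the output of a visual embedding model.
tile_index = {
    "page_a/tile_0.png": [0.9, 0.1, 0.0],
    "page_a/tile_1.png": [0.2, 0.8, 0.1],
    "page_b/tile_0.png": [0.0, 0.2, 0.9],
}
query_vec = [0.1, 0.9, 0.0]

top_tiles = retrieve_top_k(query_vec, tile_index, k=2)
print(top_tiles)
```

In a pipeline of the kind summarized here, the returned tile images themselves (not any text extracted from them) would then be passed to the vision-language reader.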

What carries the argument

A visual retrieval index over full-page screenshot images, fine-tuned on contrastive pairs and fed directly as pixels to a vision-language model without any text conversion step.

If this is right

  • RAG systems can preserve page layout and visual structure without complex HTML parsing pipelines.
  • Image compression offers a practical way to cut token usage while holding accuracy steady.
  • Performance advantages appear even on tasks that have historically been treated as purely textual.
  • The same visual index supports both text-centric and multimodal question answering.
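
The compression point can be made concrete with a little arithmetic. The sketch below assumes a reader that spends one token per square patch of the input image, so that token count scales with image area. The 28-pixel patch size and the tile dimensions are illustrative assumptions, not values taken from the paper. Under that assumption, shrinking each side of a tile by a factor of about 1/√3 cuts the token count roughly threefold; whether accuracy survives the downscaling is the paper's empirical claim and is not something this arithmetic can show.

```python
import math


def visual_tokens(width_px, height_px, patch_px=28):
    """Hypothetical token count: one token per patch_px x patch_px patch."""
    return math.ceil(width_px / patch_px) * math.ceil(height_px / patch_px)


# Illustrative tile sizes (assumed, not from the paper).
full = visual_tokens(1120, 1120)    # 40 x 40 patches
reduced = visual_tokens(644, 644)   # 23 x 23 patches

ratio = full / reduced
print(full, reduced, round(ratio, 2))
```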

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar visual pipelines could be tested on other document collections where layout carries meaning, such as scientific papers or product pages.
  • Hybrid systems that retrieve both text and pixels might combine the strengths of each representation.
  • The result invites direct measurement of how much information vision-language models extract from screenshots versus parsed text on identical questions.

Load-bearing premise

That any accuracy gains come from the pixel representation of pages rather than from differences in model capacity or training data between the visual and text retrieval pipelines.

What would settle it

A controlled comparison in which a text RAG system is given the same downstream vision-language model and equivalent contrastive training data on the same corpus, yet matches or exceeds PixelRAG accuracy on NQ and SimpleQA.

Figures

Figures reproduced from arXiv: 2606.28344 by Joseph E. Gonzalez, Lesheng Jin, Matei Zaharia, Paul Teiletche, Sewon Min, Yichuan Wang, Zhifei Li, Zirui Wang.

Figure 1. Overview of PIXELRAG. Text-based RAG (top) parses HTML into a text index and retrieves text chunks for the reader model. PIXELRAG (bottom, ours) renders each webpage, builds a visual index, and retrieves screenshot tiles for the reader: no parser required, fully visual.
Figure 2. Hard-negative mining with false-negative filtering. The LLM correctly answers the query from both the positive tile (left, kept) and a second tile (center, dropped as false negative), but cannot answer from a topically adjacent page (right, kept as hard negative).
Figure 3. SimpleQA accuracy versus average input tokens across four reader models ( …
Figure 5. SimpleQA accuracy vs. input tokens under image …
Figure 6. A Wikipedia page as it appears online in a browser (left) and after our rendering pipeline …
Figure 7. VLM reading ability across model generations on SimpleQA. Q = Qwen; L3.2 = Llama …
Figure 8. Visual information loss during HTML parsing. Each row shows a rendered Wikipedia page …
Figure 9. Retrieval signal loss under text linearization. The left panel shows ranked text chunks for the …
Figure 10. Parser loss. The 2010 Champions League Final article’s match-statistics table is destroyed by HTML-to-text linearization, so no text chunk in the corpus contains the answer. The pixel retriever surfaces the rendered statistics table as the top tile. This is the same example shown in …
Figure 11. Rank loss (paragraph evidence). Once the infobox is linearized, its flattened key–value text out-ranks the answer-bearing body paragraph (which falls to rank 12): the infobox lists Dalí’s own birth/death, not his mother’s, yet matches the query on the entity name. The visual embedding keeps the infobox sidebar structurally distinct from the body section, surfacing the relevant tile in the top-3 …
Figure 12. Rank loss (extreme rank gap). Text retrieval places the Shepard infobox chunk at rank 1: the correct article, but the infobox does not list the nominating President. The body paragraph that does contain the answer falls all the way to rank 66. The pixel retriever surfaces the answer-bearing tile at rank 3.
Figure 13. Reader loss. Both modalities retrieve the same gold article at rank 1, and the answer appears verbatim in the text chunk. However, the linearized list flattens the year–role–name hierarchy into uniform dash-prefixed lines, and the text reader attributes the 1995 honorable mention to 1996. The rendered tile preserves the visual grouping by year, allowing the VLM to locate the correct entry.
Figure 14. Example of synthetic query generation (Stage 1). The rendered tile (left) is sent together …
Figure 15. Synthetic query generation prompt (Stage 1). The model is sent this text together with the …
Figure 16. Self-contained-query filter prompt (Stage 1, first false-positive filter). Queries labelled …
Figure 17. Evaluation prompts. Hard-negative Stage A (answer, blue): the VLM sees only the candidate tile(s) and the query, returning a short answer or CANNOT_ANSWER. Stage B (judge, blue): classifies the candidate as CORRECT (false negative, dropped), WRONG, or CANNOT_ANSWER (hard negative, kept). Evidence QA (gray): reader system prompts for text-only (top) and multimodal (bottom) query benchmarks.
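
The captions for Figures 2 and 17 describe a two-stage filter for hard-negative mining: a model first tries to answer the query from a candidate tile alone, and a judge then labels that attempt. The sketch below is one plausible reading of that procedure, not the authors' code. The function names, tile identifiers, and toy answers are hypothetical, and the two model calls are replaced by stand-in functions. It also assumes that both WRONG and CANNOT_ANSWER candidates are kept as hard negatives, which the caption leaves slightly ambiguous.

```python
def mine_hard_negatives(query, gold_answer, candidate_tiles, answer_fn, judge_fn):
    """Keep candidates that do NOT answer the query; drop false negatives.

    answer_fn(query, tile) -> short answer string or "CANNOT_ANSWER"  (Stage A)
    judge_fn(predicted, gold) -> "CORRECT" | "WRONG" | "CANNOT_ANSWER" (Stage B)
    """
    kept = []
    for tile in candidate_tiles:
        predicted = answer_fn(query, tile)
        verdict = judge_fn(predicted, gold_answer)
        if verdict == "CORRECT":
            # The tile also contains the answer: a false negative, so drop it.
            continue
        # Assumption: WRONG and CANNOT_ANSWER are both kept as hard negatives.
        kept.append(tile)
    return kept


# Stand-ins for the vision-language model calls (toy data only).
def toy_answer_fn(query, tile):
    toy_readings = {
        "tile_duplicate_fact": "ANSWER_X",
        "tile_adjacent_topic": "CANNOT_ANSWER",
        "tile_misleading": "ANSWER_Y",
    }
    return toy_readings[tile]


def toy_judge_fn(predicted, gold_answer):
    if predicted == "CANNOT_ANSWER":
        return "CANNOT_ANSWER"
    return "CORRECT" if predicted == gold_answer else "WRONG"


hard_negatives = mine_hard_negatives(
    "toy query",
    "ANSWER_X",
    ["tile_duplicate_fact", "tile_adjacent_topic", "tile_misleading"],
    toy_answer_fn,
    toy_judge_fn,
)
print(hard_negatives)
```

The point of the filter is that a top-ranked neighbor which happens to contain the same fact would otherwise be pushed away from the query during contrastive training, even though it is a valid retrieval target.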
Original abstract

Augmenting large language models (LLMs) with retrieved web text has become a dominant paradigm, yet the web is not natively textual: existing systems depend on complex parsing pipelines that linearize HTML and discard layout, visual structure, and formatting. We introduce PixelRAG, a new retrieval-augmented method that represents websites in their native visual form and performs retrieval and reading entirely in pixel space, enabling an end-to-end architecture that eliminates text abstraction. PixelRAG is, to our knowledge, the first pipeline to operate over a full Wikipedia corpus in this form, scaling to a datastore of 30 million screenshot images with an efficient visual retrieval index. Built on an existing visual embedding model (i.e., Qwen3-VL-Embedding), PixelRAG further fine-tunes this model on screenshot data with carefully curated contrastive training data. Retrieved screenshots are then fed directly as pixel inputs to a VLM, without intermediate text conversion. PixelRAG consistently outperforms both no-retrieval and text-based RAG baselines, most surprisingly on widely studied text-centric tasks such as NQ and SimpleQA. It also achieves strong gains on multimodal open-domain QA (e.g., MMSearch), benchmarks over noisy news corpora (e.g., LiveVQA), and agentic benchmarks (e.g., MoNaCo), improving accuracy by up to 18.1% over text-based baselines. Finally, pixel representations enable a new efficiency lever for RAG through image compression, achieving up to 3x token cost reduction at lower resolutions while maintaining accuracy. Our results challenge the necessity of text representations in web retrieval, suggesting that web RAG can operate directly in the web's native visual form while improving both performance and efficiency.
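
The abstract mentions contrastive fine-tuning without stating the objective. A common choice for this kind of training is the InfoNCE loss, and the sketch below uses it purely as an assumption; the page does not confirm that PixelRAG uses this exact form. Given the similarity between a query and its positive tile, plus similarities to a set of negative tiles, the loss is the negative log-probability of the positive under a softmax over all candidates. The temperature value is a placeholder.

```python
import math


def info_nce_loss(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE loss for one query: -log softmax probability of the positive.

    sim_pos:  similarity between the query and its positive tile
    sim_negs: similarities between the query and negative tiles
    temperature: placeholder value, not taken from the paper
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


easy = info_nce_loss(0.8, [0.1, 0.0, -0.2])   # negatives far from the query
hard = info_nce_loss(0.8, [0.75, 0.7, 0.6])   # negatives close to the query
print(easy < hard)
```

Negatives that score close to the positive produce a larger loss and therefore a stronger training signal, which is the usual motivation for mining hard negatives rather than sampling negatives at random.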

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PixelRAG, a RAG pipeline that indexes and retrieves web content as native screenshots (pixels) rather than parsed text. It fine-tunes Qwen3-VL-Embedding on curated contrastive screenshot pairs, scales the index to 30 million Wikipedia screenshots, feeds retrieved images directly to a VLM, and reports consistent accuracy gains over no-retrieval and text-based RAG baselines (up to 18.1%) on text-centric tasks (NQ, SimpleQA), multimodal QA (MMSearch), noisy news (LiveVQA), and agentic benchmarks (MoNaCo). It also claims efficiency gains via image compression that reduce token cost by up to 3x.

Significance. If the central empirical claim survives controls that isolate the pixel representation from model-capacity and fine-tuning differences, the result would be significant: it would challenge the long-standing assumption that text linearization is necessary or optimal for web RAG and would demonstrate that a purely visual pipeline can improve both accuracy and efficiency at web scale. The reported scaling to a 30-million-image datastore and the compression-based token reduction are concrete strengths that would be of immediate practical interest.

major comments (2)
  1. [Abstract / PixelRAG construction paragraph] In the abstract's description of PixelRAG construction and fine-tuning, and in the baseline description, the headline claim that screenshot retrieval outperforms text RAG 'because of the native visual representation' is not isolated. PixelRAG fine-tunes Qwen3-VL-Embedding on contrastive screenshot pairs and uses a VLM reader, while the text baselines are described only as 'standard text-embedding pipelines', with no indication that they employ the same base model or equivalent contrastive fine-tuning. Consequently the observed lifts (including the headline gain of up to 18.1%) cannot yet be attributed to the pixel format rather than to differences in model capacity, pre-training, or training data.
  2. [Experimental section] Where baselines and ablations are presented, no ablation is described that holds the embedding model and fine-tuning procedure fixed while varying only the input representation (pixels vs. text). Without such a controlled comparison the central causal claim remains under-supported.
minor comments (1)
  1. The abstract states that PixelRAG is 'the first pipeline to operate over a full Wikipedia corpus in this form'; a brief related-work paragraph clarifying how prior visual-retrieval or screenshot-based systems differ would strengthen the novelty claim.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the current experiments do not fully isolate the contribution of the pixel representation. We respond point-by-point below and indicate planned changes to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / PixelRAG construction paragraph] In the abstract's description of PixelRAG construction and fine-tuning, and in the baseline description, the headline claim that screenshot retrieval outperforms text RAG 'because of the native visual representation' is not isolated. PixelRAG fine-tunes Qwen3-VL-Embedding on contrastive screenshot pairs and uses a VLM reader, while the text baselines are described only as 'standard text-embedding pipelines', with no indication that they employ the same base model or equivalent contrastive fine-tuning. Consequently the observed lifts (including the headline gain of up to 18.1%) cannot yet be attributed to the pixel format rather than to differences in model capacity, pre-training, or training data.

    Authors: We agree that the manuscript does not isolate the pixel representation from model capacity and fine-tuning differences. The text baselines follow common practice in the RAG literature but are not matched to the same base model or contrastive procedure. In the revision we will (1) explicitly name the text embedding models used, (2) add a limitations paragraph discussing this confound, and (3) tone down causal language attributing gains solely to pixels. We cannot retroactively change the existing experiments without new runs. revision: yes

  2. Referee: [Experimental section] Where baselines and ablations are presented, no ablation is described that holds the embedding model and fine-tuning procedure fixed while varying only the input representation (pixels vs. text). Without such a controlled comparison the central causal claim remains under-supported.

    Authors: We acknowledge the absence of this controlled ablation. The contrastive fine-tuning data is constructed from screenshot pairs, making an exactly parallel text-only fine-tuning non-trivial to construct. In the revised manuscript we will add an ablation that re-uses the Qwen3-VL backbone (which accepts both modalities) with text-only inputs derived from the same pages, holding the fine-tuning objective and data scale as close as possible. If resource constraints prevent a full re-run, we will report the best feasible comparison and note the remaining gap. revision: partial
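
The ablation promised in this response amounts to two experimental arms that differ in exactly one field. The sketch below expresses that constraint as a small configuration check. The class, field names, and the training-pair count are hypothetical placeholders chosen for illustration; only the backbone name and the pixels-versus-text contrast come from the exchange above.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RetrieverConfig:
    backbone: str
    objective: str
    train_pairs: int   # placeholder scale, not a number from the paper
    input_repr: str    # "pixels" or "text"


pixel_arm = RetrieverConfig(
    backbone="Qwen3-VL-Embedding",
    objective="contrastive",
    train_pairs=100_000,
    input_repr="pixels",
)
# The text arm copies every setting and changes only the input representation.
text_arm = replace(pixel_arm, input_repr="text")


def differing_fields(a, b):
    """Names of the config fields whose values differ between two arms."""
    da, db = asdict(a), asdict(b)
    return sorted(name for name in da if da[name] != db[name])


print(differing_fields(pixel_arm, text_arm))
```

If the list of differing fields contains anything beyond the input representation, an accuracy gap between the arms cannot be attributed to pixels alone, which is the referee's objection in compact form.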

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external baselines, not self-referential definitions or fits.

Full rationale

The paper contains no equations, derivations, or fitted parameters that are renamed as predictions. All performance claims are presented as direct empirical comparisons against external text-RAG baselines on public benchmarks (NQ, SimpleQA, MMSearch, etc.). The fine-tuning step on contrastive screenshot pairs is described as a construction choice, not a quantity that is then 'predicted' from itself. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore does not reduce to its own inputs by construction and remains falsifiable against independent text pipelines.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unverified effectiveness of the visual embedding fine-tuning step and on the assumption that screenshot pixels preserve all information needed for the downstream tasks. No free parameters or invented entities are quantified in the abstract.

axioms (1)
  • Domain assumption: Fine-tuning a visual embedding model on curated contrastive screenshot pairs yields retrieval quality superior to text-based methods for web QA tasks.
    Invoked in the description of PixelRAG construction and the source of the reported accuracy gains.

pith-pipeline@v0.9.1-grok · 5887 in / 1432 out tokens · 51540 ms · 2026-06-30T11:29:28.795207+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 33 canonical work pages · 17 internal anchors

  1. [1] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  2. [2] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. REALM: Retrieval-augmented language model pre-training. In International Conference on Machine Learning, pages 3929–3938, 2020.

  3. [3] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 6769–6781, 2020.

  4. [4] Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 874–880, 2021.

  5. [5] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

  6. [6] Xinxi Lyu, Michael Duan, Rulin Shao, Pang Wei Koh, and Sewon Min. Frustratingly simple retrieval improves challenging, reasoning-intensive benchmarks. arXiv preprint arXiv:2507.01297, 2025.

  7. [7] Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang W Koh. Scaling retrieval-based language models with a trillion-token datastore. Advances in Neural Information Processing Systems, 37:91260–91299, 2024.

  8. [8] Alex Fang, Thomas Voice, Ruoming Pang, Ludwig Schmidt, and Tom Gunter. Reusing pre-training data at test time is a compute multiplier. arXiv preprint arXiv:2511.04234, 2025.

  9. [9] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, et al. WebSailor: Navigating super-human reasoning for web agent. arXiv preprint arXiv:2507.02592, 2025.

  10. [10] Jeffrey Li, Joshua P Gardner, Doug Kang, Fangping Shi, Karanjeet Singh, Chun-Liang Li, Herumb Shandilya, David Leo Wright Hall, Oncel Tuzel, Percy Liang, et al. Beyond a single extractor: Re-thinking HTML-to-text extraction for LLM pre-training. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5836–5861, 2026.

  11. [11] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024.

  12. [12] Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in RAG systems. In Proceedings of the ACM Web Conference 2025 (WWW), 2025.

  13. [13] Adrien Barbaresi. Trafilatura: A web scraping library and command-line tool for text discovery and extraction. In Heng Ji, Jong C. Park, and Rui Xia, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 122–13...

  14. [14] Feng Wang, Zesheng Shi, Bo Wang, Nan Wang, and Han Xiao. ReaderLM-v2: Small language model for HTML to markdown and JSON. arXiv preprint arXiv:2503.01151, 2025.

  15. [15] Mengjie Liu, Jiahui Peng, Wenchang Ning, et al. Dripper: Token-efficient main HTML extraction with a lightweight LM. arXiv preprint arXiv:2511.23119, 2025.

  16. [16] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.

  17. [17] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  18. [18] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

  19. [19] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. Technical report, 2024.

  20. [20] Jinguo Zhu, Weiyun Wang, Zhe Chen, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

  21. [21] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, et al. CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs. Advances in Neural Information Processing Systems, 37:113569–113697, 2024.

  22. [22] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. In International Conference on Learning Representations, 2025.

  23. [23] Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024.

  24. [24] Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, and Maosong Sun. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. In The Thirteenth International Conference on Learning Representations (ICLR), 2025.

  25. [25] Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv preprint arXiv:2411.04952, 2024.

  26. [26] Yibo Yan, Jiahao Huo, Guanbo Feng, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Yuanhuiyi Lyu, Yu Huang, Jungang Li, et al. Unlocking multimodal document intelligence: From current triumphs to future frontiers of visual document retrieval. arXiv preprint arXiv:2602.19961, 2026.

  27. [27] Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Kening Zheng, Wei-Chieh Huang, Yi Cao, Mingdong Ou, Philip S. Yu, and Xuming Hu. CausalEmbed: Auto-regressive multi-vector generation in latent space for visual document embedding. arXiv preprint arXiv:2601.21262, 2026.

  28. [28] Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, et al. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A unified framework for state-of-the-art multimodal retrieval and ranking. arXiv preprint arXiv:2601.04720, 2026.

  29. [29] Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations, 2021.

  30. [30] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural Questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.

  31. [31] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. arXiv preprint arXiv:2411.04368, 2024.

  32. [32] Mingyang Fu, Yuyang Peng, Dongping Chen, Zetong Zhou, Benlin Liu, Yao Wan, Zhou Zhao, Philip S. Yu, and Ranjay Krishna. Seeking and updating with live visual knowledge. arXiv preprint arXiv:2504.05288, 2025.

  33. [33] Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanmin Wu, Jiayi Lei, Pengshuo Qiu, Pan Lu, Zehui Chen, Chaoyou Fu, Guanglu Song, Peng Gao, Yu Liu, Chunyuan Li, and Hongsheng Li. MMSearch: Benchmarking the potential of large models as multi-modal search engines. arXiv preprint arXiv:2409.12959, 2024.

  34. [34] Haoran Wei, Yaofeng Sun, and Yukun Li. DeepSeek-OCR: Contexts optical compression. arXiv preprint arXiv:2510.18234, 2025.

  35. [35] Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, et al. Glyph: Scaling context windows via visual-text compression. arXiv preprint arXiv:2510.17800, 2025.

  36. [36] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025.

  37. [37] Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, et al. Tongyi DeepResearch technical report. arXiv preprint arXiv:2510.24701, 2025.

  38. [38] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, et al. DataComp-LM: In search of the next generation of training sets for language models. Advances in Neural Information Processing Systems, 37:14200–14282, 2024.

  39. [39] Janek Bevendorff, Benno Stein, Matthias Hagen, and Martin Potthast. Elastic ChatNoir: Search Engine for the ClueWeb and the Common Crawl. In Leif Azzopardi, Allan Hanbury, Gabriella Pasi, and Benjamin Piwowarski, editors, Advances in Information Retrieval. 40th European Conference on IR Research (ECIR 2018), Lecture Notes in Computer Science, Berlin Heidel...

  40. [40] Ben Kurtovic and contributors. mwparserfromhell: A Python parser for MediaWiki wikicode. https://github.com/earwig/mwparserfromhell, 2026.

  41. [41]

    From text to pixel: Advancing long-context understanding in mllms.arXiv preprint arXiv:2405.14213, 2024

    Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, and William Yang Wang. From text to pixel: Advancing long-context understanding in mllms.arXiv preprint arXiv:2405.14213, 2024

  42. [42]

    Reading, Not Thinking: Understanding and Bridging the Modality Gap When Text Becomes Pixels in Multimodal LLMs

    Kaiser Sun, Xiaochuang Yuan, Hongjun Liu, Chen Zhao, Cheng Zhang, Mark Dredze, and Fan Bai. Reading, not thinking: Understanding and bridging the modality gap when text becomes pixels in multimodal llms.arXiv preprint arXiv:2603.09095, 2026

  43. [43]

    Pixelworld: How far are we from perceiving everything as pixels?

    Zhiheng Lyu, Xueguang Ma, and Wenhu Chen. Pixelworld: How far are we from perceiving everything as pixels? arXiv preprint arXiv:2501.19339, 2025

  44. [44]

    Nemotron colembed v2: Top-performing late interaction embedding models for visual document retrieval

    Gabriel de Souza P Moreira, Ronay Ak, Mengyao Xu, Oliver Holworthy, Benedikt Schifferer, Zhiding Yu, Yauhen Babakhin, Radek Osmulski, Jiarui Cai, Ryan Chesler, et al. Nemotron colembed v2: Top-performing late interaction embedding models for visual document retrieval. arXiv preprint arXiv:2602.03992, 2026

  45. [45]

    Vidore benchmark v2: Raising the bar for visual retrieval

    Quentin Macé, António Loison, and Manuel Faysse. Vidore benchmark v2: Raising the bar for visual retrieval. arXiv preprint arXiv:2505.17166, 2025

  46. [46]

    Mmdocbench: Benchmarking large vision-language models for fine-grained visual document understanding

    Fengbin Zhu, Ziyang Liu, Xiang Yao Ng, Haohui Wu, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, and Tat Seng Chua. Mmdocbench: Benchmarking large vision-language models for fine-grained visual document understanding. arXiv preprint arXiv:2410.21311, 2024

  47. [47]

    Irpapers: A visual document benchmark for scientific retrieval and question answering

    Connor Shorten, Augustas Skaburskas, Daniel M Jones, Charles Pierse, Roberto Esposito, John Trengrove, Etienne Dilocker, and Bob van Luijt. Irpapers: A visual document benchmark for scientific retrieval and question answering. arXiv preprint arXiv:2602.17687, 2026

  48. [48]

    Billion-Scale Similarity Search with GPUs

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2021

  49. [49]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

  50. [50]

    Swift: a scalable lightweight infrastructure for fine-tuning

    Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Hong Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. Swift: a scalable lightweight infrastructure for fine-tuning, 2024

  51. [51]

    Open domain question answering over tables via dense retrieval

    Jonathan Herzig, Thomas Müller, Syrine Krichene, and Julian Martin Eisenschlos. Open domain question answering over tables via dense retrieval. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics, pages 512–519, 2021

  52. [52]

    Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories

    Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3082–3092, 2023

  53. [53]

    Online language modelling data pipeline

    Tristan Thrush, Helen Ngo, Nathan Lambert, and Douwe Kiela. Online language modelling data pipeline. https://github.com/huggingface/olm-datasets, 2022

  54. [54]

    Neuml/wikipedia: Wikipedia text dataset

    NeuML. Neuml/wikipedia: Wikipedia text dataset. https://huggingface.co/datasets/NeuML/wikipedia, 2024. Text extracted from Wikipedia XML dumps via mwparserfromhell

  55. [55]

    Qwen3.5: Towards native multimodal agents

    Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qwen3.5, 2026

  56. [56]

    MoNaCo: More natural and complex questions for reasoning across dozens of documents

    Tomer Wolfson, Harsh Trivedi, Mor Geva, Yoav Goldberg, Dan Roth, Tushar Khot, Ashish Sabharwal, and Reut Tsarfaty. MoNaCo: More natural and complex questions for reasoning across dozens of documents. Transactions of the Association for Computational Linguistics, 2025

  57. [57]

    SerpApi: Google search api

    SerpApi. SerpApi: Google search api. https://serpapi.com, 2025

  58. [58]

    DS SERVE: A framework for efficient and scalable neural retrieval

    Jinjian Liu, Yichuan Wang, Xinxi Lyu, Rulin Shao, Joseph E. Gonzalez, Matei Zaharia, and Sewon Min. DS SERVE: A framework for efficient and scalable neural retrieval. In Fortieth AAAI Conference on Artificial Intelligence (AAAI), pages 41631–41633, 2026

  59. [59]

    Lanczos resampling

    Wikipedia contributors. Lanczos resampling. https://en.wikipedia.org/wiki/Lanczos_resampling, 2025. Accessed: 2026-04-29

  60. [60]

    Improved Baselines with Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023

  61. [61]

    Llama 3.2: Lightweight text and multimodal models

    Meta AI. Llama 3.2: Lightweight text and multimodal models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/, 2024

  62. [62]

    The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

    Meta AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation. https://ai.meta.com/blog/llama-4-multimodal-intelligence/, 2026. Blog post

  63. [63]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  64. [64]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  65. [65]

    Qwen3.6: Towards real world agents

    Qwen Team. Qwen3.6: Towards real world agents. https://qwen.ai/blog?id=qwen3.6/, 2026

  66. [66]

    Leann: A low-storage vector index

    Yichuan Wang, Zhifei Li, Shu Liu, Yongji Wu, Ziming Mao, Yilong Zhao, Xiao Yan, Zhiying Xu, Yang Zhou, Ion Stoica, et al. Leann: A low-storage vector index. arXiv preprint arXiv:2506.08276, 2025

  67. [67]

    Google Landmarks Dataset v2 — a large-scale benchmark for instance-level recognition and retrieval

    Tobias Weyand, André Araujo, Bingyi Cao, and Jack Sim. Google Landmarks Dataset v2 — a large-scale benchmark for instance-level recognition and retrieval. In CVPR, 2020

  68. [68]

    Introducing GPT-4.1 in the API

    OpenAI. Introducing GPT-4.1 in the API, April 2025. Blog post

  69. [69]

    Kiwix — offline reader for web content

    Kiwix Association. Kiwix — offline reader for web content. https://kiwix.org, 2007. Open-source offline browser using the ZIM archive format.

    Technical appendices and supplementary material. Contents: A System & Implementation Details; A.1 Rendering Pipeline; A.2 Datastore Fetching; A.3 Embedding Training: Data Recipe Details and Prompts; A.4 Reader...

  70. [70]

    Fetch HTML. The article HTML is served from a local Kiwix ZIM archive [69] via kiwix-serve, eliminating network latency.

  71. [71]

    Extract search keys. Distinctive phrases are extracted from the text chunk: table cell values (e.g., codes like B01AC06, numbers with units) for table-heavy chunks, mid-line prose fragments for paragraph-heavy chunks. The first line (article title) is skipped to avoid matching the <h1> heading.

  72. [72]

    Locate in DOM. Each key is searched within the text_content() of every element under the article’s mw-parser-output container. Both the key and element text are normalized (non-breaking spaces, dash variants, and diacritics are collapsed) to handle encoding mismatches between Trafilatura output and raw HTML. The tightest-matching element is selected.

  73. [73]

    Resolve to contiguous span. Each matched element is walked up to its nearest direct-child ancestor of mw-parser-output. The final result is the contiguous range of direct children from the first matched child to the last, preserving all intermediate elements (tables, paragraphs, lists) that the original text chunk spanned.

  74. [74]

    Clean and return. Inline <style>, <script>, and navigation-box (navbox) elements are stripped. The serialized HTML is returned to the reader. If no key matches in the DOM, the original flat text is used as fallback. The reader (Qwen3-VL-4B, max_model_len=65536) receives the concatenated HTML of all k=3 retrieved passages, separated by <hr> delimiters. Results...
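
    The key-matching and span-resolution steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names normalize and resolve_span are hypothetical, the set of dash variants is an assumed one, and each direct child of mw-parser-output is modelled as a plain string of its text rather than a parsed DOM element, so the "walk up to the direct-child ancestor" step is collapsed into a direct substring test on each child.

```python
import unicodedata
from typing import List, Optional, Tuple

# Dash variants mapped to an ASCII hyphen (an assumed, illustrative set).
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize(text: str) -> str:
    """Collapse non-breaking spaces, dash variants, and diacritics."""
    text = text.replace("\u00a0", " ").translate(_DASHES)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse any remaining runs of whitespace to single spaces.
    return " ".join(stripped.split())


def resolve_span(children_text: List[str], keys: List[str]) -> Optional[Tuple[int, int]]:
    """Return (first, last) indices of the direct children that bracket all matches.

    Returns None when no key matches, in which case a caller would fall
    back to the original flat text.
    """
    norm_keys = [normalize(k) for k in keys if k.strip()]
    matched = [
        i
        for i, child in enumerate(children_text)
        if any(k in normalize(child) for k in norm_keys)
    ]
    if not matched:
        return None
    return matched[0], matched[-1]


if __name__ == "__main__":
    children = [
        "Caf\u00e9\u00a0Society (1939\u20131945)",  # accented, NBSP, en dash
        "An intermediate paragraph with no key.",
        "ATC code B01AC06 appears in this table row.",
        "Trailing section.",
    ]
    keys = ["Cafe Society (1939-1945)", "B01AC06"]
    print(resolve_span(children, keys))  # (0, 2)
```

    Returning the first and last matched indices, rather than only the matched children, keeps the unmatched paragraph at index 1 inside the span, which mirrors the stated goal of preserving intermediate elements.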

  75. [75]

    SELF-CONTAINED. The question must be understandable on its own; every entity must be named explicitly. BAD: "Who composed the music for the film?" (missing film name) BAD: "On what date was Lerew awarded the DFC?" (surname only + acronym) BAD: "Which cyclist placed second in the Tempo race?" (missing event/year) BAD: "Which mission is shown in the screens...

  76. [76]

    EVIDENCE COMPLETE. The answer must be fully visible in this chunk. The source span (S:) must be a complete, untruncated sentence

  77. [77]

    DISTINCTIVE. Include enough specifics (names, dates, locations, titles) to distinguish this chunk from similar pages. ANSWER: prefer a single concise entity -- name, date, place, number, title, or short phrase. SKIP (write exactly: SKIP) if any of the following holds: - Content is a raw vote count, track listing, census table, or episode list. - The answe...

  78. [78]

    The subject is a vague pronoun or generic noun without a proper name: NO: "What was the final score of the basketball game between THE TEAM and Marquette?" ("the team" unnamed) NO: "Who directed the episode of THE TELEVISION SERIES titled 'X'?" ("the television series" unnamed) NO: "In what year did THE SUBJECT OF THE ARTICLE move to Tokyo?" ("the subject"...

  79. [79]

    The question explicitly references document structure: NO: "Which item IS LISTED IN THE TABLE as X?" NO: "What is shown IN THE INFOBOX?" NO: "According to THE PROVIDED TABLE, which..."

  80. [80]

    A role/position question where no year or identifying event is given and the role has had many holders: NO: "Who was THE CAPTAIN of HMS Defence?" (no year, hundreds of captains over centuries)

Showing first 80 references.