DocArena: Turning Raw Documents into Controllable Training Environments for Document Search Agents

Jiamian Wang; Jing Shi; Rajiv Jain; Ruiyi Zhang; Samyadeep Basu; Tong Sun; Tong Yu; Zhiqiang Tao

arxiv: 2606.26122 · v1 · pith:37WHJ2U4new · submitted 2026-05-27 · 💻 cs.CV

Pith reviewed 2026-06-29 12:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords: DocArena, document search agents, multimodal documents, automated data curation, training environments, reinforcement learning, QA pair generation, MLLM perception

The pith

Agents trained on DocArena data achieve the best retrieval accuracy and QA quality across multimodal document scenarios and text benchmarks under unified evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DocArena as a fully automated pipeline that converts raw document collections into training environments consisting of question-answer-evidence tuples for reinforcement learning of search agents. The process structures documents with MLLM visual perception, builds reasoning-intensive QA pairs by profiling cross-page information distributions, and applies cascaded MLLM quality checks, yielding the DocArena-79K dataset from 8,336 documents across 16 domains and 49 languages. A decoupled Doc-Search infrastructure lets text-based LLMs handle reasoning while perception remains separate. Experiments show agents trained on this data outperform alternatives on six multimodal document scenarios and seven text QA benchmarks when only the policy model changes. A reader would care because existing training environments have been mostly text-only and hard to scale or control for multimodal cases.

Core claim

DocArena is a fully automated data curation pipeline that structures and indexes raw documents through MLLM-based visual perception, profiles and leverages cross-page information distribution to construct reasoning-intensive QA pairs, and performs cascaded quality assurance operations via MLLM. It produces DocArena-79K with QA pairs from 8,336 documents spanning 16 domains and 49 languages. The accompanying Doc-Search agent infrastructure decouples visual perception from the policy model so text-based LLMs can act as the reasoning backbone. Under a unified evaluation framework where only the policy model differs, agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality.
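The decoupling described above can be illustrated with a minimal sketch of a multi-turn search loop in which the policy only ever sees text. This is not the authors' implementation: the function name `run_doc_search`, the callables `policy`, `retrieve`, and `read_page`, and the `max_turns` default are all hypothetical. The tag names (`<think>`, `<search>`, `<information>`, `<answer>`) are taken from the Figure 3 and Figure 4 material; how the real system parses them is an assumption.

```python
import re
from typing import Callable, List


def run_doc_search(
    question: str,
    policy: Callable[[str], str],          # text-only LLM: context -> next step
    retrieve: Callable[[str], List[int]],  # query -> ranked page ids (hypothetical retriever)
    read_page: Callable[[int], str],       # page id -> page text (hypothetical OCR step)
    max_turns: int = 4,
) -> str:
    """Run a multi-turn loop in which perception stays outside the policy."""
    context = f"Question: {question}\n"
    for _ in range(max_turns):
        step = policy(context)
        context += step + "\n"
        answer = re.search(r"<answer>(.*?)</answer>", step, re.S)
        if answer:
            return answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", step, re.S)
        if query:
            # Perception happens here: pages are retrieved and turned into text.
            page_ids = retrieve(query.group(1).strip())
            text = "\n".join(read_page(page_id) for page_id in page_ids)
            context += f"<information>{text}</information>\n"
    return ""  # search budget exhausted without an answer


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_policy(context: str) -> str:
    if "<information>" not in context:
        return "<think>I need the revenue page.</think><search>2023 revenue</search>"
    return "<think>Found it.</think><answer>4.2M</answer>"


toy_pages = {3: "Revenue in 2023 was 4.2M."}
result = run_doc_search(
    question="What was the 2023 revenue?",
    policy=toy_policy,
    retrieve=lambda query: [3],
    read_page=lambda page_id: toy_pages[page_id],
)
print(result)  # prints 4.2M
```

Because the retriever and page reader are passed in as plain callables, swapping the policy while holding the rest fixed mirrors the "only the policy model differs" evaluation setup in spirit.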

What carries the argument

The DocArena pipeline, which automates creation of controllable (question, answer, evidence) training tuples from raw multimodal documents via MLLM perception and cross-page profiling.
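As a rough illustration of the tuple format and of cross-page profiling, the sketch below counts on how many pages each factual unit appears and keeps the page-exclusive ones. It assumes that the distribution width w(c) mentioned in the Figure 2 and Figure 7 captions is the number of pages containing unit c; the paper's exact definition is not given in this summary. The names `SearchTuple`, `distribution_width`, and `exclusive_units` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass(frozen=True)
class SearchTuple:
    """One training sample: no intermediate trajectory, only the end points."""
    question: str
    answer: str
    evidence_pages: Tuple[int, ...]


def distribution_width(page_units: Dict[int, Set[str]]) -> Dict[str, int]:
    """Map each factual unit to the number of pages it appears on (assumed w(c))."""
    pages_per_unit = defaultdict(set)
    for page, units in page_units.items():
        for unit in units:
            pages_per_unit[unit].add(page)
    return {unit: len(pages) for unit, pages in pages_per_unit.items()}


def exclusive_units(page_units: Dict[int, Set[str]]) -> Dict[int, Set[str]]:
    """Keep, per page, only the units with width 1 (found on no other page)."""
    widths = distribution_width(page_units)
    return {
        page: {unit for unit in units if widths[unit] == 1}
        for page, units in page_units.items()
    }


# Toy document with three pages (hypothetical content).
toy_units = {
    1: {"company name", "2023 revenue"},
    2: {"company name", "ceo"},
    3: {"company name"},
}
widths = distribution_width(toy_units)
exclusive = exclusive_units(toy_units)
sample = SearchTuple("Who is the CEO?", "unknown in this toy example", (2,))
```

In this toy case "company name" has width 3 and is a poor basis for a question, since any page answers it, while "ceo" has width 1 and pins the evidence to page 2.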

If this is right

Search agents develop more effective strategies and better generalization because the training tuples are reasoning-intensive and multimodal.
The decoupled infrastructure allows text LLMs to serve as effective reasoning backbones even for visual document tasks.
Training environments become scalable and controllable without needing expert trajectories or human annotation.
Performance advantages appear consistently on both retrieval accuracy and downstream QA quality metrics.
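For readers unfamiliar with the metric families named above, here is a minimal sketch of exact match, token-level F1, and page-level recall. The summary does not state the paper's exact metric definitions, so the answer normalization below follows the widely used SQuAD-style convention as an assumption, and all function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def page_recall(retrieved: Iterable[int], evidence: Iterable[int]) -> float:
    """Fraction of gold evidence pages that the agent retrieved."""
    evidence_set = set(evidence)
    if not evidence_set:
        return 0.0
    return len(evidence_set & set(retrieved)) / len(evidence_set)


em = exact_match("The Eiffel Tower.", "eiffel tower")
f1 = token_f1("eiffel tower paris", "eiffel tower")
recall = page_recall(retrieved=[3, 7, 9], evidence=[7, 12])
```

Separating `page_recall` from the answer metrics matters for the claim above: an agent can answer correctly without retrieving the evidence, or retrieve it and still answer wrongly, so the two should be reported side by side.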

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pipeline structure could be adapted to generate training environments for agents that operate on other structured data sources such as scientific papers or legal records.
The controllability of the generated environments opens the possibility of systematically varying properties like reasoning depth or language distribution to study their effects on agent behavior.
If the MLLM steps prove robust across new document collections, the approach could reduce dependence on manually curated datasets for training specialized retrieval agents.

Load-bearing premise

MLLM-based visual perception, cross-page profiling, and cascaded quality assurance can reliably generate QA pairs that accurately reflect the real information distribution in raw documents without systematic biases or errors.

What would settle it

A direct replication of the unified evaluation experiments in which agents trained on DocArena data do not rank first on retrieval accuracy and QA quality, or a manual audit that finds frequent mismatches between generated QA pairs and the actual content of the source documents.

Figures

Figures reproduced from arXiv: 2606.26122 by Jiamian Wang, Jing Shi, Rajiv Jain, Ruiyi Zhang, Samyadeep Basu, Tong Sun, Tong Yu, Zhiqiang Tao.

**Figure 1.** Illustration of the training data used for RL training of search agents. Each sample is a triplet (question, answer, evidence) without intermediate trajectories (left). Constructing a high-quality training environment is non-trivial, as it must simultaneously satisfy multiple retrieval-related quality dimensions (right).

**Figure 2.** Overview of the proposed DocArena data curation pipeline. Stage 1 converts raw PDFs into page images, MLLM-extracted structured text, and a dense retrieval index. Stage 2 profiles cross-page information distribution and identifies irreplaceable evidence (w=1) to guarantee evidence exclusivity. Stage 3 constructs diverse, reasoning-intensive QA pairs grounded in the distribution profile with template-cont…

**Figure 3.** Dataset statistics of DOCARENA. Top row: distributions of evidence pages per question, modality elements per page, question length, and answer length. Bottom row: distributions of document type, content domain, language, and modality combination.

**Figure 4.** Overview of the Doc-Search agent infrastructure. The system addresses multimodal document retrieval and QA tasks. We adopt a multimodal retriever (ColPali), an OCR tool, and an LLM-based policy model for multi-turn interaction, which decouples the visual perception from the policy model and allows different policy models under identical system configurations. During training (top), the policy interacts wit…

**Figure 5.** Search turn discussion on multi-page scenarios, i.e., MMLongBench-Doc (MP) and SlideVQA at top. (1) On both datasets, search performance (Recall) scales consistently with both training (different curve colors) and test-time (from left to right) search budgets. (2) For each training-time max search turn (different red colors), we compute the QA improvement (EM gain) toward the smallest testing-time …

**Figure 6.** Cascaded filtering funnel of the DocArena curation pipeline. From 16,156 candidate seeds, each gate progressively filters low-quality samples, yielding 250 valid QA pairs (1.55% yield rate). Red annotations indicate the number and percentage of samples rejected at each gate.

**Figure 7.** Left: Reasoning template distribution among valid QA pairs. Right: Distribution width w(c) of factual units. 98.2% of units are exclusive to a single page (w=1).

**Figure 8.** Left: MLLM call count per seed page for successful (green) vs. failed (blue) generations. Successful seed pages require more calls (17.2 mean) as they pass through all cascaded gates. Right: Per-seed-page processing time. The spike near 0s reflects seed pages rejected by the retrieval pre-filter without any MLLM call. Dashed lines indicate means.

**Figure 9.** Left: Per-seed-page processing time by pipeline stage for successful (green) and failed (blue) generations. Retrieval takes < 0.1s. Right: Proportion of each reasoning template within each outcome group (blue: failed; green: successful), with yield rate per template (red line, right axis). Both success/fail distributions and yield rates (1.8–2.6%) are balanced across templates.

**Figure 10.** Distribution of QA pairs per document in DocArena-79K (mean 9.6, median 7).

**Figure 11.** Data scaling analysis on MMLongBench-Doc MP. Left: F1 by evidence page count. Middle: F1/Precision on figure-source queries. Right: F1 on low-EM (≤0.5) multi-evidence queries.

**Figure 12.** Data scaling analysis on SlideVQA MP. Left: F1 and Precision scaling. Middle: EM on queries where the full-data agent achieves EM>0.5 (partially solved queries at the boundary of the agent's capability, where more training data is more likely to make a difference). Right: EM on queries with partial retrieval recall at the 25% data level.

**Figure 13.** Data scaling on text-based QA benchmarks. Average EM across seven benchmarks (Natural Questions, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, MuSiQue, Bamboogle).
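The cascaded filtering funnel reported for Figure 6 can be sketched as a sequence of gates applied in order, with per-gate rejection counts recorded. The gates and candidate fields below are hypothetical placeholders, since the summary does not list the actual gates; only the totals (16,156 seeds, 250 valid pairs) come from the caption.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Candidate = Dict[str, object]
Gate = Tuple[str, Callable[[Candidate], bool]]


def run_cascade(
    candidates: Iterable[Candidate], gates: List[Gate]
) -> Tuple[List[Candidate], Dict[str, int]]:
    """Apply gates in order; a rejected candidate never reaches later gates."""
    survivors = list(candidates)
    rejected: Dict[str, int] = {}
    for name, gate in gates:
        kept = [candidate for candidate in survivors if gate(candidate)]
        rejected[name] = len(survivors) - len(kept)
        survivors = kept
    return survivors, rejected


def yield_rate(valid: int, seeds: int) -> float:
    """Percentage of seeds that survive the whole cascade."""
    return 100.0 * valid / seeds


# Hypothetical gates and candidates, for illustration only.
toy_gates: List[Gate] = [
    ("non_empty_answer", lambda c: bool(c["answer"])),
    ("grounded", lambda c: bool(c["grounded"])),
]
toy_candidates: List[Candidate] = [
    {"answer": "42", "grounded": True},
    {"answer": "", "grounded": True},
    {"answer": "7", "grounded": False},
]
survivors, rejected = run_cascade(toy_candidates, toy_gates)

# Totals taken from the Figure 6 caption.
reported_yield = round(yield_rate(250, 16156), 2)  # 1.55
```

Ordering cheap gates before expensive ones keeps cost down, which is consistent with the Figure 8 observation that failed seed pages use fewer MLLM calls than successful ones.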

Original abstract

Recent methods train search agents via reinforcement learning from (question, answer, evidence) tuples without requiring expert trajectories. The tuples serve as the training environment, and whose properties directly shape what search strategies and generalization abilities the agent can develop. While prior works have made encouraging progress in improving training data quality, existing environments remain predominantly text-based and existing approaches can struggle to construct training environments that are controllable, scalable, and account for multimodal data. Given this, we propose DocArena, a fully automated data curation pipeline building on the practical need for multimodal document search and question-answering. It transforms raw document collections into training environments for search agents without any human annotation. The pipeline first structures and indexes documents through MLLM-based visual perception, then profiles and leverage the cross-page information distribution to construct reasoning-intensive QA pairs, as well as performs cascaded quality assurance operations via MLLM. We introduce DocArena-79K with QA pairs from 8,336 documents spanning 16 domains and 49 languages. We further design a Doc-Search agent infrastructure that decouples visual perception from the policy model, allowing text-based LLMs to serve as the reasoning backbone for multimodal document retrieval and QA. Under a unified evaluation framework where only the policy model differs, experiments on six multimodal document scenarios and seven text-based QA benchmarks show that agents trained on DocArena data achieve the best performance on both retrieval accuracy and QA quality. Further analysis on agent search behaviors confirms the effectiveness and controllability of the constructed training environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DocArena gives a workable automated pipeline for turning raw multimodal docs into agent training tuples, with a clean unified eval showing gains, but the data quality validation looks thin.

DocArena builds an MLLM pipeline that indexes documents visually, pulls cross-page info to make reasoning-heavy QA pairs, runs cascaded checks, and outputs a 79K dataset from 8,336 docs across 16 domains and 49 languages. The key move is decoupling perception from the policy so text LLMs can run the search agent on multimodal material.

What stands out is the scale and the no-human-annotation claim, plus the evaluation design: one framework, six multimodal scenarios, seven text QA benchmarks, only the policy model changes, and the DocArena-trained agents come out on top for retrieval accuracy and QA quality. That isolates the data effect better than most prior tuple-based RL setups.

The soft spot is the lack of visible checks on whether the generated tuples actually match real document distributions. The abstract and stress-test note both flag that MLLM perception, profiling, and QA construction could introduce undetected biases or errors, yet no error rates, human validation percentages, or distributional comparisons appear in the provided summary. If those are missing from the full paper too, the performance edge could partly reflect artifacts rather than better search strategies.

This is for groups working on document agents, multimodal search, or scalable RL environments. It supplies a concrete method and dataset they can test directly. The citation pattern follows the RL-from-tuples line without obvious gaps.

I would send it to peer review. The engineering is practical and the eval setup is reasonable, even if the validation section needs strengthening.

Referee Report

2 major / 2 minor

Summary. The paper introduces DocArena, a fully automated MLLM-based pipeline that converts raw multimodal document collections into controllable (question, answer, evidence) training environments for RL-trained search agents without human annotation. It structures documents via visual perception, profiles cross-page information to generate reasoning-intensive QA pairs, applies cascaded MLLM quality assurance, and releases DocArena-79K (from 8,336 documents across 16 domains and 49 languages). A Doc-Search agent decouples visual perception from the policy model (allowing text LLMs as backbone). Under a unified framework isolating the policy, agents trained on DocArena data outperform baselines on six multimodal document scenarios and seven text-based QA benchmarks; further analysis examines search behaviors.

Significance. If the generated QA tuples are free of systematic MLLM artifacts, the work provides a scalable, annotation-free route to multimodal document search training data and a clean evaluation protocol that isolates policy effects. The fully automated nature, cross-lingual/domain coverage, and decoupling of perception from reasoning are concrete strengths that could accelerate progress in agent-based document retrieval.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): the central performance claim—that DocArena-trained agents achieve the best retrieval accuracy and QA quality—rests on the assumption that the MLLM pipeline produces unbiased (Q,A,E) tuples matching real document distributions. No quantitative error rates, human validation percentages, or distributional comparisons (e.g., answer-evidence alignment statistics) are reported, leaving open the possibility that reported gains reflect data artifacts rather than improved search strategies.
[§3.2] §3.2 (QA pair construction): the cross-page profiling and cascaded quality assurance steps are described at a high level but lack concrete metrics (e.g., rejection rates per cascade stage, inter-MLLM agreement, or ablation on perception accuracy) that would demonstrate controllability and absence of selection bias across the six multimodal scenarios.

minor comments (2)

[Abstract] Abstract, sentence 2: the phrasing “The tuples serve as the training environment, and whose properties” is grammatically awkward and should be revised for clarity.
[§4] The manuscript would benefit from an explicit table listing the exact baselines, metrics (e.g., retrieval@K, QA F1), and statistical significance tests used in the unified evaluation framework.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on the validation of our MLLM-generated training data and the need for additional concrete metrics. We address each major comment below.

Referee: [Abstract and §4] Abstract and §4 (Experiments): the central performance claim—that DocArena-trained agents achieve the best retrieval accuracy and QA quality—rests on the assumption that the MLLM pipeline produces unbiased (Q,A,E) tuples matching real document distributions. No quantitative error rates, human validation percentages, or distributional comparisons (e.g., answer-evidence alignment statistics) are reported, leaving open the possibility that reported gains reflect data artifacts rather than improved search strategies.

Authors: We agree that direct quantitative validation of the generated tuples would strengthen the central claim. The unified evaluation framework (isolating the policy model across both multimodal document scenarios and text-based QA benchmarks) provides supporting evidence that gains arise from improved search strategies rather than artifacts alone, since text-only benchmarks are unlikely to be influenced by multimodal-specific MLLM biases. We will revise §4 to incorporate available pipeline metrics such as rejection rates from cascaded quality assurance and inter-MLLM consistency statistics. However, human validation percentages were not collected to preserve the fully automated design; we will add an explicit discussion of this limitation and its implications for interpreting the results. revision: partial
Referee: [§3.2] §3.2 (QA pair construction): the cross-page profiling and cascaded quality assurance steps are described at a high level but lack concrete metrics (e.g., rejection rates per cascade stage, inter-MLLM agreement, or ablation on perception accuracy) that would demonstrate controllability and absence of selection bias across the six multimodal scenarios.

Authors: We will expand §3.2 with the requested metrics, including rejection rates per cascade stage, inter-MLLM agreement rates, and an ablation on perception accuracy components, using data from our pipeline execution logs. These additions will more clearly demonstrate controllability and help address concerns about selection bias. revision: yes

standing simulated objections not resolved

Human validation percentages for the (Q,A,E) tuples remain unreported, because obtaining them would require manual annotation, contrary to the fully automated pipeline design.

Circularity Check

0 steps flagged

No circularity: empirical pipeline and benchmark comparisons

The paper describes an automated MLLM-based pipeline to generate QA pairs from raw documents and reports empirical results showing superior agent performance under a unified evaluation where only the policy model varies. No equations, derivations, fitted parameters presented as predictions, or self-citations appear in the provided text. The central claims rest on external benchmark comparisons rather than reducing to self-definitional inputs or load-bearing self-references. The derivation chain is self-contained as a standard data-generation-plus-evaluation workflow.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements are unknown.

Reference graph

Works this paper leans on

74 extracted references · 30 canonical work pages · 13 internal anchors

[1]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. In: arXiv preprint arXiv:2502.13923 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

In: CVPR (2025)

Caffagni, D., Sarto, S., Cornia, M., Baraldi, L., Cucchiara, R.: Recurrence-enhanced vision- and-language transformers for robust multimodal document retrieval. In: CVPR (2025)

2025
[3]

arXiv preprint arXiv:2505.19683 (2025)

Cao, P., Men, T., Liu, W., Zhang, J., Li, X., Lin, X., Sui, D., Cao, Y ., Liu, K., Zhao, J.: Large language models for planning: A comprehensive and systematic survey. arXiv preprint arXiv:2505.19683 (2025)

work page arXiv 2025
[4]

In: arXiv preprint arXiv:2508.07493 (2025)

Chen, J., Li, M., Kil, J., Wang, C., Yu, T., Rossi, R., Zhou, T., Chen, C., Zhang, R.: Visr- bench: An empirical study on visual retrieval-augmented generation for multilingual long document understanding. In: arXiv preprint arXiv:2508.07493 (2025)

work page arXiv 2025
[5]

In: CoRR (2024)

Chen, J., Zhang, R., Zhou, Y ., Rossi, R., Gu, J., Chen, C.: Mmr: Evaluating reading ability of large multimodal models. In: CoRR (2024)

2024
[6]

In: ICLR (2025)

Chen, J., Zhang, R., Zhou, Y ., Yu, T., Dernoncourt, F., Gu, J., Rossi, R.A., Chen, C., Sun, T.: Sv-rag: Lora-contextualizing adaptation of mllms for long document understanding. In: ICLR (2025)

2025
[7]

In: CVPR (2025)

Chen, J., Xu, D., Fei, J., Feng, C.M., Elhoseiny, M.: Document haystacks: Vision-language reasoning over piles of 1000+ documents. In: CVPR (2025)

2025
[8]

NeurIPS (2025)

Chen, M., Sun, L., Li, T., Sun, H., Zhou, Y ., Zhu, C., Wang, H., Pan, J.Z., Zhang, W., Chen, H., et al.: Learning to reason with search for llms via reinforcement learning. NeurIPS (2025)

2025
[9]

arXiv preprint arXiv:2411.04952 (2024)

Cho, J., Mahata, D., Irsoy, O., He, Y ., Bansal, M.: M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv preprint arXiv:2411.04952 (2024)

work page arXiv 2024
[10]

arXiv preprint arXiv:2602.14234 (2026) 32 J

Chu, Z., Wang, X., Hong, J., Fan, H., Huang, Y ., Yang, Y ., Xu, G., Zhao, C., Xiang, C., Hu, S., et al.: Redsearcher: A scalable and cost-efficient framework for long-horizon search agents. arXiv preprint arXiv:2602.14234 (2026) 32 J. Wang et al

work page arXiv 2026
[11]

arXiv preprint arXiv:2510.12979 (2025)

Fan, W., Yao, W., Li, Z., Yao, F., Liu, X., Qiu, L., Yin, Q., Song, Y ., Yin, B.: Deepplanner: Scaling planning capability for deep research agents via advantage shaping. arXiv preprint arXiv:2510.12979 (2025)

work page arXiv 2025
[12]

In: ICLR (2025)

Faysse, M., Sibille, H., Wu, T., Omrani, B., Viaud, G., Hudelot, C., Colombo, P.: Colpali: Efficient document retrieval with vision language models. In: ICLR (2025)

2025
[13]

In: arXiv preprint arXiv:2508.07976 (2025)

Gao, J., Fu, W., Xie, M., Xu, S., He, C., Mei, Z., Zhu, B., Wu, Y .: Beyond ten turns: Un- locking long-horizon agentic search with large-scale asynchronous rl. In: arXiv preprint arXiv:2508.07976 (2025)

work page arXiv 2025
[14]

arXiv preprint arXiv:2504.04736 (2025)

Goldie, A., Mirhoseini, A., Zhou, H., Cai, I., Manning, C.D.: Synthetic data generation & multi-step rl for reasoning & tool use. arXiv preprint arXiv:2504.04736 (2025)

work page arXiv 2025
[15]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. In: arXiv preprint arXiv:2501.12948 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

In: EMNLP (2024)

Han, R., Zhang, Y ., Qi, P., Xu, Y ., Wang, J., Liu, L., Wang, W.Y ., Min, B., Castelli, V .: Rag-qa arena: Evaluating domain robustness for long-form retrieval augmented question answering. In: EMNLP (2024)

2024
[17]

CoRR (2025)

Han, S., Xia, P., Zhang, R., Sun, T., Li, Y ., Zhu, H., Yao, H.: Mdocagent: A multi-modal multi-agent framework for document understanding. CoRR (2025)

2025
[18]

In: COLING (2020)

Ho, X., Nguyen, A.K.D., Sugawara, S., Aizawa, A.: Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In: COLING (2020)

2020
[19]

Hu, T., Zhao, Y ., Zhang, C., Cohan, A., Zhao, C.: Sage: Benchmarking and improving re- trieval for deep research agents (2026)

2026
[20]

In: arXiv preprint arXiv:2505.07596 (2025)

Huang, Z., Yuan, X., Ju, Y ., Zhao, J., Liu, K.: Reinforced internal-external knowledge syn- ergistic reasoning for efficient adaptive search agent. In: arXiv preprint arXiv:2505.07596 (2025)

work page arXiv 2025
[21]

In: arXiv preprint arXiv:2505.15117 (2025)

Jin, B., Yoon, J., Kargupta, P., Arik, S.O., Han, J.: An empirical study on reinforcement learning for reasoning-search interleaved llm agents. In: arXiv preprint arXiv:2505.15117 (2025)

work page arXiv 2025
[22]

In: COLM (2025)

Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., Zamani, H., Han, J.: Search-r1: Training llms to reason and leverage search engines with reinforcement learning. In: COLM (2025)

2025
[23]

In: ACL (2017)

Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In: ACL (2017)

2017
[24]

In: TACL

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Devlin, J., Lee, K., et al.: Natural questions: a benchmark for question answering research. In: TACL. pp. 453–466 (2019)

2019
[25]

In: EMNLP (2025)

Lee, J., Kwon, D., Jin, K.: Grade: Generating multi-hop qa and fine-grained difficulty matrix for rag evaluation. In: EMNLP (2025)

2025
[26]

arXiv preprint arXiv:2506.01710 (2025)

Lei, F., Meng, J., Huang, Y ., Chen, T., Zhang, Y ., He, S., Zhao, J., Liu, K.: Reasoning- table: Exploring reinforcement learning for table reasoning. arXiv preprint arXiv:2506.01710 (2025)

work page arXiv 2025
[27]

In: COLING (2020)

Li, M., Xu, Y ., Cui, L., Huang, S., Wei, F., Li, Z., Zhou, M.: Docbank: A benchmark dataset for document layout analysis. In: COLING (2020)

2020
[28]

WebThinker: Empowering Large Reasoning Models with Deep Research Capability

Li, X., Jin, J., Dong, G., Qian, H., Zhu, Y ., Wu, Y ., Wen, J.R., Dou, Z.: Webthinker: Empowering large reasoning models with deep research capability. In: arXiv preprint arXiv:2504.21776 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

In: CVPR (2025)

Liao, W., Wang, J., Li, H., Wang, C., Huang, J., Jin, L.: Doclayllm: An efficient multi-modal extension of large language models for text-rich document understanding. In: CVPR (2025)

2025
[30]

In: CoRR (2024) DocArena 33

Liu, Y ., Yang, B., Liu, Q., Li, Z., Ma, Z., Zhang, S., Bai, X.: Textmonkey: An ocr-free large multimodal model for understanding document. In: CoRR (2024) DocArena 33

2024
[31]

In: AAAI (2025)

Livathinos, N., Auer, C., Lysak, M., Nassar, A., Dolfi, M., Vagenas, P., Ramis, C.B., Omenetti, M., Dinkla, K., Kim, Y ., et al.: Docling: An efficient open-source toolkit for ai- driven document conversion. In: AAAI (2025)

2025
[32]

In: arXiv preprint arXiv:2505.16282 (2025)

Lu, F., Zhong, Z., Liu, S., Fu, C.W., Jia, J.: Arpo: End-to-end policy optimization for gui agents with experience replay. In: arXiv preprint arXiv:2505.16282 (2025)

work page arXiv 2025
[33]

In: NeurIPS (2024)

Ma, Y ., Zang, Y ., Chen, L., Chen, M., Jiao, Y ., Li, X., Lu, X., Liu, Z., Ma, Y ., Dong, X., et al.: Mmlongbench-doc: Benchmarking long-context document understanding with visual- izations. In: NeurIPS (2024)

2024
[34]

In: ACL (2023)

Mallen, A., Asai, A., Zhong, V ., Das, R., Khashabi, D., Hajishirzi, H.: When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In: ACL (2023)

2023
[35]

In: arXiv preprint arXiv:2505.16582 (2025)

Mei, J., Hu, T., Fu, D., Wen, L., Yang, X., Wu, R., Cai, P., Cai, X., Gao, X., Yang, Y ., et al.: O2-searcher: A searching-based agent model for open-domain open-ended question answering. In: arXiv preprint arXiv:2505.16582 (2025)

work page arXiv 2025
[36]

In: ICLR (2026)

Miroyan, M., Wu, T.H., King, L., Li, T., Pan, J., Hu, X., Chiang, W.L., Angelopoulos, A.N., Darrell, T., Norouzi, N., Gonzalez, J.E.: Search arena: Analyzing search-augmented llms. In: ICLR (2026)

2026
[37]

In: CVPR (2025)

Ouyang, L., Qu, Y ., Zhou, H., Zhu, J., Zhang, R., Lin, Q., Wang, B., Zhao, Z., Jiang, M., Zhao, X., et al.: Omnidocbench: Benchmarking diverse pdf document parsing with compre- hensive annotations. In: CVPR (2025)

2025
[38]

In: EMNLP

Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N.A., Lewis, M.: Measuring and narrow- ing the compositionality gap in language models. In: EMNLP. pp. 5687–5711 (2023)

2023
[39]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y ., Wu, Y ., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. In: arXiv preprint arXiv:2402.03300 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

HybridFlow: A Flexible and Efficient RLHF Framework

Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y ., Lin, H., Wu, C.: Hy- bridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

In: NeurIPS (2025)

Shi, Y ., Li, S., Wu, C., Liu, Z., Fang, J., Cai, H., Zhang, A., Wang, X.: Search and refine during think: Autonomous retrieval-augmented reasoning of llms. In: NeurIPS (2025)

2025
[42]

R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning

Song, H., Jiang, J., Min, Y ., Chen, J., Chen, Z., Zhao, W.X., Fang, L., Wen, J.R.: R1-searcher: Incentivizing the search capability in llms via reinforcement learning. In: arXiv preprint arXiv:2503.05592 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

NeurIPS (2025)

Stojanovski, Z., Stanley, O., Sharratt, J., Jones, R., Adefioye, A., Kaddour, J., Köpf, A.: Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. NeurIPS (2025)

2025
[44]

ZeroSearch: Incentivize the Search Capability of LLMs without Searching

Sun, H., Qiao, Z., Guo, J., Fan, X., Hou, Y ., Jiang, Y ., Xie, P., Zhang, Y ., Huang, F., Zhou, J.: Zerosearch: Incentivize the search capability of llms without searching. In: arXiv preprint arXiv:2505.04588 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

In: CVPR (2025)

Tanaka, R., Iki, T., Hasegawa, T., Nishida, K., Saito, K., Suzuki, J.: Vdocrag: Retrieval- augmented generation over visually-rich documents. In: CVPR (2025)

2025
[46]

In: AAAI (2023)

Tanaka, R., Nishida, K., Nishida, K., Hasegawa, T., Saito, I., Saito, K.: Slidevqa: A dataset for document visual question answering on multiple images. In: AAAI (2023)

2023
[47]

Tang, Y., Yang, Y.: Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. In: COLM (2024)
[48]

Team, Q., et al.: Qwen2 technical report. arXiv preprint arXiv:2407.10671 (2024)
[49]

Tian, Y., Lu, Z., Gao, M., Liu, Z., Zhao, B.: Mmcr: Benchmarking cross-source reasoning in scientific papers. In: ICCV (2025)
[50]

Trivedi, H., Balasubramanian, N., Khot, T., Sabharwal, A.: Musique: Multihop questions via single-hop question composition. In: TACL. vol. 10, pp. 539–554 (2022)
[51]

Turski, M., Stanisławek, T., Kaczmarek, K., Dyda, P., Graliński, F.: Ccpdf: Building a high quality corpus for visually rich documents from web crawl data. In: ICDAR (2023)
[52]

Wang, D., Raman, N., Sibue, M., Ma, Z., Babkin, P., Kaur, S., Pei, Y., Nourbakhsh, A., Liu, X.: DocLLM: A layout-aware generative language model for multimodal document understanding. In: ACL (2024)
[53]

Wang, L., Yang, N., Huang, X., Jiao, B., Yang, L., Jiang, D., Majumder, R., Wei, F.: Text embeddings by weakly-supervised contrastive pre-training. In: arXiv preprint arXiv:2212.03533 (2022)
[54]

Wang, Q., Ding, R., Zeng, Y., Chen, Z., Chen, L., Wang, S., Xie, P., Huang, F., Zhao, F.: Vrag-rl: Empower vision-perception-based rag for visually rich information understanding via iterative reasoning with reinforcement learning. In: NeurIPS (2025)
[55]

Wang, Z., Zheng, X., An, K., Ouyang, C., Cai, J., Wang, Y., Wu, Y.: Stepsearch: Igniting llms search ability via step-wise proximal policy optimization. In: EMNLP (2025)
[56]

Wang, Z., Guan, T., Fu, P., Duan, C., Jiang, Q., Guo, Z., Guo, S., Luo, J., Shen, W., Yang, X.: Marten: Visual question answering with mask generation for multi-modal document understanding. In: CVPR (2025)
[57]

Wei, Z., Yao, W., Liu, Y., Zhang, W., Lu, Q., Qiu, L., Yu, C., Xu, P., Zhang, C., Yin, B., et al.: Webagent-r1: Training web agents via end-to-end multi-turn reinforcement learning. In: arXiv preprint arXiv:2505.16421 (2025)
[58]

Wu, J., Deng, Z., Li, W., Liu, Y., You, B., Li, B., Ma, Z., Liu, Z.: Mmsearch-r1: Incentivizing lmms to search. In: arXiv preprint arXiv:2506.20670 (2025)
[59]

Wu, J., Xia, Y., Yu, T., Chen, X., Harsha, S.S., Maharaj, A.V., Zhang, R., Bursztyn, V., Kim, S., Rossi, R.A., McAuley, J., Li, Y., Sinha, R.: Doc-react: Multi-page heterogeneous document question-answering. In: ACL (2025)
[60]

Wu, W., Guan, X., Huang, S., Jiang, Y., Xie, P., Huang, F., Cao, J., Zhao, H., Zhou, J.: Masksearch: A universal pre-training framework to enhance agentic search capability. In: arXiv preprint arXiv:2505.20285 (2025)
[61]

Wu, X., Tan, Y., Hou, N., Zhang, R., Cheng, H.: Molorag: Bootstrapping document understanding via multi-modal logic-aware retrieval. In: EMNLP (2025)
[62]

Xiao, H., Xie, Y., Tan, G., Chen, Y., Hu, R., Wang, K., Zhou, A., Li, H., Shao, H., Lu, X., et al.: Adaptive markup language generation for contextually-grounded visual document understanding. In: CVPR (2025)
[63]

Yang, Z., Tang, J., Li, Z., Wang, P., Wan, J., Zhong, H., Liu, X., Yang, M., Wang, P., Bai, S., et al.: Cc-ocr: A comprehensive and challenging ocr benchmark for evaluating large multimodal models in literacy. In: ICCV (2025)
[64]

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W.W., Salakhutdinov, R., Manning, C.D.: Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In: EMNLP (2018)
[65]

Yu, P., Zhao, Z., Zhang, S., Fu, L., Wang, X., Wen, Y.: Learning to reason in structured in-context environments with reinforcement learning. arXiv preprint arXiv:2509.23330 (2025)
[66]

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., et al.: Dapo: An open-source llm reinforcement learning system at scale. In: arXiv preprint arXiv:2503.14476 (2025)
[67]

Yu, Y., Ping, W., Liu, Z., Wang, B., You, J., Zhang, C., Shoeybi, M., Catanzaro, B.: Rankrag: Unifying context ranking with retrieval-augmented generation in llms. In: NeurIPS (2024)
[68]

Zeng, Y., Cao, T., Wang, D., Zhao, X., Qiu, Z., Ziyadi, M., Wu, T., Li, L.: Rare: Retrieval-aware robustness evaluation for retrieval-augmented generation systems. arXiv preprint arXiv:2506.00789 (2025)
[69]

Zeng, Z., Ivison, H., Wang, Y., Yuan, L., Li, S.S., Ye, Z., Li, S., He, J., Zhou, R., Chen, T., et al.: Rlve: Scaling up reinforcement learning for language models with adaptive verifiable environments. arXiv preprint arXiv:2511.07317 (2025)
[70]

Zhang, H., Feng, T., You, J.: Router-r1: Teaching llms multi-round routing and aggregation via reinforcement learning. In: NeurIPS (2025)
[71]

Zhang, Q., Lv, X., Wu, J., Li, B., Tao, Z., Yan, G., Zhang, H., Wang, B., Xu, J., Mi, H., et al.: Docdancer: Towards agentic document-grounded information seeking. arXiv preprint arXiv:2601.05163 (2026)
[72]

Zhao, Q., Wang, R., Xu, D., Zha, D., Liu, L.: R-search: Empowering llm reasoning with search via multi-reward reinforcement learning. In: arXiv preprint arXiv:2506.04185 (2025)
[73]

Zheng, Y., Fu, D., Hu, X., Cai, X., Ye, L., Lu, P., Liu, P.: Deepresearcher: Scaling deep research via reinforcement learning in real-world environments. In: EMNLP (2025)
[74]

Zhu, Z., Luo, C., Shao, Z., Gao, F., Xing, H., Zheng, Q., Zhang, J.: A simple yet effective layout token in large language models for document understanding. In: CVPR (2025)
