PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

Hao Yan; Jing Ding; Liang Yin; Minghui Liao; Wei Chen; Xiang Bai; Xudong Xie; Yang Liu; Yuliang Liu

arXiv: 2410.05970 · v3 · submitted 2024-10-08 · 💻 cs.CV · cs.AI · cs.CL

Pith reviewed 2026-05-23 19:37 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords multimodal document understanding · sparse sampling · long PDF reading · multimodal QA · large language models · academic papers · end-to-end model

The pith

PDF-WuKong adds an end-to-end sparse sampler to multimodal models so they can read and answer questions about long PDFs that mix text and images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PDF-WuKong, a multimodal large language model that uses a sparse sampler to pick only the most relevant paragraphs and diagrams from lengthy PDF documents. Existing models typically handle either plain text or a small number of images and therefore fail on long academic papers with interleaved content. The authors build the PaperPDF dataset of English and Chinese papers and generate 1.1 million QA pairs with evidence sources to train and test the model. Experiments show the approach is both more accurate and more efficient than prior methods, exceeding proprietary models by 8.6 percent on average F1 for long multimodal document understanding.

Core claim

PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM by selecting the paragraphs or diagrams most pertinent to user queries.

What carries the argument

The sparse sampler that selects pertinent paragraphs or diagrams from text and image representations of long PDFs.
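The selection step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the query and every paragraph or diagram have already been embedded into a shared vector space, that relevance is scored by cosine similarity, and that a fixed top-k cut-off is used. The names `cosine` and `sparse_sample` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sparse_sample(query_vec: Sequence[float],
                  chunk_vecs: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Return the indices of the k chunks most similar to the query.

    Assumption: each chunk vector stands for one paragraph or one diagram.
    Indices are returned in document order so the selected evidence can be
    passed to the language model in its original reading order.
    """
    if k <= 0:
        return []
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    # Highest similarity first; ties broken by earlier position in the document.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return sorted(idx for _, idx in top)


# Toy example: chunks 0 and 2 point in nearly the same direction as the query.
chunks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(sparse_sample([1.0, 0.0, 0.0], chunks, k=2))  # [0, 2]
```

Only the selected chunks would then be handed to the language model, which is where the efficiency gain described above would come from.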

If this is right

The model processes long PDFs containing interleaved text and images without being limited to plain text or a small number of images.
It achieves higher F1 scores than prior open and proprietary models on multimodal document QA while using less computation.
Training on the 1.1 million PaperPDF QA pairs enables the sampler to identify evidence sources relevant to user queries.
The same architecture supports both English and Chinese academic papers.
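For readers unfamiliar with the metric behind the F1 comparison, the sketch below shows one common way to score a predicted answer against a reference: token-overlap F1, as used in many extractive QA benchmarks. This is an assumption for illustration; the page does not state which F1 definition the paper uses, and `token_f1` is a hypothetical name.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: whitespace tokenization and lowercasing only; real
    benchmarks often also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision 2/3, recall 1, so F1 = 0.8 (up to floating-point rounding).
print(token_f1("the sparse sampler", "sparse sampler"))
```

An average F1 over a test set is then the mean of these per-question scores.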

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The sparse selection mechanism could be tested on other long multimodal inputs such as slide decks or technical reports.
Releasing the dataset and code allows direct measurement of how much the sampler reduces token usage at inference time.
If the end-to-end training of the sampler generalizes, similar sparse modules could be added to existing MLLMs without full retraining.

Load-bearing premise

The 1.1 million QA pairs constructed via the proposed strategies constitute high-quality, unbiased training and evaluation data that generalizes beyond the PaperPDF collection to arbitrary long PDFs.

What would settle it

Performance on a fresh collection of long PDFs whose QA pairs were built with entirely different methods falls to the level of baseline models.

Figures

Figures reproduced from arXiv: 2410.05970 by Hao Yan, Jing Ding, Liang Yin, Minghui Liao, Wei Chen, Xiang Bai, Xudong Xie, Yang Liu, Yuliang Liu.

**Figure 2.** The overall structure of PDF-WuKong consists of a document parser, a sparse sampler and a large language model.

**Figure 3.** The construction process of PaperPDF based on single evidence and multiple evidence.

**Figure 4.** Ablation study of different document length

**Figure 5.** Qualitative comparison between PDF-WuKong with other proprietary products. The red box indicates the evidence that the ...

**Figure 6.** Examples of PDF-WuKong on Chinese documents. The red box indicates the evidence that the correct answer depends on.

**Figure 7.** Text-only Q-E-A triplets generation prompt and data example.

**Figure 9.** Text-image Q-E-A triplets generation prompt and data example.

**Figure 11.** Cross-paragraph Q-E-A triplets generation prompt and data example.

Original abstract

Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially for academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) that is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build high-quality 1.1 million QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PDF-WuKong's joint sparse sampler for text and images in long PDFs is a practical efficiency move, but the 8.6% F1 gain sits on 1.1M self-generated QA pairs whose quality and lack of bias are not demonstrated in the abstract.

The new element is the end-to-end sparse sampler that jointly selects text paragraphs and image diagrams from interleaved long PDFs, aimed at academic papers. This avoids full-context processing while keeping the model multimodal. They pair it with PaperPDF, a collection of English and Chinese papers, and generate 1.1 million QA pairs plus evidence using several strategies. The model is reported to beat other MLLMs and proprietary systems by 8.6% average F1 while staying efficient, and code plus data are promised for release.

Referee Report

2 major / 1 minor

Summary. The paper introduces PDF-WuKong, a multimodal large language model equipped with an end-to-end sparse sampler operating on text and image representations to enable efficient question answering over long PDF documents with interleaved content. It constructs the PaperPDF dataset comprising English and Chinese academic papers and generates 1.1 million QA pairs via multiple proposed strategies, reporting that the model surpasses other approaches—including proprietary products—by an average of 8.6% F1 on long multimodal document understanding tasks. Code and dataset release is promised.

Significance. If the central claims hold after addressing evaluation concerns, the work would offer a practical advance in scalable multimodal document understanding for lengthy academic PDFs, with the sparse sampling mechanism providing efficiency gains. The planned public release of code and the 1.1M-pair dataset constitutes a concrete contribution to reproducibility and community benchmarking in this area.

major comments (2)

[Abstract] Abstract: the headline claim of an average 8.6% F1 improvement over proprietary models is stated without error bars, baseline implementation details, dataset split statistics, or statistical significance tests, rendering it impossible to evaluate whether the reported superiority is robust or load-bearing for the central experimental conclusion.
[Abstract] Dataset construction paragraph (Abstract): the 1.1 million QA pairs are generated from the same PaperPDF collection using the authors' proposed strategies, yet no validation (human evaluation, inter-annotator agreement, leakage checks, or external test sets) is described to demonstrate that the pairs are free of construction artifacts and support generalization beyond the collection; this directly underpins the superiority and generalization claims.
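The error bars requested in the first major comment could be obtained with a paired bootstrap over per-question scores. The sketch below is one standard way to do this, not a procedure described in the paper. It assumes per-question F1 scores are available for both systems on the same questions, and `paired_bootstrap_ci` is a hypothetical name.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(scores_a: List[float],
                        scores_b: List[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for mean(scores_a - scores_b).

    Assumption: scores_a[i] and scores_b[i] are the two systems' scores on
    the same question i, so questions are resampled as pairs.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

If the resulting interval excludes zero, the reported gain is unlikely to be an artifact of which questions happened to land in the test set. It says nothing about bias in how the questions were generated, which is the subject of the second comment.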

minor comments (1)

[Abstract] Abstract: the phrase 'high-quality 1.1 million QA pairs' is asserted without supporting metrics; moving any available quality statistics or ablation results on the construction strategies into the main text would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the abstract requires clarification to better support its claims and will revise it accordingly while preserving the core contributions. Below we respond point-by-point to the major comments.

Referee: [Abstract] Abstract: the headline claim of an average 8.6% F1 improvement over proprietary models is stated without error bars, baseline implementation details, dataset split statistics, or statistical significance tests, rendering it impossible to evaluate whether the reported superiority is robust or load-bearing for the central experimental conclusion.

Authors: We acknowledge that the abstract presents the 8.6% average F1 gain without accompanying statistical qualifiers. The full manuscript (Section 4 and associated tables) details the baselines, dataset splits (train/val/test), and per-task F1 scores from which the average is computed. In the revised version we will update the abstract to qualify the claim (e.g., “surpassing … by an average of 8.6% F1 across the reported tasks; see Section 4 for per-task results, splits, and implementation details”) and will add a brief reference to the statistical measures already present in the experimental section. We will not recompute new significance tests if they were not originally performed, but the existing results will be presented more transparently.

Revision: partial

Referee: [Abstract] Dataset construction paragraph (Abstract): the 1.1 million QA pairs are generated from the same PaperPDF collection using the authors' proposed strategies, yet no validation (human evaluation, inter-annotator agreement, leakage checks, or external test sets) is described to demonstrate that the pairs are free of construction artifacts and support generalization beyond the collection; this directly underpins the superiority and generalization claims.

Authors: The abstract’s space constraints limited description of validation steps. Section 3 of the manuscript details the multi-strategy generation process and states that the pairs are accompanied by evidence sources. We agree that explicit validation metrics strengthen the claims. In the revision we will add a concise summary paragraph (or subsection) reporting any internal quality checks performed during construction, including any human spot-checks, leakage mitigation steps, and the use of held-out external test sets if available. If certain validation procedures (e.g., full inter-annotator agreement) were not conducted, we will state this transparently and discuss potential limitations.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper is an empirical contribution introducing a multimodal model and a self-constructed dataset (PaperPDF) for training and evaluation. No mathematical derivation, equations, or first-principles chain is present that reduces to its own inputs by construction. The abstract describes dataset construction via 'multiple strategies' to produce 'high-quality 1.1 million QA pairs' and reports experimental F1 gains, but this is standard self-supervised or self-generated benchmark practice rather than a self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation of a uniqueness theorem. No ansatz smuggling, renaming of known results, or other enumerated patterns apply. The central performance claim is measured on the authors' data but does not equate to the inputs by definition; external validation is not required for the circularity analysis per the rules.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the sparse sampler and dataset construction strategies are treated as new contributions whose internal assumptions are not stated.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
cs.AI · 2026-04 · unverdicted · novelty 5.0

DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
cs.CV · 2025-07 · unverdicted · novelty 3.0

A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
cs.MM · 2024-10 · unverdicted · novelty 3.0

Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.

Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · cited by 3 Pith papers · 15 internal anchors

[1]

Pdftriage: Question answering over long, structured documents

Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A Rossi, and Franck Dernon- court. Pdftriage: Question answering over long, structured documents. arXiv preprint arXiv:2309.08872, 2023. 1

work page arXiv 2023
[2]

Prem Jacob, Beatriz Lucia Salvador Bizotto, and Mithi- leysh Sathiyanarayanan

T. Prem Jacob, Beatriz Lucia Salvador Bizotto, and Mithi- leysh Sathiyanarayanan. Constructing the chatgpt for pdf files with langchain – ai. In 2024 International Conference on Inventive Computation Technologies (ICICT), pages 835– 839, 2024. 1

work page 2024
[3]

YaRN: Efficient context window extension of large language models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2024. 1, 2, 3

work page 2024
[4]

LongloRA: Efficient fine-tuning of long-context large language models

Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhi- jian Liu, Song Han, and Jiaya Jia. LongloRA: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representa- tions, 2024. 2, 3

work page 2024
[5]

Fo- cused transformer: Contrastive training for context scaling

Szymon Tworkowski, Konrad Staniszewski, Mikoł aj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Mił o ´s. Fo- cused transformer: Contrastive training for context scaling. In Advances in Neural Information Processing Systems, vol- ume 36, pages 42661–42688, 2023. 1, 2, 3

work page 2023
[6]

Disc-lawllm: Fine-tuning large language models for intelligent legal services.arXiv preprint arXiv:2309.11325, 2023

Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, et al. Disc-lawllm: Fine-tuning large language models for intelligent legal services.arXiv preprint arXiv:2309.11325, 2023. 1, 2, 3

work page arXiv 2023
[7]

Disc-finllm: A chinese financial large language model based on multiple experts fine-tuning

Wei Chen, Qiushi Wang, Zefei Long, Xianyin Zhang, Zhongtian Lu, Bingxuan Li, Siyuan Wang, Jiarong Xu, Xi- ang Bai, Xuanjing Huang, et al. Disc-finllm: A chinese fi- nancial large language model based on multiple experts fine- tuning. arXiv preprint arXiv:2310.15205, 2023

work page arXiv 2023
[8]

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang and Yi Yang. Multihop-rag: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Lar- son. From local to global: A graph rag approach to query- focused summarization. arXiv preprint arXiv:2404.16130 ,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Textmonkey: An ocr-free large multimodal model for understanding document

Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024. 2, 3, 7

work page arXiv 2024
[11]

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhang- wei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821, 2024. 2, 3, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Vary: Scaling up the vision vocabulary for large vision-language model

Haoran Wei, Lingyu Kong, Jinyue Chen, Liang Zhao, Zheng Ge, Jinrong Yang, Jianjian Sun, Chunrui Han, and Xiangyu Zhang. Vary: Scaling up the vision vocabulary for large vision-language model. In European Conference on Com- puter Vision, pages 408–424. Springer, 2024. 2, 3, 7

work page 2024
[13]

Focus anywhere for fine- grained multi-page document understanding

Chenglong Liu, Haoran Wei, Jinyue Chen, Lingyu Kong, Zheng Ge, Zining Zhu, Liang Zhao, Jianjian Sun, Chun- rui Han, and Xiangyu Zhang. Focus anywhere for fine- grained multi-page document understanding. arXiv preprint arXiv:2405.14295, 2024. 2, 3

work page arXiv 2024
[14]

Hi- erarchical multimodal transformers for multipage docvqa

Rub `en Tito, Dimosthenis Karatzas, and Ernest Valveny. Hi- erarchical multimodal transformers for multipage docvqa. Pattern Recognition, 144:109834, 2023. 2, 3

work page 2023
[15]

Slidevqa: A dataset for document visual question answering on multiple images

Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. Slidevqa: A dataset for document visual question answering on multiple images. In AAAI, pages 13636–13645, 2023. 2

work page 2023
[16]

Gram: Global reasoning for multi-page vqa

Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, and Ron Litman. Gram: Global reasoning for multi-page vqa. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 15598–15607,

work page
[17]

Document understanding dataset and evaluation (dude)

Jordy Van Landeghem, Rub `en Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Ju- rkiewicz, Micka¨el Coustaty, Bertrand Anckaert, Ernest Val- veny, et al. Document understanding dataset and evaluation (dude). In Proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 19528–19540, 2023. 2, 5, 7, 13

work page 2023
[18]

Needle in a multimodal haystack

Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, et al. Needle in a multimodal haystack. arXiv preprint arXiv:2406.07230, 2024. 2, 7

work page arXiv 2024
[19]

RAPTOR: Re- cursive abstractive processing for tree-organized retrieval

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. RAPTOR: Re- cursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Represen- tations, 2024. 2, 3

work page 2024
[20]

Unidoc: A univer- sal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023

Hao Feng, Zijian Wang, Jingqun Tang, Jinghui Lu, Wengang Zhou, Houqiang Li, and Can Huang. Unidoc: A univer- sal large multimodal model for simultaneous text detection, recognition, spotting and understanding, 2023. 3

work page 2023
[21]

mplug-docowl: Modularized multimodal large language model for document understanding

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Jun- feng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023. 3

work page arXiv 2023
[22]

Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. arXiv preprint arXiv:2310.05126, 2023. 3

work page arXiv 2023
[23]

Llava-next: Im- 9 proved reasoning, ocr, and world knowledge, January 2024

Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Im- 9 proved reasoning, ocr, and world knowledge, January 2024. 3, 7

work page 2024
[24]

Internlm-xcomposer2-4khd: A pioneer- ing large vision-language model handling resolutions from 336 pixels to 4k hd

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2-4khd: A pioneer- ing large vision-language mo...

work page arXiv 2024
[25]

mplug- docowl2: High-resolution compressing for ocr-free multi- page document understanding, 2024

Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug- docowl2: High-resolution compressing for ocr-free multi- page document understanding, 2024. 3, 6, 7

work page 2024
[26]

Cream: Coarse-to- fine retrieval and multi-modal efficient tuning for document vqa

Jinxu Zhang, Yongqi Yu, and Yu Zhang. Cream: Coarse-to- fine retrieval and multi-modal efficient tuning for document vqa. In Proceedings of the 32nd ACM International Confer- ence on Multimedia, pages 925–934, 2024. 3, 7

work page 2024
[27]

Efficient attentions for long document summa- rization

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summa- rization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, pages 1419–1436,

work page 2021
[28]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Pro- ceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4599–4610, 2021. 2

work page 2021
[29]

Pub- laynet: largest dataset ever for document layout analysis

Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. Pub- laynet: largest dataset ever for document layout analysis. In 2019 International conference on document analysis and recognition (ICDAR), pages 1015–1022. IEEE, 2019. 2

work page 2019
[30]

Docbank: A bench- mark dataset for document layout analysis

Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, and Ming Zhou. Docbank: A bench- mark dataset for document layout analysis. arXiv preprint arXiv:2006.01038, 2020

work page arXiv 2006
[31]

Doclaynet: a large human-annotated dataset for document-layout segmentation

Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter Staar. Doclaynet: a large human-annotated dataset for document-layout segmentation. InProceedings of the 28th ACM SIGKDD Conference on Knowledge Discov- ery and Data Mining, pages 3743–3751, 2022. 2

work page 2022
[32]

Docile benchmark for document information localization and extraction

ˇStˇep´an ˇSimsa, Milan ˇSulc, Michal U ˇriˇc´aˇr, Yash Patel, Ahmed Hamdi, Mat ˇej Koci´an, Maty´aˇs Skalick `y, Jiˇr´ı Matas, Antoine Doucet, Micka¨el Coustaty, et al. Docile benchmark for document information localization and extraction. pages 147–166, 2023. 2

work page 2023
[33]

Cord: A con- solidated receipt dataset for post-ocr parsing

Seunghyun Park, Seung Shin, Bado Lee, Junyeop Lee, Jae- heung Surh, Minjoon Seo, and Hwalsuk Lee. Cord: A con- solidated receipt dataset for post-ocr parsing. In Workshop on Document Intelligence at NeurIPS, 2019

work page 2019
[34]

Icdar2019 com- petition on scanned receipt ocr and information extraction

Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthe- nis Karatzas, Shijian Lu, and CV Jawahar. Icdar2019 com- petition on scanned receipt ocr and information extraction. In ICDAR, pages 1516–1520, 2019. 2

work page 2019
[35]

Minesh Mathew, Dimosthenis Karatzas, and C. V . Jawahar. Docvqa: A dataset for vqa on document images. In WACV, pages 2200–2209, 2021. 2, 5, 7

work page 2021
[36]

Ocr-vqa: Visual question answering by reading text in images

Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In ICDAR, pages 947–952, 2019. 2

work page 2019
[37]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question an- swering about charts with visual and logical reasoning.arXiv preprint arXiv:2203.10244, 2022. 2, 5, 7

work page internal anchor Pith review Pith/arXiv arXiv 2022
[38]

Chartx & chartvlm: A versatile bench- mark and foundation model for complicated chart reasoning

Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, et al. Chartx & chartvlm: A versatile bench- mark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185, 2024. 2

work page arXiv 2024
[39]

Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models

Lei Li, Yuqi Wang, Runxin Xu, Peiyi Wang, Xiachong Feng, Lingpeng Kong, and Qi Liu. Multimodal arxiv: A dataset for improving scientific comprehension of large vision-language models. arXiv preprint arXiv:2403.00231, 2024. 2

work page arXiv 2024
[40]

Minesh Mathew, Viraj Bagal, Rub `en Tito, Dimosthenis Karatzas, Ernest Valveny, and C.V . Jawahar. Infographicvqa. In WACV, pages 1697–1706, 2022. 2, 5, 7

work page 2022
[41]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Jun- yang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024. 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Docgenome: An open large- scale scientific document benchmark for training and test- ing multi-modal large language models

Renqiu Xia, Song Mao, Xiangchao Yan, Hongbin Zhou, Bo Zhang, Haoyang Peng, Jiahao Pi, Daocheng Fu, Wen- jie Wu, Hancheng Ye, et al. Docgenome: An open large- scale scientific document benchmark for training and test- ing multi-modal large language models. arXiv preprint arXiv:2406.11633, 2024. 2

work page arXiv 2024
[43]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrit- twieser, et al. Gemini 1.5: Unlocking multimodal under- standing across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024. 4

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Gpt-4v(ision) system card

OpenAI. Gpt-4v(ision) system card. https://openai. com/contributions/gpt-4v, 2023. 4, 7

work page 2023
[45]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embed- dings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Hi- erarchical multimodal transformers for multipage docvqa

Rub `en Tito, Dimosthenis Karatzas, and Ernest Valveny. Hi- erarchical multimodal transformers for multipage docvqa. Pattern Recognition, 144:109834, 2023. 5, 6, 7, 13

work page 2023
[47]

https://github.com/kermitt2/grobid, 2008–2024

Grobid. https://github.com/kermitt2/grobid, 2008–2024. 5

work page 2008
[48]

Mineru: An open-source solution for precise document content extraction, 2024

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, and Conghui He. Mineru: An open-source solution for precise document content extraction, 2024. 5
[49]

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, and Jiaqi Wang. Internlm-xcomposer2: Mastering free-form text-image composition and compr...
[50]

Gemini: A Family of Highly Capable Multimodal Models

Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023. 6

[51]

Moonshot AI. Kimi. https://kimi.moonshot.cn ,

[52]

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Team GLM, Aohan Zeng, Bin Xu, Bowen Wang, Chenhui Zhang, Da Yin, Diego Rojas, Guanyu Feng, Hanlin Zhao, Hanyu Lai, et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793, 2024. 6

[53]

Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024. 6
[54]

Layoutlmv3: Pre-training for document ai with unified text and image masking

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4083–4091,

[55]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. 7
[56]

Big bird: Transformers for longer sequences

Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020. 5, 7
[57]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023. 6, 7
[58]

Monkey: Image resolution and text label are important things for large multi-modal models

Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607,
[59]

Generative multimodal models are in-context learners

Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Generative multimodal models are in-context learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14398–14409, 2024. 6, 7
[60]

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. Minicpm: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395, 2024. 6, 7

[61]

CogVLM2: Visual Language Models for Image and Video Understanding

Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, et al. Cogvlm2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024. 6
[62]

Texthawk: Exploring efficient fine-grained perception of multimodal large language models

Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, and Wei Zeng. Texthawk: Exploring efficient fine-grained perception of multimodal large language models. arXiv preprint arXiv:2404.09204, 2024. 7
[63]

Docformerv2: Local features for document understanding

Srikar Appalaraju, Peng Tang, Qi Dong, Nishant Sankaran, Yichu Zhou, and R Manmatha. Docformerv2: Local features for document understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 709–718, 2024. 7
[64]

Obelics: An open web-scale filtered dataset of interleaved image-text documents

Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander Rush, Douwe Kiela, et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents. Advances in Neural Information Processing Systems, 36, 2024. 7
[65]

Vila: On pre-training for visual language models

Ji Lin, Hongxu Yin, Wei Ping, Pavlo Molchanov, Mohammad Shoeybi, and Song Han. Vila: On pre-training for visual language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26689–26699, 2024. 7
[66]

https://tongyi.aliyun.com/qianwen/

Qwen. https://tongyi.aliyun.com/qianwen/. 12
[67]

https://chatglm.cn/

ChatGLM. https://chatglm.cn/. 12

[68]

https://kimi.moonshot.cn/

Kimi. https://kimi.moonshot.cn/. 12

[69]

https://gemini.google.com/

Gemini-Pro. https://gemini.google.com/. 12

A. Algorithm

Algorithm 1 shows the detailed inference process of PDF-WuKong. The training pipeline is shown in Algorithm 2. Our PDF-WuKong can achieve efficient and accurate understanding of long PDFs with end-to-end sparse sampling.

Algorithm 1 Inference pipeline for PDF-WuKong
1: Input: PDF document D, us...
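The listing of Algorithm 1 is cut off in this copy, so the following is only a minimal sketch of query-conditioned sparse sampling in general, not the authors' procedure. It assumes, hypothetically, that every paragraph or figure chunk of the PDF already has an embedding vector, that relevance is scored by cosine similarity against the query embedding, and that a fixed top-k cutoff is used. The names `cosine` and `sparse_sample` and the parameter `k` are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_sample(query_vec: Sequence[float],
                  chunk_vecs: Sequence[Sequence[float]],
                  k: int) -> List[int]:
    """Return the indices of the k chunks most similar to the query.

    Assumption: chunk_vecs holds one embedding per text paragraph or figure.
    Indices are returned in document order so the selected chunks can be
    passed on in their original reading sequence.
    """
    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    # Highest similarity first; ties are broken by earlier position in the document.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(idx for _, idx in scored[:k])


# Toy example with 2-dimensional embeddings (made-up values).
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = sparse_sample([1.0, 0.0], chunks, k=2)
print(selected)
```

Only the selected chunks would then be handed to the language model, which is what keeps the input length roughly independent of the number of pages under these assumptions.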
[Figure 6. Examples of PDF-WuKong on Chinese documents. The red box indicates the evidence that the correct a...]

