PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
Pith reviewed 2026-05-23 19:37 UTC · model grok-4.3
The pith
PDF-WuKong adds an end-to-end sparse sampler to multimodal models so they can read and answer questions about long PDFs that mix text and images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM by selecting the paragraphs or diagrams most pertinent to user queries.
What carries the argument
The sparse sampler that selects pertinent paragraphs or diagrams from text and image representations of long PDFs.
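As a rough illustration of what query-conditioned sparse selection looks like, the sketch below ranks pre-computed chunk embeddings by cosine similarity to a query embedding and keeps the top k. This is an assumption-laden stand-in, not the paper's sampler: PDF-WuKong trains its sampler end to end, whereas the code here uses fixed vectors, and the names `cosine`, `sparse_sample`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sparse_sample(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) for the k chunks most similar to the query.

    Each chunk stands for one paragraph or one diagram; only the selected
    chunks would be passed on to the language model.
    """
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical, not taken from the paper).
chunks = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
selected = sparse_sample([1.0, 0.0], chunks, k=2)
selected_ids = [index for index, _ in selected]
```

With k fixed, the number of chunks handed to the model stays constant however long the PDF is, which is where the efficiency argument comes from.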
If this is right
- The model processes long PDFs containing interleaved text and images without being limited to plain text or a small number of images.
- It achieves higher F1 scores than prior open and proprietary models on multimodal document QA while using less computation.
- Training on the 1.1 million PaperPDF QA pairs enables the sampler to identify evidence sources relevant to user queries.
- The same architecture supports both English and Chinese academic papers.
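The F1 figures above are answer-level scores. The page does not say how the paper computes them, so the sketch below assumes the common token-overlap definition used in extractive QA benchmarks; the function name `token_f1` and the whitespace tokenizer are illustrative choices, and whitespace splitting would not be adequate for the Chinese subset.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


example_score = token_f1("the sparse sampler", "sparse sampler")
```

In this example precision is 2/3 and recall is 1, giving an F1 of 0.8.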
Where Pith is reading between the lines
- The sparse selection mechanism could be tested on other long multimodal inputs such as slide decks or technical reports.
- Releasing the dataset and code allows direct measurement of how much the sampler reduces token usage at inference time.
- If the end-to-end training of the sampler generalizes, similar sparse modules could be added to existing MLLMs without full retraining.
Load-bearing premise
The 1.1 million QA pairs constructed via the proposed strategies constitute high-quality, unbiased training and evaluation data that generalizes beyond the PaperPDF collection to arbitrary long PDFs.
What would settle it
Performance on a fresh collection of long PDFs whose QA pairs were built with entirely different methods falls to the level of baseline models.
Original abstract
Multimodal document understanding is a challenging task to process and comprehend large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved the performance of this task. However, existing methods typically focus on either plain text or a limited number of document images, struggling to handle long PDF documents with interleaved text and images, especially for academic papers. In this paper, we introduce PDF-WuKong, a multimodal large language model (MLLM) that is designed to enhance multimodal question-answering (QA) for long PDF documents. PDF-WuKong incorporates a sparse sampler that operates on both text and image representations, significantly improving the efficiency and capability of the MLLM. The sparse sampler selects the paragraphs or diagrams most pertinent to user queries. To effectively train and evaluate our model, we construct PaperPDF, a dataset consisting of a broad collection of English and Chinese academic papers. Multiple strategies are proposed to build high-quality 1.1 million QA pairs along with their corresponding evidence sources. Experimental results demonstrate the superiority and high efficiency of our approach over other models on the task of long multimodal document understanding, surpassing proprietary products by an average of 8.6% on F1. Our code and dataset will be released at https://github.com/yh-hust/PDF-Wukong.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PDF-WuKong, a multimodal large language model equipped with an end-to-end sparse sampler operating on text and image representations to enable efficient question answering over long PDF documents with interleaved content. It constructs the PaperPDF dataset comprising English and Chinese academic papers and generates 1.1 million QA pairs via multiple proposed strategies, reporting that the model surpasses other approaches—including proprietary products—by an average of 8.6% F1 on long multimodal document understanding tasks. Code and dataset release is promised.
Significance. If the central claims hold after addressing evaluation concerns, the work would offer a practical advance in scalable multimodal document understanding for lengthy academic PDFs, with the sparse sampling mechanism providing efficiency gains. The planned public release of code and the 1.1M-pair dataset constitutes a concrete contribution to reproducibility and community benchmarking in this area.
major comments (2)
- [Abstract] Abstract: the headline claim of an average 8.6% F1 improvement over proprietary models is stated without error bars, baseline implementation details, dataset split statistics, or statistical significance tests, rendering it impossible to evaluate whether the reported superiority is robust or load-bearing for the central experimental conclusion.
- [Abstract] Dataset construction paragraph (Abstract): the 1.1 million QA pairs are generated from the same PaperPDF collection using the authors' proposed strategies, yet no validation (human evaluation, inter-annotator agreement, leakage checks, or external test sets) is described to demonstrate that the pairs are free of construction artifacts and support generalization beyond the collection; this directly underpins the superiority and generalization claims.
minor comments (1)
- [Abstract] Abstract: the phrase 'high-quality 1.1 million QA pairs' is asserted without supporting metrics; moving any available quality statistics or ablation results on the construction strategies into the main text would improve clarity.
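The first major comment asks for error bars on the F1 gap. One way such an interval could be produced is a paired bootstrap over per-question scores, sketched below. This is a generic procedure, not something the paper reports; the function name `paired_bootstrap_ci` is hypothetical and the scores are toy values invented for the example.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, lower bound, upper bound).

    scores_a[i] and scores_b[i] must be the two systems' scores on the
    same question i, so that questions are resampled as pairs.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means: List[float] = []
    for _ in range(n_resamples):
        # Resample questions with replacement and record the mean gap.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy per-question F1 scores (made up for illustration, not from the paper).
model_f1 = [0.80, 0.60, 0.90, 0.70, 0.50, 0.75]
baseline_f1 = [0.70, 0.55, 0.80, 0.72, 0.40, 0.70]
observed, lower, upper = paired_bootstrap_ci(model_f1, baseline_f1)
```

If the lower bound stays above zero on the real per-question scores, the reported gap is unlikely to be a sampling artifact; an interval that straddles zero would support the referee's concern.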
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the abstract requires clarification to better support its claims and will revise it accordingly while preserving the core contributions. Below we respond point-by-point to the major comments.
- Referee: [Abstract] Abstract: the headline claim of an average 8.6% F1 improvement over proprietary models is stated without error bars, baseline implementation details, dataset split statistics, or statistical significance tests, rendering it impossible to evaluate whether the reported superiority is robust or load-bearing for the central experimental conclusion.
  Authors: We acknowledge that the abstract presents the 8.6% average F1 gain without accompanying statistical qualifiers. The full manuscript (Section 4 and associated tables) details the baselines, dataset splits (train/val/test), and per-task F1 scores from which the average is computed. In the revised version we will update the abstract to qualify the claim (e.g., “surpassing … by an average of 8.6% F1 across the reported tasks; see Section 4 for per-task results, splits, and implementation details”) and will add a brief reference to the statistical measures already present in the experimental section. We will not recompute new significance tests if they were not originally performed, but the existing results will be presented more transparently.
  Revision: partial
- Referee: [Abstract] Dataset construction paragraph (Abstract): the 1.1 million QA pairs are generated from the same PaperPDF collection using the authors' proposed strategies, yet no validation (human evaluation, inter-annotator agreement, leakage checks, or external test sets) is described to demonstrate that the pairs are free of construction artifacts and support generalization beyond the collection; this directly underpins the superiority and generalization claims.
  Authors: The abstract’s space constraints limited description of validation steps. Section 3 of the manuscript details the multi-strategy generation process and states that the pairs are accompanied by evidence sources. We agree that explicit validation metrics strengthen the claims. In the revision we will add a concise summary paragraph (or subsection) reporting any internal quality checks performed during construction, including any human spot-checks, leakage mitigation steps, and the use of held-out external test sets if available. If certain validation procedures (e.g., full inter-annotator agreement) were not conducted, we will state this transparently and discuss potential limitations.
  Revision: partial
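A leakage check of the kind the referee requests could be as simple as flagging test questions that nearly duplicate a training question. The sketch below compares word trigram sets with Jaccard similarity. It is an illustrative assumption rather than the authors' procedure: the names `ngrams`, `jaccard`, and `flag_leaks`, the 0.8 threshold, and the sample questions are all hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_leaks(
    train_questions: Sequence[str],
    test_questions: Sequence[str],
    n: int = 3,
    threshold: float = 0.8,
) -> List[Tuple[int, int, float]]:
    """Return (test_index, train_index, score) for near-duplicate question pairs."""
    train_grams = [ngrams(question, n) for question in train_questions]
    flagged = []
    for test_index, question in enumerate(test_questions):
        grams = ngrams(question, n)
        for train_index, other in enumerate(train_grams):
            score = jaccard(grams, other)
            if score >= threshold:
                flagged.append((test_index, train_index, score))
    return flagged


# Hypothetical questions, written for this example only.
train = [
    "What does Figure 5 show about motor control?",
    "Which dataset is used for training?",
]
test = [
    "what does figure 5 show about motor control?",
    "How many QA pairs are in the dataset?",
]
leaks = flag_leaks(train, test)
```

The all-pairs loop is quadratic, so a collection on the scale of 1.1 million pairs would need an approximate method such as hashing-based candidate retrieval before exact comparison.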
Circularity Check
No significant circularity in derivation chain
The paper is an empirical contribution introducing a multimodal model and a self-constructed dataset (PaperPDF) for training and evaluation. No mathematical derivation, equations, or first-principles chain is present that reduces to its own inputs by construction. The abstract describes dataset construction via 'multiple strategies' to produce 'high-quality 1.1 million QA pairs' and reports experimental F1 gains, but this is standard self-supervised or self-generated benchmark practice rather than a self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation of a uniqueness theorem. No ansatz smuggling, renaming of known results, or other enumerated patterns apply. The central performance claim is measured on the authors' data but does not equate to the inputs by definition; external validation is not required for the circularity analysis per the rules.
Forward citations
Cited by 4 Pith papers
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
  DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
- DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
  DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
- A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
  A survey of MLLM-based Visually Rich Document Understanding covering feature integration techniques, training paradigms, challenges like data scarcity, and emerging trends such as RAG and agentic frameworks.
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction
  Survey proposing a taxonomy for document parsing into pipeline-based systems and VLM-driven unified models, reviewing components, metrics, benchmarks, and challenges.