
arxiv: 2605.18774 · v1 · pith:OTD6S6E5new · submitted 2026-04-17 · 💻 cs.IR · cs.AI

M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

Pith reviewed 2026-05-21 01:02 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords document chunking · dependency parsing · multi-page documents · retrieval-augmented generation · vision-language models · structure recovery · multi-modal documents

The pith

M3DocDep recovers block dependencies in multi-page documents with vision-language models to produce coherent chunks for retrieval and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that chunking long industrial documents works better when it first reconstructs the document's logical tree of blocks rather than splitting on text alone. M3DocDep runs a shared detection and OCR step, builds multimodal embeddings with boundary-aware pooling, scores possible parent-child links with a biaffine head, and decodes a valid tree using minimum spanning tree constraints. The resulting tree then guides chunk creation that carries section paths and page ranges. A sympathetic reader cares because fragmented chunks break retrieval and force language models to answer from incomplete context, while tree-guided chunks preserve figure-caption links and cross-page relations.

Core claim

By recovering a globally consistent dependency tree over multimodal blocks and then chunking along that tree, M3DocDep produces retrieval units whose boundaries better match the document's intended structure, yielding higher scores on structure-aware evaluation, retrieval metrics, and downstream question answering.

What carries the argument

The biaffine head that scores candidate parent-child edges over multimodal block embeddings, decoded under MST constraints to produce a single valid document dependency tree.
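As a rough illustration of what a biaffine edge scorer computes, here is a minimal sketch in plain Python. It assumes the standard biaffine form (a bilinear term plus linear terms and a bias) and feeds the same block embedding into both the parent and child roles; the paper's exact parameterization is not given on this page, so `biaffine_score`, `score_matrix`, `U`, `w_head`, `w_dep`, and `b` are all hypothetical names.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def biaffine_score(head: Vector, dep: Vector, U: Matrix,
                   w_head: Vector, w_dep: Vector, b: float) -> float:
    """Score the candidate edge head -> dep as head^T U dep + linear terms + bias."""
    bilinear = sum(head[i] * U[i][j] * dep[j]
                   for i in range(len(head)) for j in range(len(dep)))
    linear = (sum(w * h for w, h in zip(w_head, head))
              + sum(w * d for w, d in zip(w_dep, dep)))
    return bilinear + linear + b


def score_matrix(embeddings: List[Vector], U: Matrix,
                 w_head: Vector, w_dep: Vector, b: float) -> Matrix:
    """scores[p][c] is the score of block p being the parent of block c.

    Self-edges are set to -inf so that a decoder never selects them.
    """
    n = len(embeddings)
    return [[float("-inf") if p == c else
             biaffine_score(embeddings[p], embeddings[c], U, w_head, w_dep, b)
             for c in range(n)] for p in range(n)]
```

Because `U` need not be symmetric, the score for "p is the parent of c" can differ from the score for "c is the parent of p", which is what lets the scorer express a direction for each candidate link.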

If this is right

  • Retrieval nDCG rises because chunks now respect section boundaries and visual relations.
  • Question-answering accuracy improves when the retriever supplies complete, non-fragmented context.
  • The same dependency tree can annotate chunks with explicit section paths and page ranges for downstream use.
  • Shared-block preprocessing lets the gains be measured without confounding differences in detection quality.
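To make the idea of tree-guided chunks with section paths and page ranges concrete, the sketch below walks a recovered tree and emits one chunk per heading. It is an illustration under stated assumptions, not the authors' chunking rule: the `Block` record, the `HEADINGS` set, the `"paragraph"` kind, and `chunk_by_tree` are hypothetical, and blocks are assumed to arrive in reading order.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Block:
    id: int
    kind: str               # e.g. "title", "section-title", "paragraph", "figure"
    text: str
    page: int
    parent: Optional[int]   # None for the root block


HEADINGS = {"title", "section-title"}


def chunk_by_tree(blocks: List[Block]) -> List[dict]:
    """Emit one chunk per heading, holding the content blocks that heading governs."""
    by_id = {b.id: b for b in blocks}
    children: Dict[int, List[int]] = {b.id: [] for b in blocks}
    for b in blocks:
        if b.parent is not None:
            children[b.parent].append(b.id)

    def section_path(block_id: int) -> List[str]:
        # Walk up to the root, collecting heading texts along the way.
        path = []
        cur: Optional[Block] = by_id[block_id]
        while cur is not None:
            if cur.kind in HEADINGS:
                path.append(cur.text)
            cur = by_id[cur.parent] if cur.parent is not None else None
        return list(reversed(path))

    def content_under(block_id: int) -> List[Block]:
        # Depth-first over non-heading descendants; nested headings get their own chunk.
        out: List[Block] = []
        for child in children[block_id]:
            if by_id[child].kind in HEADINGS:
                continue
            out.append(by_id[child])
            out.extend(content_under(child))
        return out

    chunks = []
    for b in blocks:
        if b.kind not in HEADINGS:
            continue
        members = content_under(b.id)
        if not members:
            continue
        pages = [b.page] + [m.page for m in members]
        chunks.append({
            "section_path": section_path(b.id),
            "page_range": (min(pages), max(pages)),
            "block_ids": [m.id for m in members],
            "text": "\n".join(m.text for m in members),
        })
    return chunks
```

Because a caption is a child of its figure in the tree, the depth-first walk keeps the two adjacent in the same chunk even when they sit on different pages, and the page range widens to cover both.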

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on domains with different visual conventions, such as legal contracts or scientific papers, to check whether the same tree-recovery step remains effective.
  • If the MST decoding step is replaced by a learned global parser, the method might scale to even larger multi-document collections without manual tree constraints.
  • Integrating the recovered trees into indexing systems would let users query by logical section rather than by raw page or text span.

Load-bearing premise

The decoded dependency tree must accurately capture the document's true logical hierarchy, including cross-page and figure-caption relations.

What would settle it

A side-by-side manual audit that finds frequent errors in the recovered parent-child links for cross-page or figure-caption pairs would undermine the claim that the gains come from chunks that follow the document's true structure.

Figures

Figures reproduced from arXiv: 2605.18774 by Heuiseok Lim, Jaehyung Seo, Jeongbae Park, Joongmin Shin.

Figure 1
Figure 1: Overview of M3DocDep. (a) SharedDet (DP+OCR) converts multi-page documents into Global Document Blocks V. (b) A frozen LVLM with SoftROI pooling produces multi-modal block embeddings e_i. (c) A biaffine scorer and MST decoder recover a global document dependency tree T. (d) Structure-Aware Dependency Chunking deterministically converts T into chunks C with section paths and page spans. Notation Across stag…
Figure 2
Figure 2: End-to-end qualitative example of M3DocDep. (a) A 5-page industrial document is input. (b) The recovered dependency subtree (cropped from full tree T): 1:title → 17:section-title → 19:figure → 20:figure-caption shows the figure-caption binding under the governing section. (c) Structure-aware chunking emits a chunk that keeps the figure crop and its caption together, annotated with the section path and pa…
Figure 1
Figure 1: Schematic of photon trajectory in the equatorial plane of a Kerr black hole, parametrized as r = r(φ). The labeled photon trajectory shows the relationship between the impact parameter b, radial distance r, azimuthal angle φ, and bending angle α . . . (b) Structure-based chunking # | Accurate closed-form trajectories of light around a Kerr black hole using asymptotic approximants ## 2. Light deflection: n…
Original abstract

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.
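The MST decoding step mentioned in the abstract can be illustrated with the classic Chu-Liu/Edmonds maximum spanning arborescence algorithm, a common choice for turning pairwise edge scores into a single rooted tree. The sketch below is a generic textbook version and assumes a complete score matrix with finite off-diagonal scores and block 0 as the root; whether the authors use this exact variant, or add further constraints, is not stated here. `decode_tree`, `_cle`, and `_find_cycle` are hypothetical names.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, int]


def _find_cycle(best_in: Dict[int, int], root: int) -> Optional[List[int]]:
    """Return the nodes of one cycle formed by the greedy parent choices, if any."""
    done: Set[int] = {root}
    for start in best_in:
        path: List[int] = []
        v = start
        while v not in done and v not in path:
            path.append(v)
            v = best_in[v]
        if v not in done:              # walked back into the current path: a cycle
            return path[path.index(v):]
        done.update(path)
    return None


def _cle(nodes: List[int], edges: Dict[Edge, float], root: int) -> Set[Edge]:
    # Step 1: every non-root node greedily picks its highest-scoring incoming edge.
    best_in = {v: max((u for u in nodes if (u, v) in edges),
                      key=lambda u: edges[(u, v)])
               for v in nodes if v != root}
    cycle = _find_cycle(best_in, root)
    if cycle is None:
        return {(u, v) for v, u in best_in.items()}

    # Step 2: contract the cycle into a fresh node and re-weight edges entering it.
    in_cycle = set(cycle)
    c = max(nodes) + 1
    new_edges: Dict[Edge, float] = {}
    origin: Dict[Edge, Edge] = {}
    for (u, v), w in edges.items():
        if u in in_cycle and v in in_cycle:
            continue
        if v in in_cycle:
            # Entering the cycle at v displaces v's current cycle edge.
            key, w = (u, c), w - edges[(best_in[v], v)]
        elif u in in_cycle:
            key = (c, v)
        else:
            key = (u, v)
        if key not in new_edges or w > new_edges[key]:
            new_edges[key], origin[key] = w, (u, v)
    new_nodes = [x for x in nodes if x not in in_cycle] + [c]
    contracted = _cle(new_nodes, new_edges, root)

    # Step 3: expand. Keep every cycle edge except the one displaced by the entering edge.
    result = {origin[e] for e in contracted}
    entered = next(v for (u, v) in result if v in in_cycle and u not in in_cycle)
    result |= {(best_in[x], x) for x in cycle if x != entered}
    return result


def decode_tree(scores: List[List[float]], root: int = 0) -> List[int]:
    """scores[p][c] scores edge p -> c. Returns parent[c] per node, with parent[root] = -1."""
    n = len(scores)
    nodes = list(range(n))
    edges = {(p, c): scores[p][c] for p in nodes for c in nodes
             if p != c and c != root}
    parent = [-1] * n
    for p, c in _cle(nodes, edges, root):
        parent[c] = p
    return parent
```

Greedy per-block parent selection alone can produce cycles (two blocks each choosing the other as parent); the contraction step is what guarantees the output is a single tree in which every block is reachable from the root.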

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes M3DocDep, an LVLM-based pipeline that recovers block-level dependencies in multi-page multimodal documents via SharedDet preprocessing, boundary-aware SoftROI pooling, biaffine parent-child scoring, and MST-constrained decoding, then performs tree-guided chunking annotated with section paths and page ranges. It reports gains of +28.5 to +39.6% STEDS on DHP benchmarks, +1.1 to +15.3% retrieval nDCG, and +4.5 to +15.3% QA ANLS on corpus-level RAG benchmarks under a shared-block evaluation protocol.

Significance. If the results prove robust, the work could advance RAG chunking for long industrial documents by demonstrating that explicit recovery of cross-page and figure-caption dependencies produces more coherent retrieval units than text-centric or generative hierarchy baselines. The shared-block protocol and multimodal embedding approach are constructive elements for fair comparison.

major comments (1)
  1. [Evaluation / Results] The manuscript reports substantial gains in STEDS, nDCG, and ANLS but supplies no intermediate metrics validating the recovered dependency trees, such as UAS, LAS, section-boundary F1, or tree-edit distance against gold hierarchies on any annotated subset of the DHP or RAG corpora. This is load-bearing for the central claim that the improvements arise from faithful recovery of the document's true logical structure (including cross-page and figure-caption relations); without these diagnostics, the gains could plausibly originate from SharedDet, SoftROI pooling, or the LVLM embeddings alone.
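For readers unfamiliar with the diagnostics requested above, the sketch below shows how unlabeled and labeled attachment scores (UAS, LAS) are usually computed once gold parent links exist. It is a generic definition borrowed from dependency parsing, not code from the paper; `attachment_scores` is a hypothetical helper, and the convention of counting the root block (parent -1) as an ordinary item is an assumption.

```python
from typing import List, Optional, Sequence, Tuple


def attachment_scores(
    gold_heads: Sequence[int],
    pred_heads: Sequence[int],
    gold_labels: Optional[Sequence[str]] = None,
    pred_labels: Optional[Sequence[str]] = None,
) -> Tuple[float, Optional[float]]:
    """Return (UAS, LAS); LAS is None when labels are not supplied.

    UAS: fraction of blocks whose predicted parent matches the gold parent.
    LAS: fraction of blocks whose parent and label are both correct.
    """
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted trees must cover the same blocks")
    n = len(gold_heads)
    if n == 0:
        return 0.0, None
    head_ok: List[bool] = [g == p for g, p in zip(gold_heads, pred_heads)]
    uas = sum(head_ok) / n
    if gold_labels is None or pred_labels is None:
        return uas, None
    las = sum(ok and gl == pl
              for ok, gl, pl in zip(head_ok, gold_labels, pred_labels)) / n
    return uas, las
```

Restricting the input lists to cross-page pairs or to figure-caption pairs would give the per-relation breakdown that the comment treats as load-bearing.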
minor comments (1)
  1. The abstract and methods should explicitly define all acronyms (STEDS, ANLS, nDCG, DHP) at first use and clarify the precise implementation details of the shared-block protocol for each baseline to support reproducibility.
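Two of the acronyms in question have widely used definitions that can be written down compactly. The sketch below gives nDCG@k with the common log2 discount and ANLS (Average Normalized Levenshtein Similarity) with the usual 0.5 threshold; these are the standard textbook forms, and the paper may differ in details such as gain function, cutoff k, or answer normalization. STEDS and DHP are left undefined here because this page does not spell them out. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def ndcg_at_k(ranked_rels: Sequence[float], k: int) -> float:
    """nDCG@k for one query, given graded relevance in ranked order.

    The ideal ranking is taken over the same list, which assumes all
    judged items appear in it.
    """
    def dcg(rels: Sequence[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(list(ranked_rels)) / ideal if ideal > 0 else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]], tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any gold answer."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Similarities at or beyond the threshold count as a miss.
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The threshold in ANLS is what separates a near-miss caused by OCR noise, which still earns partial credit, from a wrong answer, which earns none.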

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding evaluation of the recovered dependency trees below.

point-by-point responses
  1. Referee: [Evaluation / Results] The manuscript reports substantial gains in STEDS, nDCG, and ANLS but supplies no intermediate metrics validating the recovered dependency trees, such as UAS, LAS, section-boundary F1, or tree-edit distance against gold hierarchies on any annotated subset of the DHP or RAG corpora. This is load-bearing for the central claim that the improvements arise from faithful recovery of the document's true logical structure (including cross-page and figure-caption relations); without these diagnostics, the gains could plausibly originate from SharedDet, SoftROI pooling, or the LVLM embeddings alone.

    Authors: We agree that direct validation of the dependency trees via UAS, LAS, section-boundary F1, or tree-edit distance would help attribute the gains more precisely to structure recovery. However, neither the DHP nor the RAG corpora provide gold-standard block-level dependency annotations or hierarchical labels. Our evaluation therefore relies on downstream metrics (STEDS for chunk coherence, nDCG for retrieval, ANLS for QA) under a shared-block protocol that holds preprocessing and block detection fixed across methods. This design isolates the effect of the biaffine scoring and MST decoding steps. We will expand the revised manuscript with an explicit limitations paragraph and a qualitative error analysis of recovered trees on a small manually inspected sample to address this concern. revision: partial

standing simulated objections not resolved
  • The DHP and RAG corpora lack gold-standard annotations for block-level dependency trees, preventing computation of UAS, LAS, section-boundary F1, or tree-edit distance.

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

full rationale

The manuscript describes an LVLM pipeline (SharedDet preprocessing, SoftROI embeddings, biaffine scoring, MST decoding, tree-guided chunking) whose headline results are direct performance deltas on independent DHP and RAG corpora under a shared-block protocol. No equations, fitted parameters, or self-referential definitions appear in the provided text; the reported STEDS/nDCG/ANLS lifts are external measurements rather than quantities forced by construction from the method's own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that documents possess recoverable hierarchical block dependencies that can be inferred from visual and textual cues; no new physical entities or ad-hoc constants are introduced beyond standard LVLM components.

axioms (1)
  • domain assumption: Multi-page documents contain consistent parent-child relations between blocks that are detectable from multimodal features.
    Invoked in the description of the dependency scoring and MST decoding steps.

pith-pipeline@v0.9.0 · 5769 in / 1414 out tokens · 29959 ms · 2026-05-21T01:02:15.305283+00:00 · methodology


A. Datasets and Pre-processing Details

All datasets used in our experiments are publicly available research benchmarks. We rely exclusively on open corpora for both hierarchy pa...