M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

Heuiseok Lim; Jaehyung Seo; Jeongbae Park; Joongmin Shin

arxiv: 2605.18774 · v1 · pith:OTD6S6E5new · submitted 2026-04-17 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-21 01:02 UTC · model grok-4.3

keywords document chunking · dependency parsing · multi-page documents · retrieval-augmented generation · vision-language models · structure recovery · multi-modal documents

The pith

M3DocDep recovers block dependencies in multi-page documents with vision-language models to produce coherent chunks for retrieval and generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that chunking long industrial documents works better when it first reconstructs the document's logical tree of blocks rather than splitting on text alone. M3DocDep runs a shared detection and OCR step, builds multimodal embeddings with boundary-aware pooling, scores possible parent-child links with a biaffine head, and decodes a valid tree using minimum spanning tree constraints. The resulting tree then guides chunk creation that carries section paths and page ranges. A sympathetic reader cares because fragmented chunks break retrieval and force language models to answer from incomplete context, while tree-guided chunks preserve figure-caption links and cross-page relations.
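The last step, turning a recovered tree into chunks that carry section paths and page ranges, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's chunker: the `Block` and `Chunk` types, the `HEADING_KINDS` set, and the rule "group every non-heading block under its nearest heading ancestor" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One detected block. Field names are illustrative assumptions."""
    idx: int      # position in the block list (assumed equal to list index)
    kind: str     # e.g. "title", "section-title", "paragraph", "figure"
    text: str
    page: int
    parent: int   # index of the parent block in the tree, -1 for the root


@dataclass
class Chunk:
    section_path: list
    pages: tuple  # (first_page, last_page)
    block_ids: list = field(default_factory=list)


HEADING_KINDS = {"title", "section-title"}  # assumed heading labels


def heading_path(blocks, idx):
    """Heading texts from the root down to block `idx` (inclusive)."""
    path = []
    while idx != -1:
        if blocks[idx].kind in HEADING_KINDS:
            path.append(blocks[idx].text)
        idx = blocks[idx].parent
    return path[::-1]


def chunk_by_tree(blocks):
    """Group each non-heading block under its nearest heading ancestor."""
    chunks = {}
    for b in blocks:
        if b.kind in HEADING_KINDS:
            continue
        owner = b.parent
        while owner != -1 and blocks[owner].kind not in HEADING_KINDS:
            owner = blocks[owner].parent
        if owner not in chunks:
            chunks[owner] = Chunk(heading_path(blocks, owner), (b.page, b.page))
        chunk = chunks[owner]
        chunk.block_ids.append(b.idx)
        chunk.pages = (min(chunk.pages[0], b.page), max(chunk.pages[1], b.page))
    return list(chunks.values())


# Toy document: a figure and its caption sit on page 2 under a page-1 section.
blocks = [
    Block(0, "title", "Pump Manual", 1, -1),
    Block(1, "section-title", "2. Maintenance", 1, 0),
    Block(2, "paragraph", "Inspect the seal monthly.", 1, 1),
    Block(3, "figure", "<figure crop>", 2, 1),
    Block(4, "figure-caption", "Seal assembly.", 2, 3),
    Block(5, "paragraph", "Replace worn seals.", 2, 1),
]
chunks = chunk_by_tree(blocks)
```

In this toy case the caption follows its figure into the same chunk because the walk up the tree passes through the figure block before reaching the section heading, which is the kind of binding a text-only splitter would not see.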

Core claim

By recovering a globally consistent dependency tree over multimodal blocks and then chunking along that tree, M3DocDep produces retrieval units whose boundaries better match the document's intended structure, yielding higher scores on structure-aware evaluation, retrieval metrics, and downstream question answering.

What carries the argument

The biaffine head that scores candidate parent-child edges over multimodal block embeddings, decoded under MST constraints to produce a single valid document dependency tree.
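To make the scoring-then-decoding idea concrete, here is a small pure-Python sketch. It uses one common form of a biaffine score (a bilinear term plus linear terms and a bias) and a deliberately simplified decoder. The decoder is an assumption of this example and not the paper's MST procedure: it supposes that block 0 is the root and that every parent precedes its child in reading order, in which case picking the best earlier block for each child cannot create a cycle. General Chu-Liu/Edmonds decoding is needed once that ordering restriction is dropped. All parameter names are hypothetical.

```python
def biaffine_score(child, parent, U, w_child, w_parent, b):
    """s = child^T U parent + w_child . child + w_parent . parent + b."""
    bilinear = sum(
        child[i] * U[i][j] * parent[j]
        for i in range(len(child))
        for j in range(len(parent))
    )
    linear = sum(wc * c for wc, c in zip(w_child, child))
    linear += sum(wp * p for wp, p in zip(w_parent, parent))
    return bilinear + linear + b


def decode_tree(embeddings, params):
    """Return heads[i] = parent index of block i, with -1 for the root.

    Simplifying assumption: parents precede children in reading order, so
    each block independently takes its best-scoring earlier block and the
    result is always a valid tree rooted at block 0.
    """
    heads = [-1]
    for c in range(1, len(embeddings)):
        best = max(
            range(c),
            key=lambda p: biaffine_score(embeddings[c], embeddings[p], *params),
        )
        heads.append(best)
    return heads


# Toy 2-d embeddings for: title, section-title, figure, figure-caption.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
params = (identity, [0.0, 0.0], [0.0, 0.0], 0.0)  # score reduces to a dot product
heads = decode_tree(embeddings, params)
```

With these toy vectors the decoded chain is title, then section-title, then figure, then figure-caption, mirroring the subtree shape described for the paper's Figure 2.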

If this is right

Retrieval nDCG rises because chunks now respect section boundaries and visual relations.
Question-answering accuracy improves when the retriever supplies complete, non-fragmented context.
The same dependency tree can annotate chunks with explicit section paths and page ranges for downstream use.
Shared-block preprocessing lets the gains be measured without confounding differences in detection quality.
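For readers who want to see what a change in retrieval nDCG means mechanically, the following sketch computes nDCG@k with the standard logarithmic discount. The relevance grades are made-up illustrations, not values from the paper, and the function names are this example's own.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevances."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    `all_relevances` holds the grades of every judged chunk for the query,
    so the ideal ranking can include relevant chunks the system missed.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical grades: the same relevant chunk ranked second versus first.
judged = [1, 0, 0]
late_hit = ndcg_at_k([0, 1, 0], judged, k=3)
early_hit = ndcg_at_k([1, 0, 0], judged, k=3)
```

Because the discount grows with rank, a chunking scheme that lets the retriever surface the complete relevant unit earlier raises the score even when the set of retrieved chunks is otherwise unchanged.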

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on domains with different visual conventions, such as legal contracts or scientific papers, to check whether the same tree-recovery step remains effective.
If the MST decoding step is replaced by a learned global parser, the method might scale to even larger multi-document collections without manual tree constraints.
Integrating the recovered trees into indexing systems would let users query by logical section rather than by raw page or text span.

Load-bearing premise

The decoded dependency tree must accurately capture the document's true logical hierarchy, including cross-page and figure-caption relations.

What would settle it

A side-by-side manual audit that finds frequent errors in the recovered parent-child links for cross-page or figure-caption pairs would undermine the claim that the gains come from more coherent, structure-faithful chunks.
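One way such an audit could be tallied is sketched below. It assumes a hypothetical hand-labeled sample in which each block has a gold parent index, a predicted parent index, a page number, and a block type; none of these annotations are reported in the source, and the slice names are this example's own.

```python
def head_accuracy_by_slice(gold_heads, pred_heads, pages, kinds):
    """Count correct parent links per slice as {slice: (correct, total)}.

    A link is filed under "figure_caption" if the child is a caption,
    otherwise under "cross_page" if child and gold parent sit on different
    pages, otherwise under "other". The root (gold head -1) is skipped.
    """
    counts = {"cross_page": [0, 0], "figure_caption": [0, 0], "other": [0, 0]}
    for child, (gold, pred) in enumerate(zip(gold_heads, pred_heads)):
        if gold == -1:
            continue
        if kinds[child] == "figure-caption":
            name = "figure_caption"
        elif pages[child] != pages[gold]:
            name = "cross_page"
        else:
            name = "other"
        counts[name][1] += 1
        counts[name][0] += int(gold == pred)
    return {name: tuple(pair) for name, pair in counts.items()}


# Hypothetical audited sample of five blocks.
gold_heads = [-1, 0, 1, 2, 1]
pred_heads = [-1, 0, 1, 1, 1]
pages = [1, 1, 2, 2, 3]
kinds = ["title", "section-title", "figure", "figure-caption", "paragraph"]
audit = head_accuracy_by_slice(gold_heads, pred_heads, pages, kinds)
```

Reporting the hard slices separately matters because a high overall accuracy can hide a low accuracy on exactly the cross-page and caption links the method is meant to recover.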

Figures

Figures reproduced from arXiv: 2605.18774 by Heuiseok Lim, Jaehyung Seo, Jeongbae Park, Joongmin Shin.

**Figure 1.** Overview of M3DOCDEP. (a) SharedDet (DP+OCR) converts multi-page documents into Global Document Blocks V. (b) A frozen LVLM with SoftROI pooling produces multi-modal block embeddings e_i. (c) A biaffine scorer and MST decoder recover a global document dependency tree T. (d) Structure-Aware Dependency Chunking deterministically converts T into chunks C with section paths and page spans. Notation Across stag…

**Figure 2.** End-to-end qualitative example of M3DOCDEP. (a) A 5-page industrial document is input. (b) The recovered dependency subtree (cropped from full tree T): 1:title → 17:section-title → 19:figure → 20:figure-caption shows the figure-caption binding under the governing section. (c) Structure-aware chunking emits a chunk that keeps the figure crop and its caption together, annotated with the section path and pa…

**Figure 1.** Schematic of photon trajectory in the equatorial plane of a Kerr black hole, parametrized as r = r(φ). The labeled photon trajectory shows the relationship between the impact parameter b, radial distance r, azimuthal angle φ, and bending angle α . . . (b) Structure-based chunking: # | Accurate closed-form trajectories of light around a Kerr black hole using asymptotic approximants ## 2. Light deflection: n…

Original abstract

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.
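The QA numbers in the abstract are given in ANLS (Average Normalized Levenshtein Similarity). As a reading aid, here is a minimal sketch of how ANLS is commonly computed, with the usual threshold of 0.5 and case-insensitive matching. The exact normalization used in the paper's evaluation is not stated in the abstract, so the preprocessing here is an assumption.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, gold_answers, tau=0.5):
    """Average over questions of the best thresholded similarity to any gold answer.

    Similarity is 1 - NL, where NL is the edit distance divided by the longer
    string length; answers with NL >= tau score 0.
    """
    total = 0.0
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            longest = max(len(p), len(g))
            nl = levenshtein(p, g) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


# Hypothetical answers: one exact match up to case, one near miss.
score = anls(["Pump A", "pump b"], [["pump a"], ["pump a", "pump c"]])
```

The threshold means a badly wrong answer earns nothing, while small OCR-style slips are only lightly penalized, which suits answers extracted from scanned industrial documents.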

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

M3DocDep adds a multimodal dependency step before chunking and reports retrieval/QA gains, but without tree accuracy checks the source of those gains stays unclear.

The main takeaway is that this paper builds a pipeline to recover block-level parent-child links across pages and figures using LVLM embeddings, then chunks along the resulting tree. It reports clear lifts under a shared-block setup: STEDS up 28.5 to 39.6 percent on DHP benchmarks, plus smaller but consistent gains in nDCG and ANLS on RAG tasks. That part is straightforward and addresses a real pain point in long industrial documents, where text-only chunkers break cross-page structure or miss caption ties. The technical pieces (boundary-aware SoftROI pooling plus biaffine scoring with MST decoding) look like a sensible way to enforce global tree validity, and the shared preprocessing layer helps isolate the contribution. Credit for shipping concrete numbers on external benchmarks rather than just synthetic examples.

The soft spot is exactly what the stress-test note flags: no intermediate numbers on how faithful the trees actually are. The paper gives no UAS, LAS, section-boundary F1, or tree-edit distance against any gold hierarchies, so we cannot rule out that the gains come mainly from the LVLM embeddings or the SharedDet layer instead of the dependency recovery. The abstract also stays light on baseline details and significance tests, which leaves the central claim plausible but not fully pinned down.

This work sits squarely in the document-processing and RAG engineering corner. A reader who already runs multi-page retrieval pipelines would get practical value from the chunking recipe and the reported deltas. It is coherent on its own terms and shows honest engagement with the multimodal setting, so it deserves a serious referee even if the tree-validation gap needs fixing in revision.

Referee Report

1 major / 1 minor

Summary. The paper proposes M3DocDep, an LVLM-based pipeline that recovers block-level dependencies in multi-page multimodal documents via SharedDet preprocessing, boundary-aware SoftROI pooling, biaffine parent-child scoring, and MST-constrained decoding, then performs tree-guided chunking annotated with section paths and page ranges. It reports relative gains of +28.5 to +39.6% STEDS on DHP benchmarks, +1.1 to +15.3% retrieval nDCG, and +4.5 to +15.3% QA ANLS on corpus-level RAG benchmarks under a shared-block evaluation protocol.

Significance. If the results prove robust, the work could advance RAG chunking for long industrial documents by demonstrating that explicit recovery of cross-page and figure-caption dependencies produces more coherent retrieval units than text-centric or generative hierarchy baselines. The shared-block protocol and multimodal embedding approach are constructive elements for fair comparison.

major comments (1)

[Evaluation / Results] The manuscript reports substantial gains in STEDS, nDCG, and ANLS but supplies no intermediate metrics validating the recovered dependency trees, such as UAS, LAS, section-boundary F1, or tree-edit distance against gold hierarchies on any annotated subset of the DHP or RAG corpora. This is load-bearing for the central claim that the improvements arise from faithful recovery of the document's true logical structure (including cross-page and figure-caption relations); without these diagnostics, the gains could plausibly originate from SharedDet, SoftROI pooling, or the LVLM embeddings alone.
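The diagnostics requested here are cheap to compute once any gold subset exists. The sketch below shows unlabeled and labeled attachment scores plus a simple section-boundary F1. The input layout (one `(head, label)` pair per block, and sets of block indices that start a section) is an assumption made for illustration, and whether the root counts toward the attachment scores is a convention that varies between evaluations.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for lists of (head, label) pairs, one per block.

    UAS counts blocks whose predicted head is correct; LAS additionally
    requires the relation label to match. The root is included here.
    """
    n = len(gold)
    if n == 0:
        return 0.0, 0.0
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    las = sum(g == p for g, p in zip(gold, pred)) / n
    return uas, las


def boundary_f1(gold_starts, pred_starts):
    """F1 over the sets of block indices that open a new section."""
    gold_starts, pred_starts = set(gold_starts), set(pred_starts)
    true_pos = len(gold_starts & pred_starts)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_starts)
    recall = true_pos / len(gold_starts)
    return 2 * precision * recall / (precision + recall)


# Hypothetical four-block sample with relation labels.
gold = [(-1, "root"), (0, "child"), (1, "child"), (2, "caption")]
pred = [(-1, "root"), (0, "child"), (1, "caption"), (1, "caption")]
uas, las = attachment_scores(gold, pred)
f1 = boundary_f1({1, 5, 9}, {1, 5, 7})
```

Reporting these alongside the downstream metrics would let a reader check whether retrieval and QA gains track tree quality or arise elsewhere in the pipeline.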

minor comments (1)

The abstract and methods should explicitly define all acronyms (STEDS, ANLS, nDCG, DHP) at first use and clarify the precise implementation details of the shared-block protocol for each baseline to support reproducibility.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding evaluation of the recovered dependency trees below.

Referee: [Evaluation / Results] The manuscript reports substantial gains in STEDS, nDCG, and ANLS but supplies no intermediate metrics validating the recovered dependency trees, such as UAS, LAS, section-boundary F1, or tree-edit distance against gold hierarchies on any annotated subset of the DHP or RAG corpora. This is load-bearing for the central claim that the improvements arise from faithful recovery of the document's true logical structure (including cross-page and figure-caption relations); without these diagnostics, the gains could plausibly originate from SharedDet, SoftROI pooling, or the LVLM embeddings alone.

Authors: We agree that direct validation of the dependency trees via UAS, LAS, section-boundary F1, or tree-edit distance would help attribute the gains more precisely to structure recovery. However, neither the DHP nor the RAG corpora provide gold-standard block-level dependency annotations or hierarchical labels. Our evaluation therefore relies on downstream metrics (STEDS for chunk coherence, nDCG for retrieval, ANLS for QA) under a shared-block protocol that holds preprocessing and block detection fixed across methods. This design isolates the effect of the biaffine scoring and MST decoding steps. We will expand the revised manuscript with an explicit limitations paragraph and a qualitative error analysis of recovered trees on a small manually inspected sample to address this concern.

Revision: partial

Standing simulated objections (not resolved)

The DHP and RAG corpora lack gold-standard annotations for block-level dependency trees, preventing computation of UAS, LAS, section-boundary F1, or tree-edit distance.

Circularity Check

0 steps flagged

No circularity: empirical gains measured on external benchmarks

The manuscript describes an LVLM pipeline (SharedDet preprocessing, SoftROI embeddings, biaffine scoring, MST decoding, tree-guided chunking) whose headline results are direct performance deltas on independent DHP and RAG corpora under a shared-block protocol. No equations, fitted parameters, or self-referential definitions appear in the provided text; the reported STEDS/nDCG/ANLS lifts are external measurements rather than quantities forced by construction from the method's own inputs. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results are present. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that documents possess recoverable hierarchical block dependencies that can be inferred from visual and textual cues; no new physical entities or ad-hoc constants are introduced beyond standard LVLM components.

axioms (1)

domain assumption: Multi-page documents contain consistent parent-child relations between blocks that are detectable from multimodal features.
Invoked in the description of the dependency scoring and MST decoding steps.

pith-pipeline@v0.9.0 · 5769 in / 1414 out tokens · 29959 ms · 2026-05-21T01:02:15.305283+00:00 · methodology

A. Datasets and Pre-processing Details

All datasets used in our experiments are publicly available research benchmarks. We rely exclusively on open corpora for both hierarchy pa...

[1] [1]

Llava-onevision-1.5: Fully open framework for democratized multimodal training, 2025

Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, and Jiankang Deng. Llava-onevision-1.5: Fully open framework for democratized multimodal training, 2025. 1, 2, 6, 18

work page 2025

[2] [2]

Qwen2.5- vl technical report, 2025

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shi- jie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5- vl technical report...

work page 2025

[3] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005. 6, 13

[4] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marcal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4291–4301, 2019. 6, 13

[5] Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation,

[6] Y. J. Chu and T. H. Liu. On the shortest arborescence of a directed graph. Scientia Sinica, 14(10):1396–1400, 1965. 5, 17

[7] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017. 4

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the 9th International Conference on Learning Representations (ICLR),

[9] Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. In International Conference on Learning Representations (ICLR) Workshop, 2017. arXiv:1611.01734. 2

[10] André V Duarte, João Marques, Miguel Graça, Miguel Freire, Lei Li, and Arlindo L Oliveira. Lumberchunker: Long-form narrative document segmentation. arXiv preprint arXiv:2406.17526, 2024. 2, 6, 14

[11] Jack Edmonds. Optimum branchings. Journal of Research of the National Bureau of Standards, Section B, 71B(4):233–240, 1967. 5, 17

[12] Masato Fujitake. LayoutLLM: Large language model instruction tuning for visually rich document understanding. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10219–10224, Torino, Italia,

[13] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey, 2024. 1, 2

[14] J. Ge, Steve Sun, Joseph Owens, Victor Galvez, O. Gologorskaya, Jennifer C Lai, Mark J Pletcher, and Ki Lai. Development of a liver disease-specific large language model chat interface using retrieval augmented generation. medRxiv, 2023. 1

[15] Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, and Dong Yu. Recurrent chunking mechanisms for long-text machine reading comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6751–6761, 2020. 1, 2, 6, 14

[16] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017. 4

[17] Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. Cuad: An expert-annotated NLP dataset for legal contract review. NeurIPS, 2021. 5, 12

[18] Seongtae Hong, Joong Min Shin, Jaehyung Seo, Taemin Lee, Jeongbae Park, Cho Man Young, Byeongho Choi, and Heuiseok Lim. Intelligent predictive maintenance RAG framework for power plants: Enhancing QA with StyleDFS and domain specific instruction tuning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Tr...

[19] Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS), 20(4):422–446,

[20] CheonSu Jeong. A study on the implementation of generative AI services using an enterprise data-based LLM application architecture. Adv. Artif. Intell. Mach. Learn., 3:1588–1618, 2023. 1

[21] Lei Kang, Rubèn Tito, Ernest Valveny, and Dimosthenis Karatzas. Multi-page document visual question answering using self-attention scoring mechanism. In Document Analysis and Recognition - ICDAR 2024: 18th International Conference, Athens, Greece, August 30–September 4, 2024, Proceedings, Part VI, pages 219–232, Berlin, Heidelberg, 2024. Springer-Verlag. 2

[22] Jordy Van Landeghem, Rafał Powalski, Rubèn Tito, Dawid Jurkiewicz, Matthew Blaschko, Łukasz Borchmann, Mickaël Coustaty, Sien Moens, Michał Pietruszka, Bertrand Ackaert, Tomasz Stanisławek, Paweł Józiak, and Ernest Valveny. Document understanding dataset and evaluation (DUDE). In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pa...

[23] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. 1

[24] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004. 6, 13

[25] Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. Mm-embed: Universal multimodal retrieval with multimodal LLMs, 2025. 18

[26] Jiefeng Ma, Jun Du, Pengfei Hu, Zhenrong Zhang, Jianshu Zhang, Huihui Zhu, and Cong Liu. Hrdoc: dataset and baseline method toward hierarchical reconstruction of document structures. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thir-...

[27] Benjamin Paaßen. Revisiting the tree edit distance and its backtracing: A tutorial. CoRR, abs/1805.06869,

[28] Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S. Nassar, and Peter Staar. Doclaynet: A large human-annotated dataset for document-layout segmentation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3743–3751, New York, NY, USA, 2022. Association for Computing Machinery. 1, 2, 12

[29] Renyi Qu, Ruixuan Tu, and Forrest Sheng Bao. Is semantic chunking worth the computational cost? In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2155–2177, Albuquerque, New Mexico, 2025. Association for Computational Linguistics. 1, 2, 6, 14

[30] Johannes Rausch, Octavio Martinez, Fabian Bissig, Ce Zhang, and Stefan Feuerriegel. Docparser: Hierarchical document structure parsing from renderings. Proceedings of the AAAI Conference on Artificial Intelligence, 35:4328–4338, 2021. 1, 2, 6, 13

[31] Johannes Rausch, Gentiana Rashiti, Maxim Gusev, Ce Zhang, and Stefan Feuerriegel. Dsg: An end-to-end document structure generator. arXiv preprint arXiv:2310.09118, 2023. 1, 2, 6, 13

[32] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, 2009. 18

[33] Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, David Seunghyun Yoon, Ryan A. Rossi, and Franck Dernoncourt. Pdftriage: Question answering over long, structured documents, 2023. 2

[34] Joongmin Shin, Chanjun Park, Jeongbae Park, Jaehyung Seo, and Heuiseok Lim. MultiDocFusion: Hierarchical and multimodal chunking pipeline for enhanced RAG on long industrial documents. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 20996–21015, Suzhou, China, 2025. Association for Computationa...

[35] Seyed Amin Tabatabaei, Sarah Fancher, Michael Parsons, and Arian Askari. Can large language models serve as effective classifiers for hierarchical multi-label classification of scientific documents at industrial scale? In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, pages 163–174, Abu Dhabi, UAE, 202...

[36] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multi-page DocVQA, 2023. 1, 5, 12

[37] Prashant Verma. S2 chunking: A hybrid framework for document segmentation through integrated spatial and semantic analysis, 2025. 15

[38] Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8529–8548,...

[39] Jiawei Wang, Kai Hu, Zhuoyao Zhong, Lei Sun, and Qiang Huo. Detect-order-construct: A tree construction based approach for hierarchical document structure analysis. Pattern Recognition, 156:110836, 2024. 2

[40] Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report, 2024. 18

[41] Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou,...

[42] Hangdi Xing, Changxu Cheng, Feiyu Gao, Zirui Shao, Zhi Yu, Jiajun Bu, Qi Zheng, and Cong Yao. Dochienet: A large and diverse dataset for document hierarchy parsing. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. 1, 2, 5, 12

[43] Hangdi Xing, Changxu Cheng, et al. Dochienet: A large and diverse dataset for document hierarchy parsing. In EMNLP, 2024. 1, 14

[44] Hangdi Xing, Feiyu Gao, Qi Zheng, Zhaoqing Zhu, Zirui Shao, and Ming Yan. Intelligent document parsing: Towards end-to-end document parsing via decoupled content parsing and layout grounding. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 19987–19998, Suzhou, China, 2025. Association for Computational Linguistics. 1, 2

[46] Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li. Financial report chunking for effective retrieval augmented generation, 2024. 1, 6, 15

[47] Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. Instruction tuning for large language models: A survey, 2024. 1

[48] Yizhuo Zhang, Heng Wang, Shangbin Feng, Zhaoxuan Tan, Xiaochuang Han, Tianxing He, and Yulia Tsvetkov. Can LLM graph reasoning generalize beyond pattern memorization? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 2289–2305, Miami, Florida, USA, 2024. Association for Computational Linguistics. 1, 2

[49] Yue Zhang, Zhihao Zhang, Wenbin Lai, Chong Zhang, Tao Gui, Qi Zhang, and Xuanjing Huang. PDF-to-tree: Parsing PDF text blocks into a tree. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 10704–10714, Miami, Florida, USA, 2024. Association for Computational Linguistics. 2

[50] Jihao Zhao, Zhiyuan Ji, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, and Zhiyu Li. Meta-chunking: Learning efficient text segmentation via logical perception, 2024. 2, 6, 14

A. Datasets and Pre-processing Details

All datasets used in our experiments are publicly available research benchmarks. We rely exclusively on open corpora for both hierarchy pa...