MoDora: Tree-Based Semi-Structured Document Analysis System
Pith reviewed 2026-05-15 19:07 UTC · model grok-4.3
The pith
A tree structure turns fragmented OCR data into accurate answers for questions on semi-structured documents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoDora converts OCR-parsed fragments into layout-aware components via local-alignment aggregation and type-specific extraction for titles or non-text items. It organizes these components into the Component-Correlation Tree to capture hierarchical relations and layout distinctions through bottom-up cascade summarization. A question-type-aware retrieval step then applies grid partitioning for location queries and LLM-guided pruning for semantic ones, producing answers that respect both structure and content.
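The semantic branch of that retrieval step can be sketched as LLM-guided pruning over the tree: descend from the root, drop any subtree whose summary is judged irrelevant, and collect the surviving leaves. This is an illustrative sketch only; the `is_relevant` oracle stands in for an LLM call, and the node schema is an assumption, not MoDora's actual API.

```python
# Sketch of LLM-guided tree pruning for semantic retrieval.
# `is_relevant` stands in for an LLM relevance judgment; the dict-based
# node schema is a hypothetical stand-in for MoDora's CCTree nodes.
def prune(node, is_relevant, results):
    """Collect leaf contents from subtrees the relevance oracle keeps."""
    if not is_relevant(node["summary"]):
        return  # drop the whole subtree without visiting its children
    if not node["children"]:
        results.append(node["content"])
    for child in node["children"]:
        prune(child, is_relevant, results)

tree = {"summary": "report on model accuracy", "content": "",
        "children": [
            {"summary": "accuracy tables", "content": "Table 1: 61.07%",
             "children": []},
            {"summary": "legal boilerplate", "content": "All rights reserved",
             "children": []},
        ]}
hits = []
prune(tree, lambda s: "accuracy" in s, hits)
```

Because irrelevant subtrees are skipped entirely, the number of relevance judgments grows with the pruned tree rather than with the full document.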
What carries the argument
The Component-Correlation Tree (CCTree), which arranges layout-aware components into a hierarchy and records their relations and layout distinctions via bottom-up cascade summarization.
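The bottom-up cascade can be sketched as a post-order traversal in which each node's summary folds in its children's summaries. Everything here is a hedged illustration: the node fields, the string concatenation standing in for an LLM summarizer, and the example tree are all assumptions, not MoDora's implementation.

```python
from dataclasses import dataclass, field

# Hypothetical CCTree node; field names are illustrative assumptions.
@dataclass
class CCTreeNode:
    component_type: str          # e.g. "title", "table", "sidebar"
    content: str                 # extracted text or a serialized element
    layout_tag: str = "main"     # layout distinction, e.g. main vs sidebar
    children: list["CCTreeNode"] = field(default_factory=list)
    summary: str = ""

def cascade_summarize(node: CCTreeNode) -> str:
    """Bottom-up pass: summarize children first, then fold them into the parent."""
    child_summaries = [cascade_summarize(c) for c in node.children]
    # A real system would call an LLM here; concatenation is a placeholder.
    node.summary = " | ".join([node.content] + child_summaries)
    return node.summary

chapter = CCTreeNode("title", "Chapter 2: Results",
                     children=[CCTreeNode("table", "Table 3: accuracy by dataset"),
                               CCTreeNode("paragraph", "Discussion of Table 3")])
cascade_summarize(chapter)
```

The post-order recursion guarantees every parent summary is built only after its children's summaries exist, which is what lets a chapter-title node describe the tables nested beneath it.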
If this is right
- Questions that require linking a paragraph on one page to table cells elsewhere become answerable because the tree preserves both location and semantic ties.
- Nested structures such as chapter titles containing sub-tables or sidebars remain distinguishable during retrieval.
- Layout-specific distinctions like main content versus side panels improve the precision of location-based retrieval steps.
- End-to-end accuracy on natural-language question answering over semi-structured documents rises by 5.97 to 61.07 percent relative to existing baselines.
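The location-based side of the retrieval step can be sketched as grid partitioning: map each component's bounding-box center to a cell of a coarse page grid, then answer "top-right of the page" queries by cell lookup. The 2x2 grid, the component records, and the US-letter page size are assumptions for illustration, not MoDora's actual parameters.

```python
# Illustrative grid partitioning for location-based retrieval.
# Grid granularity and component schema are hypothetical.
def grid_cell(x, y, page_w, page_h, rows=2, cols=2):
    """Map a point (in page coordinates, origin top-left) to a (row, col) cell."""
    col = min(int(x / page_w * cols), cols - 1)
    row = min(int(y / page_h * rows), rows - 1)
    return row, col

def retrieve_by_location(components, target_cell, page_w=612, page_h=792):
    """Return texts of components whose bounding-box center falls in the cell."""
    hits = []
    for comp in components:
        cx = (comp["x0"] + comp["x1"]) / 2
        cy = (comp["y0"] + comp["y1"]) / 2
        if grid_cell(cx, cy, page_w, page_h) == target_cell:
            hits.append(comp["text"])
    return hits

components = [
    {"text": "sidebar note", "x0": 450, "x1": 600, "y0": 40, "y1": 120},
    {"text": "main paragraph", "x0": 50, "x1": 300, "y0": 400, "y1": 500},
]
# In a 2x2 grid, "top-right" corresponds to cell (0, 1)
top_right = retrieve_by_location(components, (0, 1))
```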
Where Pith is reading between the lines
- The same tree construction could support incremental updates when new pages are added to a live document collection.
- Scaling the cascade summarization to documents hundreds of pages long would test whether coherence is maintained at greater depth.
- Analogous tree organizations might improve LLM performance on other scattered structured sources such as web pages or code repositories.
Load-bearing premise
Local alignment can reassemble fragmented OCR pieces into components that keep their original semantic context and hierarchical links without major loss.
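A minimal sketch of what such local alignment could look like, assuming fragments carry positions: merge OCR fragments that sit on the same text line and are horizontally close, so a phrase split across boxes becomes one component. The greedy rule, thresholds, and field names are all illustrative assumptions, not the paper's algorithm.

```python
# Toy local-alignment aggregation: vertically aligned, horizontally
# adjacent OCR fragments are merged into one component.
# y_tol / x_gap thresholds and the fragment schema are hypothetical.
def aggregate_fragments(fragments, y_tol=4, x_gap=15):
    """Greedily merge left-to-right fragments that share a text line."""
    frags = sorted(fragments, key=lambda f: (f["y"], f["x"]))
    components = []
    for frag in frags:
        last = components[-1] if components else None
        if (last is not None
                and abs(frag["y"] - last["y"]) <= y_tol
                and frag["x"] - last["x_end"] <= x_gap):
            last["text"] += " " + frag["text"]          # same line: extend
            last["x_end"] = frag["x"] + frag["w"]
        else:
            components.append({"text": frag["text"], "y": frag["y"],
                               "x_end": frag["x"] + frag["w"]})
    return [c["text"] for c in components]

fragments = [
    {"text": "Total", "x": 50, "y": 100, "w": 40},
    {"text": "revenue:", "x": 95, "y": 101, "w": 60},
    {"text": "Footnote", "x": 50, "y": 300, "w": 70},
]
```

The premise is exactly that a rule like this (plus type-specific handling) loses little: if fragments land in the wrong component, every downstream tree and retrieval step inherits the error.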
What would settle it
Run the system on a collection of documents whose OCR output is heavily fragmented and whose test questions require exact reconstruction of nested titles or cross-region table references; if accuracy falls to or below baseline levels, the central claim does not hold.
Original abstract
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts. These documents are widely observed across domains and account for a large portion of real-world data. However, existing methods struggle to support natural language question answering over these documents due to three main technical challenges: (1) The elements extracted by techniques like OCR are often fragmented and stripped of their original semantic context, making them inadequate for analysis. (2) Existing approaches lack effective representations to capture hierarchical structures within documents (e.g., associating tables with nested chapter titles) and to preserve layout-specific distinctions (e.g., differentiating sidebars from main content). (3) Answering questions often requires retrieving and aligning relevant information scattered across multiple regions or pages, such as linking a descriptive paragraph to table cells located elsewhere in the document. To address these issues, we propose MoDora, an LLM-powered system for semi-structured document analysis. First, we adopt a local-alignment aggregation strategy to convert OCR-parsed elements into layout-aware components, and conduct type-specific information extraction for components with hierarchical titles or non-text elements. Second, we design the Component-Correlation Tree (CCTree) to hierarchically organize components, explicitly modeling inter-component relations and layout distinctions through a bottom-up cascade summarization process. Finally, we propose a question-type-aware retrieval strategy that supports (1) layout-based grid partitioning for location-based retrieval and (2) LLM-guided pruning for semantic-based retrieval. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy. The code is at https://github.com/weAIDB/MoDora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MoDora, an LLM-powered system for natural language question answering over semi-structured documents. It addresses three challenges—fragmented OCR elements lacking semantic context, inadequate hierarchical and layout representations, and scattered multi-region information—via a local-alignment aggregation strategy to produce layout-aware components (with type-specific extraction), the Component-Correlation Tree (CCTree) for bottom-up hierarchical organization and relation modeling, and a question-type-aware retrieval method combining layout-based grid partitioning with LLM-guided semantic pruning. Experiments report accuracy gains of 5.97%–61.07% over baselines, with code released at https://github.com/weAIDB/MoDora.
Significance. If the performance claims are substantiated with rigorous controls, MoDora could advance practical QA systems for real-world semi-structured documents (reports, forms, scientific papers) by explicitly handling layout distinctions and hierarchies that current OCR+LLM pipelines often lose. The open-source release is a clear strength for reproducibility.
major comments (2)
- [Abstract and §4 (System Overview / Local-Alignment Aggregation)] The headline accuracy improvements (5.97%–61.07%) are attributed to the full pipeline, yet the manuscript provides no ablation isolating the local-alignment aggregation step. Without a controlled comparison (raw OCR element lists vs. aggregated components before CCTree construction), it remains unclear whether this step preserves semantic context and hierarchical relations or merely adds overhead; the central claim that it successfully converts fragmented elements into layout-aware components therefore rests on an untested premise.
- [Experimental Evaluation] The experimental section lacks sufficient detail on baseline definitions, dataset statistics (size, domain, OCR quality), exact metrics, and error analysis. This prevents verification of whether reported gains derive from the CCTree cascade and retrieval heuristics or from differences in prompting volume, model choice, or post-hoc tuning.
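The ablation the first comment asks for can be sketched as a harness that runs the same QA evaluation twice, once on raw OCR element lists and once on aggregated components. The `toy_pipeline`, the single-question dataset, and the string-join "aggregation" are stand-ins for illustration only; they are not MoDora's code or data.

```python
# Hypothetical ablation harness: same QA evaluation with and without
# the aggregation step. Pipeline and dataset schema are stand-ins.
def evaluate(qa_pipeline, questions, use_aggregation):
    correct = 0
    for q in questions:
        elements = q["ocr_elements"]
        if use_aggregation:
            elements = ["".join(elements)]  # placeholder for real aggregation
        if qa_pipeline(q["question"], elements) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Toy pipeline: answers correctly only if the answer span survives intact,
# mimicking how fragmentation can break span-based QA.
def toy_pipeline(question, elements):
    for el in elements:
        if "61.07" in el:
            return "61.07%"
    return "unknown"

questions = [{"question": "Max reported gain?",
              "ocr_elements": ["61", ".07", "%"],
              "answer": "61.07%"}]
acc_raw = evaluate(toy_pipeline, questions, use_aggregation=False)
acc_agg = evaluate(toy_pipeline, questions, use_aggregation=True)
```

Reporting `acc_raw` vs `acc_agg` on the real benchmarks would isolate how much of the headline gain the aggregation step actually contributes.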
minor comments (2)
- [Abstract] The accuracy range 5.97%–61.07% is stated without mapping specific values to particular baselines or datasets; a table or explicit per-baseline breakdown would improve clarity.
- [§3 (CCTree Construction)] Notation for component types and CCTree node attributes is introduced without a consolidated table; readers must infer definitions from scattered prose descriptions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional experiments and details as outlined.
Point-by-point responses
- Referee: [Abstract and §4 (System Overview / Local-Alignment Aggregation)] The headline accuracy improvements (5.97%–61.07%) are attributed to the full pipeline, yet the manuscript provides no ablation isolating the local-alignment aggregation step. Without a controlled comparison (raw OCR element lists vs. aggregated components before CCTree construction), it remains unclear whether this step preserves semantic context and hierarchical relations or merely adds overhead; the central claim that it successfully converts fragmented elements into layout-aware components therefore rests on an untested premise.
Authors: We agree that an isolated ablation of the local-alignment aggregation step would strengthen the claims. In the revised manuscript, we will add a controlled comparison between raw OCR element lists and the aggregated layout-aware components (prior to CCTree construction). This will report effects on component quality, downstream CCTree structure, and final QA accuracy, clarifying the contribution of this step beyond overhead. Revision: yes.
- Referee: [Experimental Evaluation] The experimental section lacks sufficient detail on baseline definitions, dataset statistics (size, domain, OCR quality), exact metrics, and error analysis. This prevents verification of whether reported gains derive from the CCTree cascade and retrieval heuristics or from differences in prompting volume, model choice, or post-hoc tuning.
Authors: We acknowledge the need for greater experimental transparency. In the revision, we will expand §5 to include: precise definitions and prompting details for all baselines; full dataset statistics (document counts, domains, average OCR error rates where measurable); exact metric formulations; and a dedicated error analysis categorizing failure modes. These additions will allow readers to attribute gains more clearly to the proposed components. Revision: yes.
Circularity Check
No circularity: MoDora is an engineered pipeline with no self-referential derivations
full rationale
The paper presents MoDora as a descriptive system architecture consisting of local-alignment aggregation for OCR elements, CCTree construction via bottom-up cascade summarization, and question-type-aware retrieval with grid partitioning and LLM pruning. No equations, fitted parameters, or first-principles derivations appear in the provided text; performance claims rest on experimental comparisons rather than any quantity that reduces to its own inputs by construction. No self-citations are used to justify uniqueness or to smuggle ansatzes, and the central components are externally motivated engineering decisions rather than tautological redefinitions. The derivation chain is therefore a standard pipeline specification and remains self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LLMs can perform effective type-specific information extraction and LLM-guided pruning on document components.
invented entities (1)
- Component-Correlation Tree (CCTree): no independent evidence
Forward citations
Cited by 1 Pith paper
- Semantic Data Processing with Holistic Data Understanding: HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 ...
Reference graph
Works this paper leans on
- [3] DocAgent. 2025. https://github.com/lisun-ai/DocAgent
- [4] m3docrag. 2025. https://github.com/bloomberg/m3docrag
- [5] PyMuPDF 1.26.7. 2025. https://pypi.org/project/PyMuPDF/
- [6] SHTRAG. 2025. https://github.com/Ruiying-Ma/SHTRAG
- [7] pdfplumber 0.11.9. 2026. https://pypi.org/project/pdfplumber/
- [8] Srikar Appalaraju, Bhavan Jasani, Bhargava Urala Kota, Yusheng Xie, and R. Manmatha. 2021. DocFormer: End-to-end transformer for document understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 993–1003.
- [9] Simran Arora, Brandon Yang, Sabri Eyuboglu, Avanika Narayan, Andrew Hojel, Immanuel Trummer, and Christopher Ré. 2023. Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes. Proceedings of the VLDB Endowment 17, 2 (2023), 92–105.
- [10] Yushi Bai, Shangqing Tu, Jiajie Zhang, et al. 2025. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. In ACL (1). Association for Computational Linguistics, 3639–3664.
- [11] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, C. V. Jawahar, and Dimosthenis Karatzas. 2019. Scene text visual question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4291–4301.
- [14] Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. 2025. PaddleOCR 3.0 Technical Report. arXiv:2507.05595 [cs.CV]. https://arxiv.org/abs/2507.05595
- [15] Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. 2025. ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449 [cs.IR]. https://arxiv.org/abs/2407.01449
- [18] Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, and Furu Wei. 2022. LayoutLMv3: Pre-training for document AI with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia. 4083–4091.
- [19] Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, and Aditya G. Parameswaran. 2025. TWIX: Automatically Reconstructing Structured Data from Templatized Documents. CoRR abs/2501.06659 (2025).
- [21] Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, et al. 2025. Palimpzest: Optimizing AI-powered analytics with declarative query processing. In Proceedings of the Conference on Innovative Database Research (CIDR). 2.
- [22] OpenAI. 2025. Introducing GPT-5. Technical Report. https://openai.com/index/introducing-gpt-5/
- [23] Panupong Pasupat and Percy Liang. 2015. Compositional Semantic Parsing on Semi-Structured Tables. In ACL (1). The Association for Computational Linguistics, 1470–1480.
- [24] Jon Saad-Falcon, Joe Barrow, Alexa Siu, Ani Nenkova, Seunghyun Yoon, Ryan A. Rossi, and Franck Dernoncourt. 2024. PDFTriage: Question answering over long, structured documents. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 153–169.
- [25] Li Sun, Liu He, Shuyue Jia, Yangfan He, and Chenyu You. 2025. DocAgent: An agentic framework for multi-modal long-context document understanding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 17712–17727.
- [26] Zhaoze Sun, Chengliang Chai, Qiyan Deng, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, and Lei Cao. 2025. QUEST: Query Optimization in Unstructured Document Analysis. Proceedings of the VLDB Endowment 18, 11 (2025), 4560–4573.
- [27] Zirui Tang, Boyu Niu, Xuanhe Zhou, Boxiu Li, Wei Zhou, Jiannan Wang, Guoliang Li, Xinyi Zhang, and Fan Wu. 2026. ST-Raptor: LLM-Powered Semi-Structured Table Question Answering. Proc. ACM Manag. Data (2026).
- [28] Zineng Tang, Ziyi Yang, Guoxin Wang, Yuwei Fang, Yang Liu, Chenguang Zhu, Michael Zeng, Cha Zhang, and Mohit Bansal. 2023. Unifying vision, text, and layout for universal document processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 19254–19264.
- [29] Qwen Team. 2025. Qwen2.5-VL. https://qwenlm.github.io/blog/qwen2.5-vl/
- [30] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. 2023. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition 144 (2023), 109834.
- [31] Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. 2023. Document Understanding Dataset and Evaluation (DUDE). In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19528–19540.
- [33] Chening Yang, Duy-Khanh Vu, Minh-Tien Nguyen, Xuan-Quang Nguyen, Linh Nguyen, and Hung Le. 2025. SuperRAG: Beyond RAG with layout-aware graph modeling. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track). 544–557.
- [34] Haopeng Zhang, Philip S. Yu, and Jiawei Zhang. 2025. A systematic survey of text summarization: From statistical methods to large language models. Comput. Surveys 57, 11 (2025), 1–41.
- [35] Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv preprint arXiv:2506.05176 (2025).