Reading Order Inference for Complex Document Layouts

Berat Kurar-Barakat; Daria Vasyutinsky-Shapira; Gal Grudka; Iddo Hakim; Nachum Dershowitz; Omer Ventura; Sharva Gogawale

arxiv: 2607.01018 · v1 · pith:SKR6CTHMnew · submitted 2026-07-01 · cs.CL · cs.AI · cs.CV · cs.DL

Pith reviewed 2026-07-02 12:48 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CV · cs.DL

keywords reading order · document layout · graph-based inference · language models · historical manuscripts · Glossa Ordinaria · path cover · training-free

The pith

A training-free graph method using language model signals recovers reading order in complex wrap-around layouts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to determine the correct sequence of text lines in documents with complicated layouts, such as historical pages where commentaries wrap around a main text in irregular shapes. It builds a graph of possible transitions between lines, scores them using probabilities from language models without any training on the specific documents, and selects the best path with a rule that avoids common greedy errors. This is important because many digitization efforts fail on these interleaved streams, and current standard methods like XY-cut only get half the connections right on such pages. The approach shows strong results on synthetic Glossa layouts and real multi-column documents while remaining stable under page flips.
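For readers unfamiliar with the baseline, the sketch below shows the general idea behind recursive XY-cut: project the line boxes onto each axis, cut at the widest empty gap, and recurse. It is a simplified illustration written for this page, not the PaddleOCR PP-StructureV3 implementation the paper evaluates; the function names, the `min_gap` parameter, and the left-to-right fallback order are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _widest_gap(items, lo, hi):
    """Sort items along one axis and locate the widest empty gap in their projection."""
    ordered = sorted(items, key=lambda it: it[1][lo])
    best_gap, best_at = 0.0, None
    reach = ordered[0][1][hi]  # farthest coordinate covered so far
    for k in range(1, len(ordered)):
        gap = ordered[k][1][lo] - reach
        if gap > best_gap:
            best_gap, best_at = gap, k
        reach = max(reach, ordered[k][1][hi])
    return best_gap, best_at, ordered


def xy_cut(boxes: List[Box], min_gap: float = 1.0) -> List[int]:
    """Return box indices in reading order (simplified recursive XY-cut)."""
    def recurse(items):
        if len(items) <= 1:
            return [i for i, _ in items]
        gap_x, k_x, by_x = _widest_gap(items, 0, 2)
        gap_y, k_y, by_y = _widest_gap(items, 1, 3)
        best = max(gap_x, gap_y)
        if best <= 0 or best < min_gap:
            # No usable cut: fall back to top-to-bottom, left-to-right
            # (an assumption; right-to-left scripts would need the reverse).
            return [i for i, _ in sorted(items, key=lambda it: (it[1][1], it[1][0]))]
        ordered, k = (by_x, k_x) if gap_x >= gap_y else (by_y, k_y)
        return recurse(ordered[:k]) + recurse(ordered[k:])
    return recurse(list(enumerate(boxes)))


# A header above two columns: the header is cut off first, then the columns.
page = [(0, 0, 100, 10), (0, 30, 45, 40), (0, 44, 45, 54),
        (55, 30, 100, 40), (55, 44, 100, 54)]
order = xy_cut(page)
```

Because every cut is a straight line across the whole region, a scheme like this has no way to separate a commentary that wraps around a central text in an L- or U-shaped region, which is the failure mode the paper targets.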

Core claim

Each OCR text line is a node in a directed candidate-transition graph whose edges receive scores from a weighted additive ensemble of causal language model conditional likelihood and BERT next-sentence prediction. The global reading order is recovered as a degree-constrained directed path cover by applying a max-regret inference rule that prioritizes high-opportunity-cost commitments to prevent cascading greedy failures. This framework is shown to recover 95 percent of ground-truth successor edges on synthetic wrap-around Glossa layouts and 88 percent macro edge accuracy on multi-column pages, substantially above the performance of XY-cut and LayoutReader baselines on the same inputs.
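The sketch below illustrates one plausible reading of the max-regret rule described above. The page does not reproduce the paper's exact definition, so the details are assumptions: regret is taken as the gap between a source line's best and second-best feasible successor, "no successor" is valued at a hypothetical `min_score` threshold, and only source-side regret is considered. A plain greedy selector is included for contrast.

```python
from typing import Dict, Tuple

Edge = Tuple[int, int]


def _closes_cycle(succ: Dict[int, int], u: int, v: int) -> bool:
    """True if adding u -> v would form a cycle (includes the self-loop u == v)."""
    node = v
    while node in succ:
        node = succ[node]
    return node == u


def _feasible(scores, succ, pred, min_score):
    """Yield edges that keep out-degree <= 1, in-degree <= 1, and acyclicity."""
    for (u, v), s in scores.items():
        if s > min_score and u not in succ and v not in pred \
                and not _closes_cycle(succ, u, v):
            yield u, v, s


def greedy_path_cover(scores: Dict[Edge, float], min_score: float = 0.0):
    """Baseline: repeatedly commit the highest-scoring feasible edge."""
    succ, pred = {}, {}
    for (u, v), s in sorted(scores.items(), key=lambda kv: -kv[1]):
        if s > min_score and u not in succ and v not in pred \
                and not _closes_cycle(succ, u, v):
            succ[u], pred[v] = v, u
    return succ


def max_regret_path_cover(scores: Dict[Edge, float], min_score: float = 0.0):
    """Commit first where losing the best option would cost the most."""
    succ, pred = {}, {}
    while True:
        by_source = {}
        for u, v, s in _feasible(scores, succ, pred, min_score):
            by_source.setdefault(u, []).append((s, v))
        best = None  # (regret, top score, source, target)
        for u, cands in by_source.items():
            cands.sort(reverse=True)
            top_s, top_v = cands[0]
            runner_up = cands[1][0] if len(cands) > 1 else min_score
            choice = (top_s - runner_up, top_s, u, top_v)
            if best is None or choice > best:
                best = choice
        if best is None:
            return succ  # maps each line to its chosen successor
        _, _, u, v = best
        succ[u], pred[v] = v, u


# Toy "edge theft": line 2 slightly prefers line 1, but line 0 has no good alternative.
toy = {(0, 1): 0.9, (0, 3): 0.1, (2, 1): 0.95, (2, 3): 0.9}
greedy = greedy_path_cover(toy)
regret = max_regret_path_cover(toy)
```

In the toy graph, greedy selection takes the single highest edge (2 to 1) and leaves line 0 with its poor fallback, while the regret-ordered version commits line 0 first and still finds a strong successor for line 2.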

What carries the argument

The max-regret inference rule on the degree-constrained directed path cover of the transition graph scored by the LM ensemble.

If this is right

Recovers 95% of ground-truth successor edges on wrap-around Glossa layouts compared to 50% for XY-cut.
Achieves 88% macro edge accuracy on multi-column OmniDocBench subset versus 75% for XY-cut.
Maintains performance with less than 1 percentage point change under horizontal and vertical reflections.
Avoids cascading edge-theft failures that plague greedy edge selection.
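The percentages above are successor-edge metrics. The page does not give the paper's exact formula, so the sketch below assumes the natural definition: the fraction of ground-truth successor edges that also appear in the prediction, with "macro" meaning an unweighted mean over pages. The function names are illustrative.

```python
from typing import List, Set, Tuple


def successor_edges(paths: List[List[int]]) -> Set[Tuple[int, int]]:
    """Collect (line, next line) pairs from one or more reading streams."""
    return {(a, b) for path in paths for a, b in zip(path, path[1:])}


def edge_accuracy(pred_paths: List[List[int]], gold_paths: List[List[int]]) -> float:
    """Fraction of ground-truth successor edges recovered by the prediction."""
    gold = successor_edges(gold_paths)
    if not gold:
        return 1.0  # nothing to recover on a page with no successor edges
    return len(gold & successor_edges(pred_paths)) / len(gold)


def macro_edge_accuracy(pages) -> float:
    """Unweighted mean of per-page edge accuracy over (pred, gold) pairs."""
    return sum(edge_accuracy(p, g) for p, g in pages) / len(pages)


gold = [[0, 1, 2], [3, 4]]   # two interleaved streams
pred = [[0, 1, 3, 4, 2]]     # one flat order that jumps streams
acc = edge_accuracy(pred, gold)
```

Here the flat prediction recovers two of the three ground-truth edges, so a single wrong jump between streams costs one edge rather than invalidating the whole page.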

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be applied to other document types with interleaved text streams, such as annotated scientific papers.
Combining this with visual features might address cases where text alone is ambiguous.
The method's reliance on pre-trained models suggests it could work across languages if the models are multilingual.

Load-bearing premise

The weighted additive ensemble of causal language model conditional likelihood and BERT next-sentence prediction provides reliable edge scores for reading order without task-specific training on the target layouts.
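A minimal sketch of what a weighted additive ensemble over candidate edges looks like. The two scorers here are deliberately crude stand-ins so the example runs without model downloads; in the paper these roles are played by a causal language model's conditional likelihood and BERT next-sentence prediction. The weights, the stand-in heuristics, and the assumption that both signals are already on comparable scales are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scorer = Callable[[str, str], float]  # (current line, candidate next line) -> score


def score_edges(lines: List[str], candidates: List[Tuple[int, int]],
                clm_score: Scorer, nsp_score: Scorer,
                w_clm: float = 0.5, w_nsp: float = 0.5) -> Dict[Tuple[int, int], float]:
    """Weighted additive combination of two per-edge signals."""
    return {
        (u, v): w_clm * clm_score(lines[u], lines[v]) + w_nsp * nsp_score(lines[u], lines[v])
        for u, v in candidates
    }


def toy_clm(prev: str, nxt: str) -> float:
    """Stand-in for a causal-LM likelihood: rewards mid-sentence continuations."""
    return 1.0 if nxt[:1].islower() and not prev.endswith(".") else 0.0


def toy_nsp(prev: str, nxt: str) -> float:
    """Stand-in for next-sentence prediction: Jaccard overlap of lowercased words."""
    a, b = set(prev.lower().split()), set(nxt.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


lines = ["In the beginning God created",
         "the heaven and the earth.",
         "Commentary: on creation"]
scores = score_edges(lines, [(0, 1), (0, 2)], toy_clm, toy_nsp)
```

The premise under test is whether real language-model signals rank the true continuation above a spatially adjacent line from another stream as reliably as these toy heuristics do on a hand-picked example.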

What would settle it

A test on additional wrap-around layout pages where the method achieves under 70% successor edge recovery while XY-cut exceeds 60% would indicate the claimed advantage does not hold.

Figures

Figures reproduced from arXiv: 2607.01018 by Berat Kurar-Barakat, Daria Vasyutinsky-Shapira, Gal Grudka, Iddo Hakim, Nachum Dershowitz, Omer Ventura, Sharva Gogawale.

**Figure 1.** Two examples of non-Manhattan layouts. Left: a printed Hebrew Bible page, main text flanked by an Aramaic translation, two commentaries wrapping around them, Masorah parva as abbreviated notes in an internal margin, and Masorah magna spanning the top and bottom margins, all automatically (and imperfectly) line-segmented. Right: a page of a manuscript (Codex Bodmer 25) of the Greek Bible with two regions of…

**Figure 2.** Next-column stress test on an ALTO-based graph (5,783 candidate edges).

Original abstract

Reading order inference remains a critical bottleneck in the digitization of complex historical manuscripts, where pages contain multiple spatially interleaved reading streams, the canonical example being the Glossa Ordinaria layout, in which a central text is surrounded by commentaries that wrap around it in non-rectangular, non-convex regions. We present a training-free, graph-based framework: each OCR text line becomes a node in a directed candidate-transition graph, edges are scored by a weighted additive ensemble of two lightweight language-model signals (causal language model conditional likelihood and BERT next-sentence prediction, NSP; a third sentence-embedding signal was evaluated but did not improve reading order), and the global reading order is recovered as a degree-constrained directed path cover. To avoid the cascading "edge-theft" failures of greedy edge selection, we propose a max-regret inference rule that prioritizes commitments with high opportunity cost. We evaluate on synthetic Glossa Ordinaria grid layouts, on 23 ALTO page geometries (10 historical source pages plus mirrored and flipped variants), and on a 140-page multi-column English subset of OmniDocBench, comparing our method against the canonical recursive XY-cut (PaddleOCR PP-StructureV3) and two LayoutReader variants (layout-only and text+layout) on identical inputs. On wrap-around Glossa layouts our method recovers 95% of ground-truth successor edges on average vs. XY-cut's 50%; on the OmniDocBench multi-column subset it reaches 88% macro edge accuracy versus XY-cut's 75% and LayoutReader's 25%. The LayoutReader baselines transfer poorly due to a word-level vs. line-level granularity mismatch. We additionally verify mirror-invariance under horizontal and vertical page reflections: Our method changes by less than 1 percentage point, classical XY-cut by 2 points, and LayoutReader-T by up to 8 points.
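The mirror-invariance check in the abstract can be pictured as reflecting every line's bounding box while leaving its text and ground-truth successors untouched. The sketch below shows that transformation only; how the authors actually produced their mirrored and flipped ALTO variants is not stated on this page, and the naming convention (horizontal means a left-right flip) is an assumption.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def reflect_boxes(boxes: List[Box], page_w: float, page_h: float,
                  horizontal: bool = False, vertical: bool = False) -> List[Box]:
    """Reflect boxes left-right (horizontal) and/or top-bottom (vertical)."""
    out = []
    for x0, y0, x1, y1 in boxes:
        if horizontal:
            x0, x1 = page_w - x1, page_w - x0  # swap so that x0 <= x1 still holds
        if vertical:
            y0, y1 = page_h - y1, page_h - y0
        out.append((x0, y0, x1, y1))
    return out


boxes = [(10, 20, 30, 40)]
flipped_h = reflect_boxes(boxes, 100, 200, horizontal=True)
flipped_v = reflect_boxes(boxes, 100, 200, vertical=True)
```

A method that scores transitions from text should see the same edges before and after reflection, whereas a method with a built-in top-left-first sweep will generally produce a different order on the flipped page.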

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The max-regret rule on the graph path cover is the real addition here, and the Glossa numbers look like a genuine step up from XY-cut.

This paper's main advance is the max-regret inference rule on top of a graph path cover for recovering reading order in non-rectangular layouts. It combines that with simple LM edge scores and shows clear gains on Glossa examples and multi-column pages.

The numbers look good: 95 percent successor edge recovery on the wrap-around cases versus 50 for XY-cut, and 88 percent macro edge accuracy on the OmniDocBench multi-column subset versus 75 for XY-cut and 25 for LayoutReader. The mirror invariance check is a solid addition too, and the point about LayoutReader's granularity mismatch is fair.

The soft spot is the reliance on off-the-shelf English LMs for scoring edges on Latin historical text. The abstract gives no ablations removing the LM terms or correlation analysis on the Glossa pages, so it is hard to know whether the LM signals actually help or if the gains come mostly from the candidate pruning and the regret rule. Ten source pages is also a thin base for the historical claim.

The work is aimed at document digitization folks dealing with complex manuscripts. It has enough new machinery and empirical backing to warrant peer review, though the authors should add those ablations before final submission.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a training-free, graph-based framework for reading order inference in complex document layouts, particularly wrap-around Glossa Ordinaria pages. Nodes are OCR text lines, edges are scored by a weighted ensemble of causal LM conditional likelihood and BERT NSP, and the reading order is recovered as a degree-constrained directed path cover using max-regret inference. Evaluations on synthetic grids, 23 ALTO page geometries (10 historical source pages plus mirrored and flipped variants), and the OmniDocBench multi-column subset show superior performance over the XY-cut and LayoutReader baselines: 95% successor edge recovery on wrap-around Glossa layouts versus 50% for XY-cut, and 88% macro edge accuracy on OmniDocBench versus 75% for XY-cut and 25% for LayoutReader. Mirror-invariance is also verified.

Significance. If the results hold, this work could significantly advance the digitization of historical manuscripts with interleaved layouts by providing a method that does not require task-specific training. The explicit comparison on historical data and robustness checks are strengths. However, the reliance on pre-trained LMs for non-English text is a key assumption that needs validation to fully assess impact.

major comments (2)

[Abstract] The central claim of 95% average successor edge recovery on the 10 historical Glossa pages depends on the weighted additive ensemble of causal-LM conditional likelihood and BERT NSP producing useful edge scores for medieval Latin text. No ablation removing the LM terms, no correlation analysis between LM scores and ground-truth edges, and no language-specific validation on these pages are reported, leaving open whether the gains derive from the LM signals or from spatial candidate pruning plus the max-regret rule.
[Evaluation] The training-free claim on Glossa layouts rests on off-the-shelf English-centric models generalizing to Latin historical manuscripts; the manuscript should supply score-distribution statistics or an LM-ablation result on the historical subset to substantiate that the ensemble contributes signal rather than noise.

minor comments (1)

[Abstract] The sentence noting that the third sentence-embedding signal "did not improve reading order" is terse; a brief quantitative comparison or reason would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The points raised correctly identify the absence of explicit validation for the LM ensemble on the historical Latin data. We address each comment below and will revise the manuscript accordingly.

Referee: [Abstract] The central claim of 95% average successor edge recovery on the 10 historical Glossa pages depends on the weighted additive ensemble of causal-LM conditional likelihood and BERT NSP producing useful edge scores for medieval Latin text. No ablation removing the LM terms, no correlation analysis between LM scores and ground-truth edges, and no language-specific validation on these pages are reported, leaving open whether the gains derive from the LM signals or from spatial candidate pruning plus the max-regret rule.

Authors: We agree that the manuscript lacks an ablation removing the LM terms and a correlation analysis on the Glossa pages. In revision we will add both: an ablation on the 10 historical pages comparing the full ensemble against spatial pruning plus max-regret alone, plus Pearson correlations and score-distribution statistics between LM scores and ground-truth successor edges. These additions will directly test whether the LM signals contribute beyond the graph components.

revision: yes

Referee: [Evaluation] The training-free claim on Glossa layouts rests on off-the-shelf English-centric models generalizing to Latin historical manuscripts; the manuscript should supply score-distribution statistics or an LM-ablation result on the historical subset to substantiate that the ensemble contributes signal rather than noise.

Authors: We concur that explicit LM-ablation results and score statistics on the historical subset are required to support the generalization claim. The revised evaluation section will include the ablation and score-distribution statistics on the 10 Glossa pages as described in the response to the abstract comment. This will substantiate that the ensemble supplies signal rather than noise for the medieval Latin text.

revision: yes
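The correlation analysis promised in the rebuttal could be as simple as a Pearson coefficient between each candidate edge's LM score and a 0/1 indicator of whether that edge is a ground-truth successor (with a binary variable this is the point-biserial correlation). The sketch below is an illustration of that computation on made-up numbers, not the authors' analysis; the function name and sample values are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: one LM score per candidate edge, 1 = true successor.
edge_scores = [0.9, 0.1, 0.8, 0.2]
is_gold = [1, 0, 1, 0]
r = pearson(edge_scores, is_gold)
```

A coefficient near zero on the historical pages would support the referee's concern that the gains come from candidate pruning and the inference rule rather than from the language-model signal.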

Circularity Check

0 steps flagged

No circularity; derivation uses external pre-trained LMs and independent baselines.

The paper's core method constructs a candidate graph from OCR lines, scores edges via off-the-shelf causal LM likelihood and BERT NSP (no task-specific training or fitting on target data), and recovers order via a max-regret path-cover rule. All performance numbers are obtained by direct comparison against published external baselines (XY-cut, LayoutReader) on held-out datasets (Glossa pages, OmniDocBench). No equation reduces a claimed prediction to a fitted parameter by construction, no load-bearing premise rests on self-citation, and the uniqueness of the inference rule is justified by its stated avoidance of greedy edge-theft rather than by prior author theorems. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard graph algorithms and pre-trained models but introduces a new inference rule; no new entities postulated.

free parameters (1)

ensemble weights
Weights for combining causal LM and NSP signals are implied but not detailed in abstract; likely chosen or tuned on development data.
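If the weights were tuned, one simple procedure would be a sweep over the mixing weight on a small set of development pages. The sketch below is purely hypothetical: the page does not say how the weights were set, the per-source argmax used here is a stand-in for the paper's path-cover inference, and the scores are invented to show a case where neither signal alone suffices.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]


def sweep_weights(clm: Dict[Edge, float], nsp: Dict[Edge, float],
                  gold: Set[Edge], steps: int = 10) -> Tuple[float, float]:
    """Return (w, accuracy) for the best w in score = w * clm + (1 - w) * nsp."""
    best = None
    for i in range(steps + 1):
        w = i / steps
        top = {}  # source line -> (best combined score, chosen successor)
        for (u, v), s_clm in clm.items():
            s = w * s_clm + (1 - w) * nsp[(u, v)]
            if u not in top or s > top[u][0]:
                top[u] = (s, v)
        hits = sum((u, v) in gold for u, (_, v) in top.items())
        acc = hits / len(gold)
        if best is None or acc > best[1]:
            best = (w, acc)
    return best


# Invented development data: each signal alone gets one of the two edges wrong.
clm = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.4, (1, 0): 0.6}
nsp = {(0, 1): 0.4, (0, 2): 0.6, (1, 2): 0.9, (1, 0): 0.1}
w, acc = sweep_weights(clm, nsp, gold={(0, 1), (1, 2)})
```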

axioms (1)

Domain assumption: Pre-trained language models supply useful signals for document reading order without domain adaptation.
Central to the edge-scoring step.

pith-pipeline@v0.9.1-grok · 5913 in / 1243 out tokens · 38333 ms · 2026-07-02T12:48:41.849187+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages

[1] Barzilay, R., Lapata, M.: Modeling local coherence: An entity-based approach. Computational Linguistics 34(1), 1–34 (2008). https://doi.org/10.1162/coli.2008.34.1.1

[2] Clausner, C., Pletschacher, S., Antonacopoulos, A.: The significance of reading order in document recognition and its evaluation. In: Proc. 12th Int. Conf. on Document Analysis and Recognition (ICDAR). pp. 688–692 (2013). https://doi.org/10.1109/ICDAR.2013.141

[3] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proc. NAACL-HLT. pp. 4171–4186. ACL (2019). https://doi.org/10.18653/v1/N19-1423

[4] Li, J., Hovy, E.: A model of coherence based on distributed sentence representation. In: Proc. EMNLP. pp. 2039–2048. ACL (2014). https://doi.org/10.3115/v1/D14-1218

[5] Library of Congress: ALTO: Technical metadata for layout and text objects. https://www.loc.gov/standards/alto/ (2022)

[6] Meunier, J.L.: Optimized XY-cut for determining a page reading order. In: Proc. 8th Int. Conf. on Document Analysis and Recognition (ICDAR). pp. 347–351 (2005). https://doi.org/10.1109/ICDAR.2005.182

[7] Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Proc. 7th Int. Conf. on Pattern Recognition (ICPR). vol. 1, pp. 347–349 (1984)

[8] O’Gorman, L.: The document spectrum for page layout analysis. IEEE TPAMI 15(11), 1162–1173 (1993). https://doi.org/10.1109/34.244677

[9] Ouyang, L., Qu, Y., Zhou, H., Zhu, J., et al.: OmniDocBench: Benchmarking diverse pdf document parsing with comprehensive annotations. In: Proc. CVPR (2025)

[10] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. Tech. rep., OpenAI (2019)

[11] Reimers, N., Gurevych, I.: Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In: Proc. EMNLP-IJCNLP. pp. 3982–3992. ACL (2019). https://doi.org/10.18653/v1/D19-1410

[12] Rozenberg, M., Munk, M., Kainan, A.: A Talmud page as a metaphor of a scientific text. Int. J. Qualitative Methods 5(4), 30–44 (2006). https://doi.org/10.1177/160940690600500403

[13] Wang, R., Fujii, Y., Bissacco, A.: Text reading order in uncontrolled conditions by sparse graph segmentation. In: Proc. Int. Conf. on Document Analysis and Recognition (ICDAR). pp. 3–21. Springer (2023). https://doi.org/10.1007/978-3-031-41731-3_1

[14] Wang, Z., Xu, Y., Cui, L., Shang, J., Wei, F.: LayoutReader: Pre-training of text and layout for reading order detection. In: Proc. EMNLP. pp. 4735–4744. ACL (2021). https://doi.org/10.18653/v1/2021.emnlp-main.389

[15] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proc. 26th ACM SIGKDD. pp. 1192–1200 (2020). https://doi.org/10.1145/3394486.3403172
