
arxiv: 2607.00596 · v1 · pith:MKZEWEKUnew · submitted 2026-07-01 · 💻 cs.CV

Semantic-Guided Reading Order Reconstruction in Historical Armenian Newspapers with LLMs

Pith reviewed 2026-07-02 14:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords reading order reconstruction · historical newspapers · Armenian language · large language models · semantic zone detection · document layout analysis · low-resource languages · OCR

The pith

A hybrid of semantic zone detection and generative LLM prompting reconstructs reading order in historical Armenian newspapers with up to 76% fewer errors than geometric baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a 66-page annotated dataset of historical Armenian newspapers and tests geometric heuristics, YOLO layout parsing, an end-to-end model called ECLAIR, and a hybrid pipeline. The hybrid first detects semantic zones, then uses an LLM to determine reading order. It records the lowest error rates across single-page, multi-page, and noisy-OCR conditions. The method is framed explicitly as a bootstrapping tool for rapid annotation rather than a production system, and the authors also release a specialized Tesseract OCR model for historical Armenian print.

Core claim

Our hybrid method achieves the lowest error rates of all evaluated approaches, reducing ordering errors by up to 76% over the strongest geometric baseline, and remains robust in multi-page settings and under noisy OCR. Rather than targeting production, the method is designed as a data bootstrapping strategy enabling rapid annotation in highly under-resourced scenarios.

What carries the argument

The hybrid pipeline that pairs semantic zone detection with generative LLM prompting to infer reading order from a small set of annotations.
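
The Figure 4 caption describes the pipeline as zone detection, a local sort inside each zone, and then LLM pairwise comparisons restricted to the first and last paragraphs of each zone. A minimal Python sketch of that last stage follows, assuming the pairwise decisions are aggregated with a topological sort (the source does not say how they are aggregated). `Zone`, `order_zones`, and `toy_precedes` are hypothetical names, and `toy_precedes` is a crude text heuristic standing in for the LLM call. Requires Python 3.9 or later for `graphlib`.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from itertools import combinations
from typing import Callable, List


@dataclass
class Zone:
    zone_id: str
    paragraphs: List[str]  # assumed non-empty and already sorted inside the zone


def toy_precedes(tail: str, head: str) -> bool:
    """Stand-in for the LLM call (assumption, not the paper's prompt).

    Returns True when `tail` looks unfinished and `head` looks like a continuation.
    """
    return not tail.rstrip().endswith(".") and head.lstrip()[:1].islower()


def order_zones(zones: List[Zone], precedes: Callable[[str, str], bool]) -> List[str]:
    """Order zones from pairwise decisions on (last paragraph, first paragraph)."""
    sorter = TopologicalSorter()
    for zone in zones:
        sorter.add(zone.zone_id)  # register every zone, even unconstrained ones
    for a, b in combinations(zones, 2):
        forward = precedes(a.paragraphs[-1], b.paragraphs[0])
        backward = precedes(b.paragraphs[-1], a.paragraphs[0])
        if forward and not backward:
            sorter.add(b.zone_id, a.zone_id)  # a must come before b
        elif backward and not forward:
            sorter.add(a.zone_id, b.zone_id)  # b must come before a
        # contradictory or empty decisions add no constraint
    return list(sorter.static_order())  # raises graphlib.CycleError on cycles


zones = [
    Zone("C", ["and ends in a third."]),
    Zone("A", ["The story starts in one"]),
    Zone("B", ["column, continues through a second"]),
]
print(order_zones(zones, toy_precedes))  # ['A', 'B', 'C']
```

Comparing only the boundary paragraphs keeps the number of comparator calls quadratic in the number of zones rather than in the number of paragraphs, which matters when each call is an LLM request.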

Load-bearing premise

Semantic zone detection followed by generative LLM prompting can reliably infer correct reading order from limited annotations without extensive domain-specific fine-tuning or additional labeled examples beyond the 66-page set.

What would settle it

Applying the hybrid method to a fresh collection of historical Armenian newspaper pages and finding that its ordering error rate is no lower than the strongest geometric baseline would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2607.00596 by Chahan Vidal-Gorène (CJM, LIPN), Nadi Tomeh (LIPN), Victoria Khurshudyan (Inalco, SeDyL).

Figure 1
Figure 1: Task 1, ex. 1: two-column page, read column by column (zones 1–3 down column 1, then zones 4–6 down column 2). Task 1, ex. 2: two upper columns (zones 1–2, then 3–4) separated by a horizontal rule from a bottom row (zones 5–6 read left-to-right). Task 2: an article begins on page 1 (zones 1–6, column-major order) and continues on page 2 as zone 9, below a horizontal separator and unrelated zones 7–8 (dash…
Figure 2
Figure 2: VGSL OCR architecture with a CTC decoder. This compact model is well-suited to our ∼15,000-line training set; the stacked LSTM layers implicitly capture local sequential dependencies (a lightweight LM substitute). Absolute CER reductions range from 8 to 35 percentage points (…
Figure 3
Figure 3: Example of LLM prompt used to infer paragraph reading order from OCR output. We benchmark ECLAIR [6] alongside our proposed SD + Generative LLM pipeline (…
Figure 4
Figure 4: Proposed SD + Generative LLM pipeline: SD detection, intra-SD Local Topological Sort, then LLM pairwise comparison restricted to the first and last paragraphs of each SD. Dashed red arrows are shown only between SD1–SD3 for readability; all SD pairs are compared. 5 Results and discussions. 5.1 Metrics. Following Quirós & Vidal [10], we use Kendall’s tau distance τ (number of pairwise inversions) and Spearm…
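
The caption text above names Kendall's tau distance, described there as the number of pairwise inversions. A minimal Python sketch of that count follows, using the two-column layout from Figure 1 as a toy case. The function names and the normalized variant are illustrative and are not taken from the paper, whose exact metric definitions are not visible in this excerpt.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau_distance(reference: Sequence[Hashable], predicted: Sequence[Hashable]) -> int:
    """Count pairs that appear in opposite relative order in the two sequences."""
    if len(set(reference)) != len(reference) or sorted(map(str, reference)) != sorted(map(str, predicted)):
        raise ValueError("both sequences must contain the same unique items")
    rank = {item: position for position, item in enumerate(predicted)}
    return sum(1 for a, b in combinations(reference, 2) if rank[a] > rank[b])


def normalized_kendall_tau(reference: Sequence[Hashable], predicted: Sequence[Hashable]) -> float:
    """Scale the inversion count to [0, 1] by the number of item pairs."""
    n = len(reference)
    total_pairs = n * (n - 1) // 2
    return kendall_tau_distance(reference, predicted) / total_pairs if total_pairs else 0.0


# Reference: column-major reading of a two-column page (zones 1-3, then 4-6).
reference = [1, 2, 3, 4, 5, 6]
# Prediction: a row-by-row misreading that alternates between the two columns.
predicted = [1, 4, 2, 5, 3, 6]

print(kendall_tau_distance(reference, predicted))    # 3
print(normalized_kendall_tau(reference, predicted))  # 0.2
```

The three inverted pairs are (2, 4), (3, 4), and (3, 5); every other pair keeps its relative order.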
Original abstract

This paper addresses reading order reconstruction in historical Armenian newspapers, which combine complex layouts with limited language resources. We introduce a new annotated dataset of 66 pages and compare geometric heuristics, YOLO-based layout parsing, an end-to-end document model ECLAIR, and a hybrid method combining semantic zone detection with a generative LLM. Our hybrid method achieves the lowest error rates of all evaluated approaches, reducing ordering errors by up to 76% over the strongest geometric baseline, and remains robust in multi-page settings and under noisy OCR. Rather than targeting production, the method is designed as a data bootstrapping strategy enabling rapid annotation in highly under-resourced scenarios. Alongside the dataset, we release a specialized Tesseract OCR model for historical Armenian print.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces a new 66-page annotated dataset of historical Armenian newspapers and compares geometric heuristics, YOLO-based layout parsing, the ECLAIR end-to-end model, and a hybrid semantic zone detection plus generative LLM approach for reading order reconstruction. It claims the hybrid method yields the lowest error rates, reducing ordering errors by up to 76% relative to the strongest geometric baseline, while remaining robust for multi-page documents and noisy OCR; the work is framed as a bootstrapping strategy for annotation in low-resource settings and includes release of a specialized Tesseract OCR model.

Significance. If the reported gains prove reliable under proper validation, the contribution would be useful for bootstrapping annotations in under-resourced historical document processing, particularly for non-Latin scripts, and the released dataset plus OCR model would constitute reusable resources.

major comments (3)
  1. [Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'up to 76% error reduction' is presented without any description of the evaluation protocol, metric definitions (e.g., how ordering errors are counted), train/test splits, number of LLM sampling runs, or statistical significance tests. This directly undermines assessment of the hybrid method's superiority.
  2. [Dataset and Experiments] Dataset and Experiments: the 66-page corpus is evaluated without reported cross-validation, variance across splits, or multiple random seeds; given the small size and potential shared layout/OCR characteristics, this leaves the 76% reduction vulnerable to sampling artifacts and prevents confirmation that the hybrid advantage generalizes.
  3. [Experiments / Results] Robustness claims: statements that the method 'remains robust in multi-page settings and under noisy OCR' are unsupported by per-regime error tables, ablation numbers, or separate breakdowns, making these assertions load-bearing for the overall contribution but currently unverified.
minor comments (1)
  1. [Methods] Methods section: the geometric heuristics and YOLO-based parsing baselines would benefit from explicit pseudocode or parameter settings to allow exact reproduction.
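
As an illustration of the kind of pseudocode the minor comment asks for, here is a minimal Python sketch of one possible geometric heuristic: group zone boxes into columns by their left edge, then read columns left to right and each column top to bottom. It is not the paper's baseline, whose rules and parameters are not given in this excerpt; `Box`, `column_major_order`, and the `gap_ratio` threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    zone_id: int
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def column_major_order(boxes: List[Box], gap_ratio: float = 0.5) -> List[int]:
    """Return zone ids in column-major order (assumed heuristic, not the paper's)."""
    if not boxes:
        return []
    by_left_edge = sorted(boxes, key=lambda box: box.x)
    columns: List[List[Box]] = [[by_left_edge[0]]]
    for box in by_left_edge[1:]:
        anchor = columns[-1][0]
        # Same column if the left edge is close to the column's first box.
        if box.x - anchor.x <= gap_ratio * anchor.w:
            columns[-1].append(box)
        else:
            columns.append([box])
    order: List[int] = []
    for column in columns:
        order.extend(box.zone_id for box in sorted(column, key=lambda box: box.y))
    return order


# Toy two-column page, given in shuffled order.
page = [
    Box(5, 110, 100, 100, 90), Box(2, 0, 100, 100, 90), Box(4, 110, 0, 100, 90),
    Box(1, 0, 0, 100, 90), Box(6, 110, 200, 100, 90), Box(3, 0, 200, 100, 90),
]
print(column_major_order(page))  # [1, 2, 3, 4, 5, 6]
```

A rule this simple ignores horizontal separators and full-width rows, so a layout like the second example in Figure 1 would need additional handling.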

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional detail is needed to substantiate our claims. We agree that the evaluation protocol, dataset handling, and robustness assertions require more explicit support and will revise the manuscript accordingly. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and Evaluation section: the central claim of 'up to 76% error reduction' is presented without any description of the evaluation protocol, metric definitions (e.g., how ordering errors are counted), train/test splits, number of LLM sampling runs, or statistical significance tests. This directly undermines assessment of the hybrid method's superiority.

    Authors: We agree that the abstract and Evaluation section lack sufficient detail on the protocol. In the revised manuscript we will expand the Evaluation section to define the ordering error metric explicitly, describe the train/test split procedure, report the number of LLM sampling runs, and include statistical significance tests for the reported reductions.

    revision: yes

  2. Referee: [Dataset and Experiments] Dataset and Experiments: the 66-page corpus is evaluated without reported cross-validation, variance across splits, or multiple random seeds; given the small size and potential shared layout/OCR characteristics, this leaves the 76% reduction vulnerable to sampling artifacts and prevents confirmation that the hybrid advantage generalizes.

    Authors: We acknowledge the concern given the modest corpus size. In revision we will add results from multiple random seeds with variance statistics and discuss the risk of shared layout characteristics as a limitation; full k-fold cross-validation remains impractical for this scale but the added seed-level reporting will mitigate sampling concerns.

    revision: partial

  3. Referee: [Experiments / Results] Robustness claims: statements that the method 'remains robust in multi-page settings and under noisy OCR' are unsupported by per-regime error tables, ablation numbers, or separate breakdowns, making these assertions load-bearing for the overall contribution but currently unverified.

    Authors: We agree the robustness statements require supporting data. The revised Experiments section will include per-regime error tables for multi-page documents and noisy OCR conditions together with relevant ablation numbers to verify these claims.

    revision: yes
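
The responses above promise significance tests and variance reporting without naming a procedure. One option that suits a small page set is a paired bootstrap over pages, sketched below in Python. This is an assumption about what such a test could look like, not the authors' protocol; `paired_bootstrap_ci` is a hypothetical helper and the per-page error counts are invented toy values, not results from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    hybrid: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile confidence interval for the mean per-page error difference."""
    if len(baseline) != len(hybrid) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs = [b - h for b, h in zip(baseline, hybrid)]  # positive = hybrid better
    resampled_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = resampled_means[int(alpha / 2 * n_resamples)]
    upper = resampled_means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Invented per-page inversion counts for illustration only.
baseline_errors = [12, 9, 15, 7, 11]
hybrid_errors = [4, 3, 6, 5, 2]

lower, upper = paired_bootstrap_ci(baseline_errors, hybrid_errors)
print(f"95% CI for mean per-page improvement: [{lower:.2f}, {upper:.2f}]")
```

An interval that excludes zero suggests the per-page improvement is not an artifact of which pages happened to be sampled, although it says nothing about pages drawn from other newspapers or periods.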

Circularity Check

0 steps flagged

No circularity: purely empirical method comparison on held-out annotations

full rationale

The paper introduces a 66-page annotated dataset and reports error rates for geometric baselines, YOLO, ECLAIR, and a hybrid semantic+LLM pipeline. No equations, fitted parameters, or derivations appear; the 76% error reduction is a direct measured difference on the provided annotations rather than a quantity forced by construction from any input. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The evaluation is therefore self-contained against external benchmarks (the released dataset and OCR model).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no mathematical model, free parameters, axioms, or invented entities are described.



Reference graph

Works this paper leans on

21 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1] Bizais-Lillig, M., Vidal-Gorène, C., Dupin, B.: Optimizing HTR and reading order strategies for Chinese imperial editions with few-shot learning. In: International Conference on Document Analysis and Recognition. pp. 37–56. Springer (2024)

  2. [2] Blecher, L., Cucurull, G., Scialom, T., Stojnic, R.: Nougat: Neural optical understanding for academic documents (2023), https://arxiv.org/abs/2308.13418

  3. [3] Breuel, T.M.: High performance document layout analysis. In: Proceedings of the Symposium on Document Image Understanding Technology. vol. 5 (2003)

  4. [4] Breuel, T.M.: Layout analysis based on text line segment hypotheses. In: 3rd Int. Workshop on Document Layout Interpretation and its Applications (DLIA2003). pp. 25–30 (2003)

  5. [5] Chagué, A., Clérice, T., Pinche, A., Kiessling, B., Stokes, P., Romary, L., Hodel, T., Kermorvant, C., Gabay, S., Gille Levenson, M., Brisville-Fertin, O., Vlachou-Efstathiou, M., Guénette, M., von Stockhausen, A., Verstraete, M., Chauhan, R., Bizais-Lillig, M., Vidal-Gorène, C., Kasparian, A., Tanelian, A., Ohanian, A., Lucas, N., Perrier, A., Salah, C....

  6. [6] Karmanov, I., Deshmukh, A.S., Vögtle, L., Fischer, P., Chumachenko, K., Roman, T., Seppänen, J., Parmar, J., Jennings, J., Tao, A., et al.: Éclair – extracting content and layout with integrated reading order for documents. arXiv preprint arXiv:2502.04223 (2025)

  7. [7] Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M., Zettlemoyer, L.: Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726–742 (2020)

  8. [8] Lv, T., Huang, Y., Chen, J., Zhao, Y., Jia, Y., Cui, L., Ma, S., Chang, Y., Huang, S., Wang, W., Dong, L., Luo, W., Wu, S., Wang, G., Zhang, C., Wei, F.: Kosmos-2.5: A multimodal literate model (2024), https://arxiv.org/abs/2309.11419

  9. [9] Quirós, L.: Multi-task handwritten document layout analysis. arXiv preprint arXiv:1806.08852 (2018)

  10. [10] Quirós, L., Vidal, E.: Reading order detection on handwritten documents. Neural Computing and Applications 34(12), 9593–9611 (2022)

  11. [11] Smith, R.W.: Hybrid page layout analysis via tab-stop detection. In: 2009 10th International Conference on Document Analysis and Recognition. pp. 241–245. IEEE (2009)

  12. [12] Sven, N.M., Matteo, R.: Page layout analysis of text-heavy historical documents: a comparison of textual and visual approaches. arXiv preprint arXiv:2212.13924 (2022)

  13. [13] Vidal-Gorène, C., Camps, J.B.: Image-to-image translation approach for page layout analysis and artificial generation of historical manuscripts. In: International Conference on Document Analysis and Recognition. pp. 140–158. Springer (2024)

  14. [14] Vidal-Gorène, C., Decours-Perez, A., Kasparian, A., Tanelian, A., Ohanian, A.: Armenian HTR: State of the art, transcription guidelines and good practices (2025)

  15. [15] Vidal-Gorène, C., Dupin, B., Decours-Perez, A., Riccioli, T.: A modular and automated annotation platform for handwritings: evaluation on under-resourced languages. In: Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part III. pp. 507–522. Springer (2021)

  16. [16] Wang, Z., Xu, Y., Cui, L., Shang, J., Wei, F.: LayoutReader: Pre-training of text and layout for reading order detection. arXiv preprint arXiv:2108.11591 (2021)

  17. [17] Wei, H., Liu, C., Chen, J., Wang, J., Kong, L., Xu, Y., Ge, Z., Zhao, L., Sun, J., Peng, Y., et al.: General OCR theory: Towards OCR-2.0 via a unified end-to-end model (2024)

  18. [18] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 1192–1200 (2020)

  19. [19] Zhao, Z., Kang, H., Wang, B., He, C.: DocLayout-YOLO: Enhancing document layout analysis through diverse synthetic data and global-to-local adaptive perception. arXiv preprint arXiv:2410.12628 (2024)

  20. [20] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Computer Vision (ICCV), 2017 IEEE International Conference on (2017)