
arxiv: 2511.18329 · v3 · submitted 2025-11-23 · 💻 cs.CV

SciPostLayoutTree: A Dataset for Structural Analysis of Scientific Posters

Pith reviewed 2026-05-17 06:01 UTC · model grok-4.3

classification 💻 cs.CV
keywords scientific posters · layout analysis · reading order · parent-child relations · dataset · computer vision · structure prediction

The pith

A new dataset of about 8000 posters with reading order and parent-child annotations, paired with a Layout Tree Decoder, improves accuracy on spatially challenging relations like upward and long-distance ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs SciPostLayoutTree, a dataset of approximately 8000 scientific posters annotated with reading order and parent-child relations to address the gap in structural analysis that has mostly targeted papers instead of posters. The dataset deliberately includes more instances of difficult spatial configurations such as upward, horizontal, and long-distance relations. The authors introduce the Layout Tree Decoder, which combines visual features with bounding box information on position and category, and applies beam search to ensure sequence-level plausibility when predicting relations. Experiments indicate that this approach raises accuracy specifically on the spatially hard cases while setting an initial baseline for the task. The work supports future structure-aware interfaces that help users extract research content more reliably from posters.
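
To make the notion of upward, horizontal, and long-distance relations concrete, the sketch below labels a relation between two bounding boxes from their centers. The box format (x, y, width, height, with y growing downward), the rule that compares horizontal and vertical offsets, and the `long_frac` threshold (a fraction of the poster diagonal) are illustrative assumptions, not the paper's definitions.

```python
import math


def classify_relation(src, dst, poster_w, poster_h, long_frac=0.5):
    """Label the relation src -> dst by direction and distance.

    Boxes are (x, y, w, h) with the origin at the top-left and y growing
    downward. The direction rule and `long_frac` are hypothetical choices.
    """
    # Centers of the source and destination boxes.
    sx, sy = src[0] + src[2] / 2, src[1] + src[3] / 2
    tx, ty = dst[0] + dst[2] / 2, dst[1] + dst[3] / 2
    dx, dy = tx - sx, ty - sy

    if abs(dx) > abs(dy):
        direction = "horizontal"
    elif dy < 0:
        direction = "upward"
    else:
        direction = "downward"

    # A relation counts as long-distance when the center-to-center distance
    # exceeds the chosen fraction of the poster diagonal.
    diagonal = math.hypot(poster_w, poster_h)
    is_long = math.hypot(dx, dy) > long_frac * diagonal
    return direction, is_long


# A box near the bottom pointing to a box at the top of a 1000x1000 poster.
print(classify_relation((0, 800, 100, 100), (0, 0, 100, 100), 1000, 1000))
```

Under these assumptions, a relation whose target sits far above its source is labeled both upward and long-distance, which is the kind of case the dataset is said to contain more often than earlier datasets.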

Core claim

The central claim is that SciPostLayoutTree supplies a large collection of posters with reading order and parent-child annotations that feature more spatially challenging relations than earlier datasets, and that the Layout Tree Decoder, by integrating visual features with bounding box position and category details plus beam search, raises prediction accuracy for those relations and supplies a working baseline for poster structure analysis.

What carries the argument

The argument rests on the Layout Tree Decoder, which fuses visual features with bounding box position and category information and applies beam search to select relation sequences that are plausible as a whole.
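
The following is a generic beam search sketch, not the authors' decoder. It assumes a hypothetical matrix `log_p`, where `log_p[i][j]` is a model score for "element j is read right after element i", and it illustrates why keeping several partial sequences can beat a greedy choice when each element may be visited only once.

```python
import math


def beam_search_order(log_p, start=0, beam_width=3):
    """Return (best_order, score) under a successor log-probability matrix.

    `log_p`, `start`, and `beam_width` are illustrative names. Each element
    is visited exactly once, so an early choice constrains later ones.
    """
    n = len(log_p)
    beams = [([start], 0.0)]
    for _ in range(n - 1):
        candidates = []
        for order, score in beams:
            last = order[-1]
            used = set(order)
            for j in range(n):
                if j not in used:
                    candidates.append((order + [j], score + log_p[last][j]))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy scores: the locally best first step (0 -> 1) forces a poor last step.
p = [[0.0, 0.6, 0.4],
     [0.0, 0.0, 0.1],
     [0.0, 0.9, 0.0]]
log_p = [[math.log(v) if v > 0 else float("-inf") for v in row] for row in p]

greedy_order, _ = beam_search_order(log_p, beam_width=1)
beam_order, _ = beam_search_order(log_p, beam_width=2)
print(greedy_order, beam_order)
```

With a beam width of 1 the search is greedy and commits to the locally best first step; with a width of 2 it recovers the sequence with the higher total score. This is the sense in which beam search captures sequence-level plausibility.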

If this is right

  • More accurate extraction of logical structure from posters that use non-standard layouts.
  • Support for interfaces that present poster content in the correct sequence for readers.
  • A measurable baseline against which future models for poster parsing can be compared.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could enable automatic conversion of posters into accessible text or audio summaries that respect original reading order.
  • Similar annotation schemes and decoding methods might transfer to analyzing conference slides or scientific infographics.
  • Standardized structural labels could help train models that compare how different researchers visually organize the same findings.

Load-bearing premise

The human annotations for reading order and parent-child relations are consistent across posters and correctly capture the intended structure.

What would settle it

A replication study in which independent annotators relabel a subset of the posters and show low agreement on reading order or parent-child links, or in which the proposed decoder fails to outperform simpler baselines on the subset of upward, horizontal, and long-distance relations.
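
A replication of this kind needs an agreement measure. The sketch below shows two simple candidates: the fraction of elements given the same parent by two annotators, and Kendall's tau between two reading orders. The data layout (a parent map and an ordered list of element ids) is an assumption for illustration, and the paper may use different measures.

```python
from itertools import combinations


def parent_agreement(parents_a, parents_b):
    """Fraction of shared elements assigned the same parent by both annotators.

    Each argument maps element id -> parent id (None for the root).
    """
    ids = parents_a.keys() & parents_b.keys()
    if not ids:
        return 0.0
    same = sum(parents_a[i] == parents_b[i] for i in ids)
    return same / len(ids)


def kendall_tau(order_a, order_b):
    """Kendall rank correlation between two reading orders.

    Assumes both lists contain the same element ids, each exactly once.
    """
    rank_a = {e: r for r, e in enumerate(order_a)}
    rank_b = {e: r for r, e in enumerate(order_b)}
    pairs = list(combinations(order_a, 2))
    if not pairs:
        return 1.0
    # A pair is concordant when both annotators order it the same way.
    concordant = sum(
        (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) > 0 for x, y in pairs
    )
    return (2 * concordant - len(pairs)) / len(pairs)
```

A tau near 1 means the two annotators read the poster in nearly the same order, while values near 0 or below would indicate the low agreement described above.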

Figures

Figures reproduced from arXiv: 2511.18329 by Atsushi Hashimoto, Shohei Tanaka, Yoshitaka Ushiku.

Figure 1: Example from SciPostLayoutTree. Each arrow denotes ...
Figure 2: Distributions of tree depth, tree width, and number of children per node. SciPostLayoutTree (blue) and DocHieNet (orange ...
Figure 3: Reading order heatmaps by direction and distance.
Figure 5: Overview of Layout Tree Decoder. We extend DRGG ...
Figure 9: Distribution of parent-child prediction failures for each ...
Figure 8: Distribution of reading order prediction failures for each ...
Figure 12: Heatmap of reading order frequencies in SciPost ...
Figure 13: Heatmap of parent-child relation frequencies in Sci ...
Figure 14: Example of a poster with the GT annotation and the predicted trees. The predicted trees received low STEDS (42.70).
Figure 15: Example of a poster with the GT annotation and the predicted trees. The predicted trees received low STEDS (25.71).
Figure 16: Example of a poster with the GT annotation and the predicted trees. The predicted trees received low STEDS (46.67).
Figure 17: Example of a poster with the GT annotation and the predicted trees. The predicted trees received low STEDS (53.68).
Figure 18: Effect of beam width on STEDS performance
Figure 19: Effect of beam width on REDS performance
Figure 21: Histogram of STEDS with a bin width of 10
Figure 22: Histogram of REDS with a bin width of 10
Figure 23: Histogram of TED with a bin width of 10

Original abstract

Scientific posters play a vital role in academic communication by presenting ideas through visual summaries. Analyzing reading order and parent-child relations of posters is essential for building structure-aware interfaces that facilitate clear and accurate understanding of research content. Despite their prevalence in academic communication, posters remain underexplored in structural analysis research, which has primarily focused on papers. To address this gap, we constructed SciPostLayoutTree, a dataset of approximately 8,000 posters annotated with reading order and parent-child relations. Compared to an existing structural analysis dataset, SciPostLayoutTree contains more instances of spatially challenging relations, including upward, horizontal, and long-distance relations. As a solution to these challenges, we develop Layout Tree Decoder, which incorporates visual features as well as bounding box features including position and category information. The model also uses beam search to predict relations while capturing sequence-level plausibility. Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations and establishes a solid baseline for poster structure analysis. The dataset is publicly available at https://huggingface.co/datasets/omron-sinicx/scipostlayouttree. The code is also publicly available at https://github.com/omron-sinicx/scipostlayouttree.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SciPostLayoutTree, a dataset of approximately 8,000 scientific posters annotated with reading order and parent-child relations. It emphasizes that this dataset features more spatially challenging relations (upward, horizontal, and long-distance) compared to existing structural analysis datasets primarily focused on papers. The authors introduce the Layout Tree Decoder model, which integrates visual features along with bounding box features including position and category information, and employs beam search to predict relations while accounting for sequence-level plausibility. Experimental results are reported to demonstrate improved prediction accuracy for these challenging relations, establishing a baseline for poster structure analysis. The dataset and code are made publicly available.

Significance. This contribution addresses an important gap in computer vision and document analysis research by focusing on the structural analysis of scientific posters, an area that has received less attention than paper layout analysis. The public release of the dataset and code supports reproducibility and enables future work on structure-aware interfaces for academic communication. The approach of using beam search to capture global consistency in relation predictions is a sensible way to handle the complexities of poster layouts. If the annotations are shown to be reliable, the dataset could become a standard benchmark for this task.

major comments (2)
  1. [Abstract] The abstract states that 'Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations' but provides no specific numeric metrics, details on baselines, or error analysis. This makes it challenging to evaluate the practical significance of the claimed improvements.
  2. [Dataset Construction] The section on dataset construction does not provide details on the annotation protocol, guidelines given to annotators, or any inter-annotator agreement metrics for the reading order and parent-child relations. Given that the central claims rely on the accuracy of these human annotations as ground truth—particularly for ambiguous spatial layouts in posters—this omission is load-bearing and requires clarification to substantiate the evaluation results.
minor comments (1)
  1. [Introduction] The introduction could benefit from additional citations to recent works on document layout analysis to strengthen the motivation for focusing on posters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of the significance of SciPostLayoutTree. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

  1. Referee: [Abstract] The abstract states that 'Experimental results demonstrate that our model improves the prediction accuracy for spatially challenging relations' but provides no specific numeric metrics, details on baselines, or error analysis. This makes it challenging to evaluate the practical significance of the claimed improvements.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised version, we will update the abstract to report specific accuracy improvements on upward, horizontal, and long-distance relations relative to the baselines, along with a concise reference to the evaluation protocol.
    Revision: yes

  2. Referee: [Dataset Construction] The section on dataset construction does not provide details on the annotation protocol, guidelines given to annotators, or any inter-annotator agreement metrics for the reading order and parent-child relations. Given that the central claims rely on the accuracy of these human annotations as ground truth—particularly for ambiguous spatial layouts in posters—this omission is load-bearing and requires clarification to substantiate the evaluation results.

    Authors: We acknowledge this gap. The revised manuscript will expand the Dataset Construction section with a detailed account of the annotation protocol, the specific guidelines provided to annotators for handling spatial ambiguities, and inter-annotator agreement metrics for both reading order and parent-child relations to support the reliability of the ground-truth annotations.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: standard dataset construction and model evaluation on released data


The paper constructs SciPostLayoutTree (~8k annotated posters) and evaluates a Layout Tree Decoder on it. Claims rest on empirical accuracy improvements for challenging relations using standard train-test protocols on the public dataset. No equations, fitted parameters renamed as predictions, or self-citation chains reduce the results to inputs by construction. Annotation consistency is an external validity concern, not a circularity issue per the defined patterns. Derivation chain is self-contained against the released benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that posters possess consistent hierarchical and sequential structure that can be reliably annotated by humans. No free parameters or invented entities are introduced beyond standard model components.

axioms (1)
  • Domain assumption: Posters possess consistent hierarchical parent-child relations and a linear reading order that can be annotated reliably by humans.
    Invoked when constructing the dataset annotations and when claiming the model solves real poster structure problems.
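
The structure this axiom assumes can be represented as a simple parent map. The sketch below computes the kind of statistics plotted in Figure 2 (tree depth, tree width, and children per node). The representation, the element names, and the exact definitions of depth and width are assumptions made for illustration, not taken from the paper.

```python
from collections import defaultdict


def tree_stats(parents):
    """Compute depth, width, and children counts from a parent map.

    `parents` maps node id -> parent id, with None marking the single root.
    Depth counts edges on the longest root-to-leaf path; width is the largest
    number of nodes on one level. Both definitions are assumptions here.
    """
    children = defaultdict(list)
    root = None
    for node, parent in parents.items():
        if parent is None:
            root = node
        else:
            children[parent].append(node)

    # Walk the tree level by level, starting from the root.
    depth, width, level = 0, 1, [root]
    while True:
        next_level = [c for n in level for c in children[n]]
        if not next_level:
            break
        depth += 1
        width = max(width, len(next_level))
        level = next_level

    child_counts = {n: len(children[n]) for n in parents}
    return depth, width, child_counts


# Hypothetical poster: a title with two sections, one holding a figure
# that in turn has a caption.
poster = {
    "title": None,
    "intro": "title",
    "method": "title",
    "fig1": "method",
    "cap1": "fig1",
}
print(tree_stats(poster))
```

If human annotators disagree on which element is the parent of another, these statistics change as well, which is one way the reliability of the assumption could be probed.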

pith-pipeline@v0.9.0 · 5519 in / 1198 out tokens · 61940 ms · 2026-05-17T06:01:33.096853+00:00 · methodology


