HiPS: Hierarchical PDF Segmentation of Doctrinal Legal Books

Harikrishnan Changaramkulath; Ivan Habernal; Sabine Wehnert

PDF parsers have recently improved on page-level layout understanding. However, recovering a document-global section hierarchy with reliable boundaries remains brittle for deeply structured books: many systems expose only page-local heading roles, assume shallow depth, or rely on high-quality PDF tags or Table of Contents (TOC) metadata, and public gold-standard data for deep book hierarchies is scarce. We present HiPS for hierarchical PDF segmentation of doctrinal legal books and make two main contributions. First, we release a gold-standard benchmark of 49 open-access law books with 9,812 manually curated headings, hierarchy levels, and page anchors, enabling evaluation of title detection, hierarchy reconstruction, and section boundary assignment. Second, we introduce complementary segmentation pipelines: a TOC-based parser for books with reliable outline metadata and a TOC-free LLM-refined pipeline that combines OCR whitespace cues, XML typography, and local context. Across a broad comparison against open-source parsers and multimodal/LLM baselines, the TOC-based pipeline is strongest when metadata is complete, while the LLM-refined pipeline improves heading precision, deep-level recovery, and boundary quality when metadata is missing or noisy.
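To make the section boundary assignment task concrete, here is a minimal sketch in Python, not the authors' implementation. It assumes headings arrive in reading order as (title, level, page anchor) records, the three fields the benchmark annotates. The names `Heading`, `Section`, and `assign_boundaries` are hypothetical. So is the convention that a section runs through the page on which the next heading of the same or a shallower level begins, since page-level anchors alone cannot tell whether two sections share a page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Heading:
    title: str
    level: int  # 1 = top level, larger numbers are deeper
    page: int   # page anchor where the heading appears


@dataclass
class Section:
    title: str
    level: int
    start_page: int
    end_page: int  # inclusive (assumed convention, see lead-in)


def assign_boundaries(headings: List[Heading], last_page: int) -> List[Section]:
    """Derive an inclusive page span for each heading.

    A section ends at the next heading whose level is the same or
    shallower; if no such heading follows, it runs to the last page.
    """
    sections = []
    for i, h in enumerate(headings):
        end = last_page
        for nxt in headings[i + 1:]:
            if nxt.level <= h.level:
                end = nxt.page
                break
        sections.append(Section(h.title, h.level, h.page, end))
    return sections


# Illustrative data only, not taken from the benchmark.
toy_headings = [
    Heading("Part I", 1, 1),
    Heading("Chapter 1", 2, 1),
    Heading("Section 1", 3, 2),
    Heading("Chapter 2", 2, 10),
    Heading("Part II", 1, 20),
]
toy_sections = assign_boundaries(toy_headings, last_page=30)
```

In this toy input, "Chapter 1" spans pages 1 to 10 because "Chapter 2" is the next heading at the same level, while "Part I" spans pages 1 to 20 and contains both chapters.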
