pith. sign in

arxiv: 2509.00909 · v2 · pith:7UFGASQTnew · submitted 2025-08-31 · 💻 cs.IR

HiPS: Hierarchical PDF Segmentation of Doctrinal Legal Books

classification 💻 cs.IR
keywords booksmetadatahierarchypipelinesegmentationboundarydoctrinalgold-standard
0
0 comments X
read the original abstract

PDF parsers have recently improved on page-level layout understanding. However, recovering a document-global section hierarchy with reliable boundaries remains brittle for deeply structured books: many systems expose only page-local heading roles, assume shallow depth, or rely on high-quality PDF tags or Table of Contents (TOC) metadata, and public gold-standard data for deep book hierarchies is scarce. We present HiPS for hierarchical PDF segmentation of doctrinal legal books and make two main contributions. First, we release a gold-standard benchmark of 49 open-access law books with 9,812 manually curated headings, hierarchy levels, and page anchors, enabling evaluation of title detection, hierarchy reconstruction, and section boundary assignment. Second, we introduce complementary segmentation pipelines: a TOC-based parser for books with reliable outline metadata and a TOC-free LLM-refined pipeline that combines OCR whitespace cues, XML typography, and local context. Across a broad comparison against open-source parsers and multimodal/LLM baselines, the TOC-based pipeline is strongest when metadata is complete, while the LLM-refined pipeline improves heading precision, deep-level recovery, and boundary quality when metadata is missing or noisy.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.