pith. machine review for the scientific record.

arxiv: 2604.20849 · v1 · submitted 2026-02-12 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence


Pith reviewed 2026-05-16 03:27 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords structure · context · contextualization · retrieval · structural · candidates · document

The pith

SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem is that current AI retrieval systems break web pages into flat pieces of text, losing the headings, lists, and tables that make information understandable. SPIRE instead treats each document as a tree and lets the system pick precise sub-parts called subdocuments. These subdocuments keep their original address in the tree so the system can later add the right amount of surrounding context. Global contextualization brings in titles and overall structure to make a selection readable. Local contextualization grows the selection outward within its immediate neighborhood until it reaches a size limit. The system first finds candidate subdocuments with embeddings, then aggregates shared context across them, and finally re-scores the results with the locally expanded views. On standard HTML question-answering tests, this approach returned more useful and varied citations than the usual chunking methods while staying fast enough for practical use.
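The local-contextualization step described above admits a compact illustration: starting from a seed subdocument, the selection absorbs its structural neighborhood by stepping up the tree while the enclosing subtree still fits a size budget. The following is a minimal sketch under assumed data structures (a `Node` class, character counts as a stand-in for tokens), not SPIRE's actual implementation.

```python
# Hypothetical sketch of local contextualization: grow a seed selection
# outward through its structural neighborhood until a size budget is hit.
# The Node class and budget handling are illustrative, not SPIRE's API.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)
    parent: "Node | None" = None

def size(node):
    """Total character count of a node's subtree (a stand-in for tokens)."""
    return len(node.text) + sum(size(c) for c in node.children)

def local_contextualize(seed, budget):
    """Return the largest enclosing ancestor of `seed` whose subtree
    still fits within `budget`; siblings are absorbed by stepping up."""
    view = seed
    while view.parent is not None and size(view.parent) <= budget:
        view = view.parent
    return view
```

With a tight budget the seed is returned as-is; with a generous one the view widens to a section or the whole document, which is the "compact, context-rich view under a target budget" behavior the paper describes.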

Core claim

Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.

Load-bearing premise

That the defined document primitives (paths, path sets, pruning-based extraction, global and local contextualization) can be implemented efficiently and that the embedding-based candidate generator plus document-aware aggregation will reliably surface better evidence than standard linearization without introducing structural biases or scalability issues.

Figures

Figures reproduced from arXiv: 2604.20849 by Mike Rainey, Muhammed Sezer, Umut Acar.

Figure 1. Sample excerpts generated while retrieving for the query: “How many state parks are …”
Figure 2. Sentence-anchored subdocuments. Sentence boundaries are inferred and recorded using …
Original abstract

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
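The abstract's "subdocument extraction by pruning" can be illustrated compactly: given a tree and a set of selected paths, drop every node that does not lie on (or above) a selected path. This is a hedged sketch with an assumed nested-dict representation and index-tuple paths; the paper's actual primitives are surely richer.

```python
# Hypothetical sketch of subdocument extraction by pruning. A path is a
# tuple of child indices from the root; nodes off every selected path
# are dropped, while ancestors of selected nodes are retained so the
# extracted subdocument keeps its structural identity.
def prune(tree, paths):
    """tree: nested dict {"tag": str, "children": [...]};
    paths: set of index tuples addressing the kept nodes."""
    def keep(node, prefix):
        # Retain this node only if some selected path passes through it.
        if not any(p[:len(prefix)] == prefix for p in paths):
            return None
        children = [keep(c, prefix + (i,))
                    for i, c in enumerate(node.get("children", []))]
        return {"tag": node["tag"],
                "children": [c for c in children if c is not None]}
    return keep(tree, ())
```

Because the pruned result preserves each kept node's original address, later contextualization steps can still locate its neighborhood in the full tree.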

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents SPIRE, a structure-aware retrieval pipeline for semi-structured documents such as HTML. It defines document primitives including paths and path sets, subdocument extraction by pruning, global contextualization (adding non-local scaffolding like titles and headers), and local contextualization (expanding within structural neighborhoods). These support an embedding-based candidate generator indexing sentence-seeded subdocuments, a query-time document-aware aggregation step, and a contextual filtering stage. Experiments on HTML question-answering benchmarks claim that this yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines while maintaining scalability.

Significance. If the results hold after addressing the comparison design, the work could meaningfully advance retrieval-augmented generation over structured sources by reducing the mismatch between tree-structured documents and flat embedding interfaces. The explicit focus on addressable, citation-ready subdocuments with amortized context is a practical strength for interpretability in RAG pipelines.

major comments (2)
  1. [Abstract / Experimental evaluation] The central claim that preserving structure via the defined primitives (paths, pruning, document-aware aggregation) produces superior citations rests on comparisons to passage-based baselines. However, SPIRE explicitly incorporates global contextualization (non-local scaffolding) and local contextualization (structural neighborhood expansion) that standard linearization baselines typically omit. Without a control that supplies equivalent context to the passage baselines, it is impossible to isolate whether the gains derive from the tree primitives themselves or from the added contextual information. This is load-bearing for the structure-specific advantage asserted in the abstract.
  2. [Method description] The embedding-based candidate generator indexes sentence-seeded subdocuments and the aggregation step amortizes shared structural context, but the manuscript provides no quantitative analysis of how pruning-based extraction interacts with embedding similarity under varying document depths or table/list structures. If pruning introduces systematic biases in candidate selection (e.g., favoring shallower paths), this would undermine the claim of reliable evidence surfacing.
minor comments (1)
  1. [Abstract] The abstract introduces several new terms (subdocument, global contextualization, local contextualization) without a compact forward reference or diagram; a small illustrative figure early in the paper would improve readability.
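The depth-bias concern raised in major comment 2 is testable with a simple diagnostic: compare the depth distribution of retrieved candidate paths against that of all indexed subdocuments. A minimal sketch follows, with hypothetical helper names and the assumption that candidates are addressed by index-tuple paths.

```python
# Hypothetical depth-bias probe (not from the paper): depth of a
# subdocument = length of its path from the document root.
from collections import Counter

def depth_histogram(paths):
    """Count how many candidate paths occur at each tree depth."""
    return Counter(len(p) for p in paths)

def mean_depth(paths):
    """Average tree depth of a collection of paths."""
    paths = list(paths)
    return sum(len(p) for p in paths) / len(paths)
```

If the mean depth of retrieved candidates is systematically smaller than the index-wide mean, that would be evidence that pruning or aggregation favors shallow nodes, the failure mode the referee flags.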

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the introduction of several new document primitives whose correctness and utility are asserted rather than derived from prior results.

axioms (1)
  • domain assumption HTML and similar documents possess reliable tree structure that can be parsed and addressed via paths
    Invoked throughout the description of subdocument extraction and contextualization.
invented entities (3)
  • subdocument no independent evidence
    purpose: precise, addressable selection that preserves structural identity while deferring context choice
    Core new abstraction introduced to replace flat chunks.
  • global contextualization no independent evidence
    purpose: adds non-local scaffolding (titles, headers, list/table structure) to make a selection intelligible
    New mechanism defined in the pipeline.
  • local contextualization no independent evidence
    purpose: expands a seed selection within its structural neighborhood under a budget
    New mechanism defined in the pipeline.
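To make the contrast between the two contextualization entities concrete, here is a toy rendering of global contextualization: prepending the non-local scaffolding (ancestor headings) that makes a bare selection intelligible on its own. The function name and trail format are illustrative assumptions, not the paper's.

```python
# Hypothetical sketch of global contextualization: a selection is made
# readable by prefixing the trail of ancestor titles/headers, rather
# than by expanding the selection itself (which is local
# contextualization's job).
def global_context(heading_trail, selection):
    """heading_trail: ancestor headings from outermost to innermost."""
    if not heading_trail:
        return selection
    return "[" + " > ".join(heading_trail) + "]\n" + selection
```

Note the division of labor: this adds scaffolding without growing the selection, whereas local contextualization grows the selection within its neighborhood under a budget.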

pith-pipeline@v0.9.0 · 5570 in / 1456 out tokens · 135708 ms · 2026-05-16T03:27:38.449833+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Introducing contextual retrieval

    Anthropic. Introducing contextual retrieval. Anthropic Blog, September 2024. Accessed: 2026-02-09

  2. [2]

    XML subtree queries: specification and composition

    Michael Benedikt and Irini Fundulaki. XML subtree queries: specification and composition. In Proceedings of the 10th International Conference on Database Programming Languages, DBPL’05, pages 138–153, Berlin, Heidelberg, 2005. Springer-Verlag

  3. [3]

    Natural Language Processing with Python

    Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Sebastopol, CA, 2009. Chapter on sentence segmentation

  4. [4]

    Attributed question answering

    Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Liberto Salvador Barrena, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan Das, Donald Metzler, Slav Petrov, and Kellie Webster. Attributed question answering: Eva...

  5. [5]

    Dense X retrieval: What retrieval granularity should we use?

    Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  6. [6]

    SelfCite: Self-supervised alignment for context attribution in large language models

    Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, and Wen-Tau Yih. SelfCite: Self-supervised alignment for context attribution in large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, edit...

  7. [7]

    A core calculus for documents

    William J. Crichton and Shriram Krishnamurthi. A core calculus for documents. Proceedings of the ACM on Programming Languages, 8(POPL):667–694, 2024

  8. [8]

    Advances in XML retrieval: The INEX initiative

    Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik. Advances in XML retrieval: The INEX initiative. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Poster), 2005. The INEX initiative ran from 2002–2010; see also Fuhr et al., “INEX: Initiative for the Evaluation of XML Ret...

  9. [9]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  10. [10]

    Late chunking: Contextual chunk embeddings using long-context embedding models

    Michael Günther, Isabelle Mohr, Daniel J. Williams, Bo Wang, and Han Xiao. Late chunking: Contextual chunk embeddings using long-context embedding models. arXiv preprint arXiv:2409.04701, 2024

  11. [11]

    The Dexter hypertext reference model

    Frank G. Halasz and Mayer Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30–39, 1994

  12. [12]

    Context rot: How increasing input tokens impacts LLM performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Technical report, Chroma Research, July 2025

  13. [13]

    Building a web search engine from scratch in two months with 3 billion neural embeddings, August 2025

    Wilson Lin. Building a web search engine from scratch in two months with 3 billion neural embeddings, August 2025. Accessed: 2025-11-17

  14. [14]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (TACL), 12:157–173, 2024

  15. [15]

    Pandoc, December 2025

    John MacFarlane, Albert Krewinkel, and Jesse Rosenthal. Pandoc, December 2025

  16. [16]

    Literary Machines

    Theodor H. Nelson. Literary Machines. Mindful Press, 1981. Reprinted through 1993

  17. [17]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019

  18. [18]

    Qwen3-235B-A22B-Instruct model card

    Qwen Team. Qwen3-235B-A22B-Instruct model card. Hugging Face, 2025. Accessed 2026-01-28

  19. [19]

    Qwen3-32B model card

    Qwen Team. Qwen3-32B model card. Hugging Face, 2025. Accessed 2026-01-28

  20. [20]

    Qwen3-Embedding-0.6B model card

    Qwen Team. Qwen3-Embedding-0.6B model card. Hugging Face, 2025. Accessed 2026-01-28

  21. [21]

    RAPTOR: Recursive abstractive processing for tree-organized retrieval

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR), 2024

  22. [22]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2022

  23. [23]

    HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in RAG systems

    Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in RAG systems. In Proceedings of the ACM on Web Conference 2025, WWW ’25, pages 1733–1746, New York, NY, USA, 2025. Association for Computing Machinery

  24. [24]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  25. [25]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018

  26. [26]

    LongCite: Enabling LLMs to generate fine-grained citations in long-context QA

    Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li. LongCite: Enabling LLMs to generate fine-grained citations in long-context QA. arXiv preprint arXiv:2409.02897, 2024
