SPIRE: Structure-Preserving Interpretable Retrieval of Evidence
Pith reviewed 2026-05-16 03:27 UTC · model grok-4.3
The pith
SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
Load-bearing premise
That the defined document primitives (paths, path sets, pruning-based extraction, global and local contextualization) can be implemented efficiently and that the embedding-based candidate generator plus document-aware aggregation will reliably surface better evidence than standard linearization without introducing structural biases or scalability issues.
Original abstract
Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
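The abstract's local contextualization step (expanding a seed selection within its structural neighborhood under a target budget) admits a simple greedy reading. The sketch below is our interpretation, not the paper's algorithm: the character-count budget, the sibling-first expansion order, and the stopping rule are all assumptions:

```python
# Greedy local contextualization sketch: absorb siblings of the seed while the
# budget allows, and climb to the parent once a whole level fits. All of the
# specifics here (budget unit, expansion order) are our assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    parent: "Node | None" = None

def link(node: Node) -> Node:
    """Set parent pointers throughout the tree."""
    for c in node.children:
        c.parent = node
        link(c)
    return node

def size(node: Node) -> int:
    return len(node.text) + sum(size(c) for c in node.children)

def local_context(seed: Node, budget: int) -> list[Node]:
    """Expand a seed selection sibling-by-sibling, then level-by-level,
    while the total size stays within the budget."""
    selection = [seed]
    node = seed
    while node.parent is not None:
        for sib in node.parent.children:
            if sib not in selection and size(sib) + sum(map(size, selection)) <= budget:
                selection.append(sib)
        if all(c in selection for c in node.parent.children):
            selection = [node.parent]  # whole parent fits: move up a level
            node = node.parent
        else:
            break
    return selection
```

Under this reading, a tight budget returns the seed plus a few siblings; a generous one returns a compact ancestor subtree, which matches the abstract's "compact, context-rich view under a target budget".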
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SPIRE, a structure-aware retrieval pipeline for semi-structured documents such as HTML. It defines document primitives including paths and path sets, subdocument extraction by pruning, global contextualization (adding non-local scaffolding like titles and headers), and local contextualization (expanding within structural neighborhoods). These support an embedding-based candidate generator indexing sentence-seeded subdocuments, a query-time document-aware aggregation step, and a contextual filtering stage. Experiments on HTML question-answering benchmarks claim that this yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines while maintaining scalability.
Significance. If the results hold after addressing the comparison design, the work could meaningfully advance retrieval-augmented generation over structured sources by reducing the mismatch between tree-structured documents and flat embedding interfaces. The explicit focus on addressable, citation-ready subdocuments with amortized context is a practical strength for interpretability in RAG pipelines.
major comments (2)
- [Abstract / Experimental evaluation] The central claim that preserving structure via the defined primitives (paths, pruning, document-aware aggregation) produces superior citations rests on comparisons to passage-based baselines. However, SPIRE explicitly incorporates global contextualization (non-local scaffolding) and local contextualization (structural neighborhood expansion) that standard linearization baselines typically omit. Without a control that supplies equivalent context to the passage baselines, it is impossible to isolate whether the gains derive from the tree primitives themselves or from the added contextual information. This is load-bearing for the structure-specific advantage asserted in the abstract.
- [Method description] The embedding-based candidate generator indexes sentence-seeded subdocuments and the aggregation step amortizes shared structural context, but the manuscript provides no quantitative analysis of how pruning-based extraction interacts with embedding similarity under varying document depths or table/list structures. If pruning introduces systematic biases in candidate selection (e.g., favoring shallower paths), this would undermine the claim of reliable evidence surfacing.
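A cheap diagnostic for this concern would compare the path-depth distribution of retrieved candidates against the corpus at large; the mean-depth gap used below is our choice of statistic, not anything the paper reports:

```python
# Probe for the depth bias the review raises: if retrieval systematically
# favors shallow paths, the mean depth of retrieved candidates will sit well
# below the corpus mean. The statistic is our choice, not the paper's.

def mean_depth(paths: list[tuple[int, ...]]) -> float:
    return sum(len(p) for p in paths) / len(paths)

def depth_bias(retrieved: list[tuple[int, ...]], corpus: list[tuple[int, ...]]) -> float:
    """Negative values mean retrieval skews shallower than the corpus."""
    return mean_depth(retrieved) - mean_depth(corpus)
```

A reported value of this kind (stratified by benchmark and by table/list density) would directly address the objection.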
minor comments (1)
- [Abstract] The abstract introduces several new terms (subdocument, global contextualization, local contextualization) without a compact forward reference or diagram; a small illustrative figure early in the paper would improve readability.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: HTML and similar documents possess reliable tree structure that can be parsed and addressed via paths
invented entities (3)
- subdocument: no independent evidence
- global contextualization: no independent evidence
- local contextualization: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Introducing contextual retrieval
Anthropic. Introducing contextual retrieval. Anthropic Blog, September 2024. Accessed: 2026-02-09
work page 2024
-
[2]
XML subtree queries: specification and composition
Michael Benedikt and Irini Fundulaki. XML subtree queries: specification and composition. In Proceedings of the 10th International Conference on Database Programming Languages, DBPL'05, pages 138–153, Berlin, Heidelberg, 2005. Springer-Verlag
work page 2005
-
[3]
Natural Language Processing with Python
Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, Sebastopol, CA, 2009. Chapter on sentence segmentation
work page 2009
-
[4]
Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Liber to Salvador Barrena, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan Das, Donald Metzler, Slav Petrov, and Kellie Webster. Attributed question answering: Eva...
-
[5]
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
work page 2024
-
[6]
SelfCite: Self-supervised alignment for context attribution in large language models
Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, and Wen-Tau Yih. SelfCite: Self-supervised alignment for context attribution in large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, edit...
work page 2025
-
[7]
A core calculus for documents
William J. Crichton and Shriram Krishnamurthi. A core calculus for documents. Proceedings of the ACM on Programming Languages, 8(POPL):667–694, 2024
work page 2024
-
[8]
Advances in XML retrieval: The INEX initiative
Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik. Advances in XML retrieval: The INEX initiative. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Poster), 2005. The INEX initiative ran from 2002–2010; see also Fuhr et al., "INEX: Initiative for the Evaluation of XML Ret...
work page 2005
-
[9]
Enabling large language models to generate text with citations
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
work page 2023
-
[10]
Late chunking: Contextual chunk embeddings using long-context embedding models
Michael Günther, Isabelle Mohr, Daniel J. Williams, Bo Wang, and Han Xiao. Late chunking: Contextual chunk embeddings using long-context embedding models. arXiv preprint arXiv:2409.04701, 2024
-
[11]
Frank G. Halasz and Mayer Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30–39, 1994
work page 1994
-
[12]
Context rot: How increasing input tokens impacts LLM performance
Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Technical report, Chroma Research, July 2025
work page 2025
-
[13]
Wilson Lin. Building a web search engine from scratch in two months with 3 billion neural embeddings, August 2025. Accessed: 2025-11-17
work page 2025
-
[14]
Lost in the middle: How language models use long contexts
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (TACL), 12:157–173, 2024
work page 2024
-
[15]
John MacFarlane, Albert Krewinkel, and Jesse Rosenthal. Pandoc, December 2025
work page 2025
-
[16]
Theodor H. Nelson. Literary Machines. Mindful Press, 1981. Reprinted through 1993
work page 1981
-
[17]
Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019
work page 2019
-
[18]
Qwen3-235B-A22B-Instruct model card
Qwen Team. Qwen3-235B-A22B-Instruct model card. Hugging Face, 2025. Accessed 2026-01-28
work page 2025
-
[19]
Qwen Team. Qwen3-32B model card. Hugging Face, 2025. Accessed 2026-01-28
work page 2025
-
[20]
Qwen3-Embedding-0.6B model card
Qwen Team. Qwen3-Embedding-0.6B model card. Hugging Face, 2025. Accessed 2026-01-28
work page 2025
-
[21]
Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[22]
ASQA: Factoid questions meet long-form answers
Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2022
work page 2022
-
[23]
HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in rag systems
Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in rag systems. In Proceedings of the ACM on Web Conference 2025, WWW '25, pages 1733–1746, New York, NY, USA, 2025. Association for Computing Machinery
work page 2025
-
[24]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page 2025
-
[25]
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018
work page 2018
-
[26]
Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li. LongCite: Enabling LLMs to generate fine-grained citations in long-context QA. arXiv preprint arXiv:2409.02897, 2024