pith. machine review for the scientific record.

arxiv: 2604.20849 · v1 · submitted 2026-02-12 · 💻 cs.IR · cs.AI · cs.CL

Recognition: no theorem link

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence


Pith reviewed 2026-05-16 03:27 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL
keywords structure · context · contextualization · retrieval · structural · candidates · document

The pith

SPIRE presents a tree-structured retrieval method using subdocuments, paths, and dual contextualization that produces higher-quality and more diverse citations than passage-based baselines on HTML QA benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The core problem is that current AI retrieval systems break web pages into flat pieces of text, losing the headings, lists, and tables that make information understandable. SPIRE instead treats each document as a tree and lets the system pick precise sub-parts called subdocuments. These subdocuments keep their original address in the tree so the system can later add the right amount of surrounding context. Global contextualization brings in titles and overall structure to make a selection readable. Local contextualization grows the selection outward within its immediate neighborhood until it reaches a size limit. The system first finds candidate subdocuments with embeddings, then aggregates shared context across them, and finally re-scores the results with the locally expanded views. On standard HTML question-answering tests, this approach returned more useful and varied citations than the usual chunking methods while staying fast enough for practical use.
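The local-contextualization step described above admits a compact illustration: starting from a seed subdocument, the selection absorbs its structural neighborhood by stepping up the tree while the enclosing subtree still fits a size budget. The following is a minimal sketch under assumed data structures (a `Node` class, character counts as a stand-in for tokens), not SPIRE's actual implementation.

```python
# Hypothetical sketch of local contextualization: grow a seed selection
# outward through its structural neighborhood until a size budget is hit.
# The Node class and budget handling are illustrative, not SPIRE's API.
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list["Node"] = field(default_factory=list)
    parent: "Node | None" = None

def size(node):
    """Total character count of a node's subtree (a stand-in for tokens)."""
    return len(node.text) + sum(size(c) for c in node.children)

def local_contextualize(seed, budget):
    """Return the largest enclosing ancestor of `seed` whose subtree
    still fits within `budget`; siblings are absorbed by stepping up."""
    view = seed
    while view.parent is not None and size(view.parent) <= budget:
        view = view.parent
    return view
```

With a tight budget the seed is returned as-is; with a generous one the view widens to a section or the whole document, which is the "compact, context-rich view under a target budget" behavior the paper describes.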

Core claim

Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.

Load-bearing premise

That the defined document primitives (paths, path sets, pruning-based extraction, global and local contextualization) can be implemented efficiently and that the embedding-based candidate generator plus document-aware aggregation will reliably surface better evidence than standard linearization without introducing structural biases or scalability issues.

Figures

Figures reproduced from arXiv: 2604.20849 by Mike Rainey, Muhammed Sezer, Umut Acar.

Figure 1. Sample excerpts generated while retrieving for the query: “How many state parks are …”
Figure 2. Sentence-anchored subdocuments. Sentence boundaries are inferred and recorded using …
Original abstract

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives: paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.
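The abstract's "subdocument extraction by pruning" can be illustrated compactly: given a tree and a set of selected paths, drop every node that does not lie on (or above) a selected path. This is a hedged sketch with an assumed nested-dict representation and index-tuple paths; the paper's actual primitives are surely richer.

```python
# Hypothetical sketch of subdocument extraction by pruning. A path is a
# tuple of child indices from the root; nodes off every selected path
# are dropped, while ancestors of selected nodes are retained so the
# extracted subdocument keeps its structural identity.
def prune(tree, paths):
    """tree: nested dict {"tag": str, "children": [...]};
    paths: set of index tuples addressing the kept nodes."""
    def keep(node, prefix):
        # Retain this node only if some selected path passes through it.
        if not any(p[:len(prefix)] == prefix for p in paths):
            return None
        children = [keep(c, prefix + (i,))
                    for i, c in enumerate(node.get("children", []))]
        return {"tag": node["tag"],
                "children": [c for c in children if c is not None]}
    return keep(tree, ())
```

Because the pruned result preserves each kept node's original address, later contextualization steps can still locate its neighborhood in the full tree.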

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents SPIRE, a structure-aware retrieval pipeline for semi-structured documents such as HTML. It defines document primitives including paths and path sets, subdocument extraction by pruning, global contextualization (adding non-local scaffolding like titles and headers), and local contextualization (expanding within structural neighborhoods). These support an embedding-based candidate generator indexing sentence-seeded subdocuments, a query-time document-aware aggregation step, and a contextual filtering stage. Experiments on HTML question-answering benchmarks claim that this yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines while maintaining scalability.

Significance. If the results hold after addressing the comparison design, the work could meaningfully advance retrieval-augmented generation over structured sources by reducing the mismatch between tree-structured documents and flat embedding interfaces. The explicit focus on addressable, citation-ready subdocuments with amortized context is a practical strength for interpretability in RAG pipelines.

major comments (2)
  1. [Abstract / Experimental evaluation] The central claim that preserving structure via the defined primitives (paths, pruning, document-aware aggregation) produces superior citations rests on comparisons to passage-based baselines. However, SPIRE explicitly incorporates global contextualization (non-local scaffolding) and local contextualization (structural neighborhood expansion) that standard linearization baselines typically omit. Without a control that supplies equivalent context to the passage baselines, it is impossible to isolate whether the gains derive from the tree primitives themselves or from the added contextual information. This is load-bearing for the structure-specific advantage asserted in the abstract.
  2. [Method description] The embedding-based candidate generator indexes sentence-seeded subdocuments and the aggregation step amortizes shared structural context, but the manuscript provides no quantitative analysis of how pruning-based extraction interacts with embedding similarity under varying document depths or table/list structures. If pruning introduces systematic biases in candidate selection (e.g., favoring shallower paths), this would undermine the claim of reliable evidence surfacing.
minor comments (1)
  1. [Abstract] The abstract introduces several new terms (subdocument, global contextualization, local contextualization) without a compact forward reference or diagram; a small illustrative figure early in the paper would improve readability.
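The depth-bias concern raised in major comment 2 is testable with a simple diagnostic: compare the depth distribution of retrieved candidate paths against that of all indexed subdocuments. A minimal sketch follows, with hypothetical helper names and the assumption that candidates are addressed by index-tuple paths.

```python
# Hypothetical depth-bias probe (not from the paper): depth of a
# subdocument = length of its path from the document root.
from collections import Counter

def depth_histogram(paths):
    """Count how many candidate paths occur at each tree depth."""
    return Counter(len(p) for p in paths)

def mean_depth(paths):
    """Average tree depth of a collection of paths."""
    paths = list(paths)
    return sum(len(p) for p in paths) / len(paths)
```

If the mean depth of retrieved candidates is systematically smaller than the index-wide mean, that would be evidence that pruning or aggregation favors shallow nodes, the failure mode the referee flags.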

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim rests on the introduction of several new document primitives whose correctness and utility are asserted rather than derived from prior results.

axioms (1)
  • domain assumption HTML and similar documents possess reliable tree structure that can be parsed and addressed via paths
    Invoked throughout the description of subdocument extraction and contextualization.
invented entities (3)
  • subdocument no independent evidence
    purpose: precise, addressable selection that preserves structural identity while deferring context choice
    Core new abstraction introduced to replace flat chunks.
  • global contextualization no independent evidence
    purpose: adds non-local scaffolding (titles, headers, list/table structure) to make a selection intelligible
    New mechanism defined in the pipeline.
  • local contextualization no independent evidence
    purpose: expands a seed selection within its structural neighborhood under a budget
    New mechanism defined in the pipeline.
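To make the contrast between the two contextualization entities concrete, here is a toy rendering of global contextualization: prepending the non-local scaffolding (ancestor headings) that makes a bare selection intelligible on its own. The function name and trail format are illustrative assumptions, not the paper's.

```python
# Hypothetical sketch of global contextualization: a selection is made
# readable by prefixing the trail of ancestor titles/headers, rather
# than by expanding the selection itself (which is local
# contextualization's job).
def global_context(heading_trail, selection):
    """heading_trail: ancestor headings from outermost to innermost."""
    if not heading_trail:
        return selection
    return "[" + " > ".join(heading_trail) + "]\n" + selection
```

Note the division of labor: this adds scaffolding without growing the selection, whereas local contextualization grows the selection within its neighborhood under a budget.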

pith-pipeline@v0.9.0 · 5570 in / 1456 out tokens · 135708 ms · 2026-05-16T03:27:38.449833+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1]

    Introducing contextual retrieval

    Anthropic. Introducing contextual retrieval. Anthropic Blog, September 2024. Accessed: 2026-02-09

  2. [2]

    XML subtree queries: specification and composition

    Michael Benedikt and Irini Fundulaki. XML subtree queries: specification and composition. In Proceedings of the 10th International Conference on Database Programming Languages, DBPL’05, pages 138–153, Berlin, Heidelberg, 2005. Springer-Verlag

  3. [3]

    Natural Language Processing with Python

    Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Sebastopol, CA, 2009. Chapter on sentence segmentation

  4. [4]

    Attributed question answering

    Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Jacob Eisenstein, Kuzman Ganchev, Jonathan Herzig, Kai Hui, Tom Kwiatkowski, Ji Ma, Jianmo Ni, Liberto Salvador Barrena, Tal Schuster, William W. Cohen, Michael Collins, Dipanjan Das, Donald Metzler, Slav Petrov, and Kellie Webster. Attributed question answering: Eva...

  5. [5]

    Dense X retrieval: What retrieval granularity should we use?

    Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  6. [6]

    SelfCite: Self-supervised alignment for context attribution in large language models

    Yung-Sung Chuang, Benjamin Cohen-Wang, Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James R. Glass, Shang-Wen Li, and Wen-Tau Yih. SelfCite: Self-supervised alignment for context attribution in large language models. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, edit...

  7. [7]

    A core calculus for documents

    William J. Crichton and Shriram Krishnamurthi. A core calculus for documents. Proceedings of the ACM on Programming Languages, 8(POPL):667–694, 2024

  8. [8]

    Advances in XML retrieval: The INEX initiative

    Norbert Fuhr, Mounia Lalmas, Saadia Malik, and Zoltán Szlávik. Advances in XML retrieval: The INEX initiative. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Poster), 2005. The INEX initiative ran from 2002–2010; see also Fuhr et al., “INEX: Initiative for the Evaluation of XML Ret...

  9. [9]

    Enabling large language models to generate text with citations

    Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  10. [10]

    Late chunking: Contextual chunk embeddings using long-context embedding models

    Michael Günther, Isabelle Mohr, Daniel J. Williams, Bo Wang, and Han Xiao. Late chunking: Contextual chunk embeddings using long-context embedding models. arXiv preprint arXiv:2409.04701, 2024

  11. [11]

    The Dexter hypertext reference model

    Frank G. Halasz and Mayer Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30–39, 1994

  12. [12]

    Context rot: How increasing input tokens impacts LLM performance

    Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Technical report, Chroma Research, July 2025

  13. [13]

    Building a web search engine from scratch in two months with 3 billion neural embeddings, August 2025

    Wilson Lin. Building a web search engine from scratch in two months with 3 billion neural embeddings, August 2025. Accessed: 2025-11-17

  14. [14]

    Lost in the middle: How language models use long contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (TACL), 12:157–173, 2024

  15. [15]

    Pandoc, December 2025

    John MacFarlane, Albert Krewinkel, and Jesse Rosenthal. Pandoc, December 2025

  16. [16]

    Literary Machines

    Theodor H. Nelson. Literary Machines. Mindful Press, 1981. Reprinted through 1993

  17. [17]

    Passage Re-ranking with BERT

    Rodrigo Nogueira and Kyunghyun Cho. Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085, 2019

  18. [18]

    Qwen3-235B-A22B-Instruct model card

    Qwen Team. Qwen3-235B-A22B-Instruct model card. Hugging Face, 2025. Accessed 2026-01-28

  19. [19]

    Qwen3-32B model card

    Qwen Team. Qwen3-32B model card. Hugging Face, 2025. Accessed 2026-01-28

  20. [20]

    Qwen3-Embedding-0.6B model card

    Qwen Team. Qwen3-Embedding-0.6B model card. Hugging Face, 2025. Accessed 2026-01-28

  21. [21]

    RAPTOR: Recursive abstractive processing for tree-organized retrieval

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations (ICLR), 2024

  22. [22]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2022

  23. [23]

    HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in RAG systems

    Jiejun Tan, Zhicheng Dou, Wen Wang, Mang Wang, Weipeng Chen, and Ji-Rong Wen. HtmlRAG: HTML is better than plain text for modeling retrieved knowledge in RAG systems. In Proceedings of the ACM on Web Conference 2025, WWW ’25, pages 1733–1746, New York, NY, USA, 2025. Association for Computing Machinery

  24. [24]

    Qwen3 technical report, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  25. [25]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380. Association for Computational Linguistics, 2018

  26. [26]

    LongCite: Enabling LLMs to generate fine-grained citations in long-context QA

    Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, and Juanzi Li. LongCite: Enabling LLMs to generate fine-grained citations in long-context QA. arXiv preprint arXiv:2409.02897, 2024
