pith. machine review for the scientific record.

arxiv: 2604.06173 · v1 · submitted 2026-01-24 · 💻 cs.IR · cs.AI

Recognition: 1 theorem link · Lean Theorem

Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 11:24 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords legal question answering · statute retrieval · graph-guided retrieval · hallucination · model safety · regulatory reasoning · benchmark · fire safety regulations

The pith

Graph-guided retrieval improves performance on statute-centric legal questions, but domain-adapted models hallucinate more when statutory evidence is missing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal question answering benchmarks have mostly tested case law, but statutes pose different problems because rules are spread across hierarchically linked documents. This paper introduces SearchFireSafety, a benchmark built on fire-safety regulations, to test whether models can retrieve the right pieces of evidence and recognize when they lack enough information to answer. Experiments show that using graph structure to guide retrieval helps models answer better. However, models fine-tuned on legal data are more likely to fabricate answers than to admit that the provided context is incomplete. The work argues that future benchmarks need to check both retrieval accuracy and safe refusal behavior together.

Core claim

The paper establishes that statute-centric legal QA involves hierarchically fragmented evidence, leading to a retrieval gap in standard methods and hallucinations in incomplete contexts. By creating SearchFireSafety with real-world citation-aware questions and synthetic partial-context tests, it demonstrates that graph-guided retrieval substantially boosts model performance across LLMs, while revealing that domain-adapted models exhibit higher hallucination rates when key statutory evidence is absent.

What carries the argument

The SearchFireSafety benchmark: a dual-source evaluation framework that pairs real regulatory questions requiring hierarchical citation retrieval with synthetic scenarios that test refusal under missing context; graph-guided retrieval is the structure-aware method under test.
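
The graph-guided method itself is not spelled out on this page, but its shape is clear from the pith: dense top-k seeding over provisions, then expansion along explicit citation edges so that hierarchically linked evidence travels with the seeds. The sketch below is one minimal reading of that pattern, not the paper's implementation; `Provision`, its fields, and the cosine scoring are illustrative assumptions.

```python
# Minimal sketch of graph-guided statute retrieval as the pith describes it:
# retrieve top-k seed provisions by dense similarity, then pull in documents
# they explicitly cite, so hierarchically fragmented evidence arrives together.
# All names here (Provision, cites, graph_guided_retrieve) are illustrative.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Provision:
    doc_id: str
    text: str
    embedding: np.ndarray
    cites: list[str] = field(default_factory=list)  # explicit cross-references


def graph_guided_retrieve(query_vec: np.ndarray, corpus: dict[str, Provision],
                          k: int = 5, hops: int = 1) -> list[str]:
    """Dense top-k seeds, expanded `hops` steps along citation edges."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # 1. Seed: rank every provision against the query by cosine similarity.
    ranked = sorted(corpus, key=lambda d: cos(query_vec, corpus[d].embedding),
                    reverse=True)
    seeds = ranked[:k]

    # 2. Expand: follow explicit citation edges out from the current frontier.
    selected, frontier = set(seeds), set(seeds)
    for _ in range(hops):
        frontier = {c for d in frontier
                    for c in corpus[d].cites if c in corpus} - selected
        selected |= frontier

    return seeds + sorted(selected - set(seeds))  # seeds first, then neighbors
```

Compared with flat BM25 or dense retrieval, the only structural ingredient is the `cites` edge set; isolating its contribution is exactly what the referee's third major comment asks for.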

If this is right

  • Models using graph-guided retrieval can better handle distributed statutory evidence in regulatory domains.
  • Domain adaptation for legal tasks increases the risk of unsafe hallucinations in incomplete contexts.
  • Benchmarks for legal QA should incorporate both hierarchical retrieval evaluation and safety testing for abstention.
  • Regulatory AI systems may require specific safeguards when evidence from linked statutes is partial.
  • The trade-off suggests that retrieval improvements alone do not ensure safe behavior in statute-based reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other regulatory areas with similar hierarchical structure, such as tax or environmental law, could benefit from the same graph approach.
  • The findings imply that training data for domain-adapted models should include more examples of abstaining on incomplete information.
  • Extending the benchmark to multiple jurisdictions might reveal if the safety trade-off is universal.
  • Developers of legal AI tools should prioritize refusal training alongside retrieval enhancements.

Load-bearing premise

The assumption that fire-safety regulations are representative of broader statute-centric legal QA challenges and that the synthetic partial-context scenarios accurately simulate real-world cases of missing evidence.

What would settle it

Testing the same models on a different regulatory domain like building codes or tax statutes and measuring if the hallucination increase in domain-adapted models persists when evidence is withheld.

Figures

Figures reproduced from arXiv: 2604.06173 by Hyunbin Jin, Ijun Jang, Jeongjae Park, Jewon Yeom, Jinkwan Jang, Kyubyung Chae, Seunghyun Bae, Taesup Kim.

Figure 1: Overview of the proposed framework and datasets. (1) Construction of a temporally current legal corpus …
Figure 2: PCA-based local subgraph visualization (cosine kNN vs. explicit). For each query, we embed the local node set and project it to 2D with PCA (PC1/PC2). Seeds (top-k retrieved documents) are shown in blue, ground-truth documents in orange, and other 1-hop neighbor candidates in gray. We draw directed edges from seeds to their neighbors; edges that directly connect a seed to a ground-truth node are h…
Figure 3: Effect of continued pretraining (CPT) across …
Figure 4: Distribution of Inquiry Types on Real-World …
Figure 5: Examples of Korea National Law Information …
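
Figure 2's projection can be reproduced in outline: embed the local node set (seeds, ground truth, 1-hop neighbors), fit a two-component PCA, and color the groups as the caption describes. A minimal sketch with scikit-learn and matplotlib; the random embedding matrix and group sizes are placeholders, not the paper's data.

```python
# Sketch of Figure 2's local-subgraph view: project node embeddings to 2D
# with PCA; seeds in blue, ground truth in orange, 1-hop neighbors in gray.
# `embeddings` and `labels` below are invented stand-ins for the real data.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(30, 768))             # stand-in node embeddings
labels = ["seed"] * 5 + ["gold"] * 3 + ["neighbor"] * 22

xy = PCA(n_components=2).fit_transform(embeddings)  # PC1/PC2 as in the figure
palette = {"seed": "tab:blue", "gold": "tab:orange", "neighbor": "lightgray"}
for group, color in palette.items():
    idx = [i for i, lab in enumerate(labels) if lab == group]
    plt.scatter(xy[idx, 0], xy[idx, 1], c=color, label=group)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
```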
read the original abstract

Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA instantiated on fire-safety regulations. It combines real-world citation-aware questions with synthetic partial-context scenarios to evaluate hierarchical retrieval (via graph-guided methods) and model safety (hallucination and refusal) across multiple LLMs. The central empirical claims are that graph-guided retrieval yields substantial performance gains while domain-adapted models exhibit a higher tendency to hallucinate when key statutory evidence is absent.

Significance. If the empirical results hold under more rigorous validation, the work usefully shifts legal QA evaluation away from case-law dominance toward regulatory statutes and supplies a dual-source framework that jointly tests retrieval and safety. The reported trade-off between adaptation and hallucination risk is a concrete, falsifiable observation that could guide future benchmark design and model development in high-stakes regulatory domains.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).
  2. [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.
  3. [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.
minor comments (2)
  1. [§4] Notation for the dual-source framework (real vs. synthetic) is introduced without a clear diagram or table summarizing the two evaluation tracks.
  2. [§5] Model selection criteria and exact prompting templates used for the hallucination tests are not listed in the main text or appendix.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving the manuscript's rigor and clarity. We address each major comment point by point below, with planned revisions indicated.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).

    Authors: We chose fire-safety regulations as a representative case because the domain exhibits the hierarchical linking, cross-references, and distributed evidence typical of regulatory statutes. We agree that broader validation would strengthen generalizability claims. In revision, we will add citation-pattern statistics from the corpus (e.g., average depth of hierarchies and cross-reference density; see the sketch after these responses) and a discussion of structural commonalities with domains like tax and environmental law. We will also explicitly note the absence of cross-domain replication as a limitation and suggest it for future work. This provides a stronger foundation without new data collection. revision: partial

  2. Referee: [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.

    Authors: We agree that reproducibility requires these details. The revised manuscript will expand §4 to describe: (1) generation of partial contexts by removing graph-linked statutory sections corresponding to question citations; (2) refusal scoring as a binary label based on explicit abstention language versus any substantive response; and (3) leakage controls via automated checks confirming no complete statutory text appears in synthetic inputs (both are sketched after these responses). These clarifications will show the domain-adaptation penalty is not an artifact of the construction. revision: yes

  3. Referee: [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.

    Authors: We will add the requested ablations and tests. The revised §5 will include direct comparisons to BM25 and dense retrievers on hierarchically chunked documents, plus statistical significance testing (e.g., paired t-tests or McNemar's test; the computation is sketched below) across LLMs. These additions will more rigorously attribute gains to the structure-aware graph guidance. revision: yes
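
The three planned revisions are all mechanically simple; the sketches below give one plausible reading of each, with invented names and numbers throughout.

Citation-pattern statistics (response 1): once the corpus is a graph of provisions, hierarchy depth and cross-reference density fall out directly. A sketch using networkx; the toy edges are placeholders for the real statute graph.

```python
# Sketch of the promised corpus statistics: average hierarchy depth and
# cross-reference density over a statute graph. Nodes are provisions; edges
# are containment or explicit citations. The toy graph below is invented.
import networkx as nx

g = nx.DiGraph()
g.add_edges_from([
    ("Act", "Art.9"), ("Act", "Art.10"),        # containment (hierarchy)
    ("Art.9", "Decree Table 7"),                # explicit cross-reference
    ("Decree Table 7", "Note 1"),
])

depths = nx.shortest_path_length(g, source="Act")  # depth of each provision
avg_depth = sum(depths.values()) / len(depths)
xref_density = g.number_of_edges() / g.number_of_nodes()
print(f"avg depth {avg_depth:.2f}, cross-ref density {xref_density:.2f}")
```

Refusal scoring and leakage control (response 2): the binary rule described, explicit abstention language versus any substantive answer, plus a verbatim-text check on the synthetic inputs. The abstention phrases and substring test are illustrative assumptions, not the paper's rubric.

```python
# Sketch of binary refusal scoring plus a crude leakage check. The phrase
# list and the substring test are illustrative, not the paper's protocol.
import re

ABSTAIN_PATTERNS = [
    r"\bnot enough (statutory )?(context|information|evidence)\b",
    r"\bcannot (answer|determine)\b",
    r"\bno applicable provision\b",
]

def is_refusal(response: str) -> bool:
    """True if the model explicitly abstains rather than answering."""
    return any(re.search(p, response, re.IGNORECASE) for p in ABSTAIN_PATTERNS)

def leaks_full_text(context: str, withheld_statute: str) -> bool:
    """Leakage control: the withheld statute must not appear verbatim."""
    return withheld_statute.strip() in context

# A response that admits the gap counts as safe; anything substantive
# under missing evidence counts as a potential hallucination.
assert is_refusal("I cannot answer: not enough statutory context is given.")
assert not is_refusal("Under Article 9, sprinklers are required.")
```

Significance testing (response 3): with paired per-question correctness for two retrievers, McNemar's test on the discordant pairs is the natural fit among the options the authors name. The contingency counts below are invented placeholders.

```python
# Sketch of McNemar's test on paired correctness (graph-guided vs. baseline).
# Counts are invented; only the discordant cells (18 and 67) drive the test.
from statsmodels.stats.contingency_tables import mcnemar

#           graph-guided correct | graph-guided wrong
table = [[412, 18],    # baseline correct
         [67, 103]]    # baseline wrong
result = mcnemar(table, exact=False, correction=True)
print(f"chi2={result.statistic:.2f}, p={result.pvalue:.4f}")
```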

Circularity Check

0 steps flagged

No circularity; empirical benchmark evaluation is self-contained

full rationale

The paper introduces the SearchFireSafety benchmark instantiated on fire-safety regulations and reports direct experimental results across LLMs for graph-guided retrieval performance and hallucination/refusal behavior under partial context. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any claim to its own inputs by construction. The core findings rest on described benchmark construction and model evaluations rather than self-definitional steps or renamed known results. The representativeness of the fire-safety domain is an explicit modeling choice, not a derived quantity that collapses into prior self-citations or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that fire-safety regulations capture the general challenges of hierarchical statutory evidence and safety in regulatory QA.

axioms (1)
  • domain assumption: Fire-safety regulations are a representative case for statute-centric legal QA
    Explicitly stated as the instantiation choice in the abstract.

pith-pipeline@v0.9.0 · 5501 in / 1158 out tokens · 40074 ms · 2026-05-16T11:24:56.943829+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read reviews and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    Small Language Models are the Future of Agentic AI

    Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small language models are the future of agentic AI. Preprint, arXiv:2506.02153.

  2. [2]

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A benchmark dataset for legal language understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

  3. [3]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.

  4. [4]

    Qwen3 Technical Report

    Qwen3 technical report. 2025. arXiv preprint arXiv:2505.09388.

  5. [5]

    Enforcement Decree Table 7, Note 1

    Internal anchor into the statutory corpus, referenced in the paper's appendix prompt templates (e.g., Table 8, Prompt Template for Multi-Hop QA Generation, Section 3.3).