Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
Pith reviewed 2026-05-16 11:24 UTC · model grok-4.3 · Recognition: 1 Lean theorem link
The pith
Graph-guided retrieval improves performance on statute-centric legal questions, but domain-adapted models hallucinate more when statutory evidence is missing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that statute-centric legal QA involves hierarchically fragmented evidence, leading to a retrieval gap in standard methods and hallucinations in incomplete contexts. By creating SearchFireSafety with real-world citation-aware questions and synthetic partial-context tests, it demonstrates that graph-guided retrieval substantially boosts model performance across LLMs, while revealing that domain-adapted models exhibit higher hallucination rates when key statutory evidence is absent.
What carries the argument
SearchFireSafety benchmark, a dual-source evaluation framework that combines real regulatory questions requiring hierarchical citation retrieval with synthetic scenarios testing refusal under missing context, using graph-guided retrieval as the improved method.
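The retrieval idea described above can be pictured as a bounded expansion over statutory cross-reference edges: starting from the sections a baseline retriever surfaces, follow citation and hierarchy links to pull in the fragments a flat retriever would miss. The sketch below is illustrative only, not the authors' implementation; the `edges` adjacency map and the statute node names are hypothetical.

```python
def graph_guided_retrieve(edges, seed_ids, max_hops=2):
    """Expand a seed retrieval set by following statutory cross-reference
    and hierarchy edges for up to `max_hops` hops (illustrative sketch)."""
    retrieved = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in retrieved:
                    retrieved.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return retrieved

# Toy statute graph: an article cites an enforcement-decree table,
# which in turn cites an explanatory note (hypothetical node names).
edges = {"Art. 10": ["Decree Table 7"], "Decree Table 7": ["Note 1"]}
print(sorted(graph_guided_retrieve(edges, {"Art. 10"})))
# -> ['Art. 10', 'Decree Table 7', 'Note 1']
```

A flat retriever that matched only "Art. 10" would miss the annexed table and its note, which is exactly the hierarchically fragmented evidence the benchmark is built around.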
If this is right
- Models using graph-guided retrieval can better handle distributed statutory evidence in regulatory domains.
- Domain adaptation for legal tasks increases the risk of unsafe hallucinations in incomplete contexts.
- Benchmarks for legal QA should incorporate both hierarchical retrieval evaluation and safety testing for abstention.
- Regulatory AI systems may require specific safeguards when evidence from linked statutes is partial.
- The trade-off suggests that retrieval improvements alone do not ensure safe behavior in statute-based reasoning.
Where Pith is reading between the lines
- Similar hierarchical structures in other regulatory areas like tax or environmental law could benefit from the same graph approach.
- The findings imply that training data for domain-adapted models should include more examples of abstaining on incomplete info.
- Extending the benchmark to multiple jurisdictions might reveal if the safety trade-off is universal.
- Developers of legal AI tools should prioritize refusal training alongside retrieval enhancements.
Load-bearing premise
The assumption that fire-safety regulations are representative of broader statute-centric legal QA challenges and that the synthetic partial-context scenarios accurately simulate real-world cases of missing evidence.
What would settle it
Testing the same models on a different regulatory domain like building codes or tax statutes and measuring if the hallucination increase in domain-adapted models persists when evidence is withheld.
Figures
Original abstract
Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA instantiated on fire-safety regulations. It combines real-world citation-aware questions with synthetic partial-context scenarios to evaluate hierarchical retrieval (via graph-guided methods) and model safety (hallucination and refusal) across multiple LLMs. The central empirical claims are that graph-guided retrieval yields substantial performance gains while domain-adapted models exhibit a higher tendency to hallucinate when key statutory evidence is absent.
Significance. If the empirical results hold under more rigorous validation, the work usefully shifts legal QA evaluation away from case-law dominance toward regulatory statutes and supplies a dual-source framework that jointly tests retrieval and safety. The reported trade-off between adaptation and hallucination risk is a concrete, falsifiable observation that could guide future benchmark design and model development in high-stakes regulatory domains.
major comments (3)
- [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).
- [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.
- [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.
minor comments (2)
- [§4] Notation for the dual-source framework (real vs. synthetic) is introduced without a clear diagram or table summarizing the two evaluation tracks.
- [§5] Model selection criteria and exact prompting templates used for the hallucination tests are not listed in the main text or appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the manuscript's rigor and clarity. We address each major comment point by point below, with planned revisions indicated.
Point-by-point responses
Referee: [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).
Authors: We chose fire-safety regulations as a representative case because the domain exhibits the hierarchical linking, cross-references, and distributed evidence typical of regulatory statutes. We agree that broader validation would strengthen the generalizability claims. In revision, we will add citation-pattern statistics from the corpus (e.g., average hierarchy depth and cross-reference density) and a discussion of structural commonalities with domains such as tax and environmental law. We will also explicitly note the absence of cross-domain replication as a limitation and suggest it for future work. This strengthens the foundation without new data collection.
revision: partial
Referee: [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.
Authors: We agree that reproducibility requires these details. The revised manuscript will expand §4 to describe: (1) generation of partial contexts by removing graph-linked statutory sections corresponding to question citations; (2) refusal scoring as a binary label based on explicit abstention language versus any substantive response; and (3) leakage controls via automated checks confirming that no complete statutory text appears in synthetic inputs. These clarifications will show that the domain-adaptation penalty is not an artifact of the construction.
revision: yes
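Steps (1) and (2) above can be made concrete with a short sketch. The regex patterns, document IDs, and helper names below are hypothetical, and the paper's actual scoring rubric may differ; this only illustrates the shape of a binary refusal label and a withheld-evidence context.

```python
import re

# Hypothetical abstention phrases; a real rubric would be validated.
REFUSAL_PATTERNS = [
    r"cannot (?:be )?determin",
    r"insufficient (?:context|information)",
    r"not (?:provided|specified) in the (?:given )?(?:context|documents)",
]

def is_refusal(answer: str) -> bool:
    """Binary refusal label: explicit abstention language vs. any
    substantive response."""
    low = answer.lower()
    return any(re.search(p, low) for p in REFUSAL_PATTERNS)

def make_partial_context(docs: dict, gold_citations: set, drop: set) -> dict:
    """Build a synthetic missing-evidence context by withholding the
    graph-linked sections named in `drop` from the gold evidence set."""
    assert drop <= gold_citations, "can only withhold gold evidence"
    return {cid: text for cid, text in docs.items() if cid not in drop}

docs = {"Art. 10": "...", "Decree Table 7": "...", "Note 1": "..."}
partial = make_partial_context(
    docs, {"Art. 10", "Decree Table 7", "Note 1"}, drop={"Note 1"})
print(sorted(partial))  # -> ['Art. 10', 'Decree Table 7']
print(is_refusal("The answer cannot be determined from the given context."))
```

Under this scoring, a domain-adapted model that answers substantively from the partial context would be counted as a hallucination rather than a refusal.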
Referee: [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.
Authors: We will add the requested ablations and tests. The revised §5 will include direct comparisons to BM25 and dense retrievers on hierarchically chunked documents, plus statistical significance testing (e.g., paired t-tests or McNemar's test) across LLMs. These additions will more rigorously attribute the gains to structure-aware graph guidance.
revision: yes
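McNemar's test, one of the tests mentioned above, compares two systems evaluated on the same items using only the discordant pairs (items exactly one system gets right). A minimal exact-binomial version follows; the discordant counts are hypothetical, not results from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = items only the baseline answered correctly,
    c = items only the graph-guided system answered correctly."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial tail under H0: P(win) = 0.5
    p = sum(comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(1.0, p)

# Hypothetical counts: graph-guided retrieval wins 15 items the
# baseline missed and loses only 3.
print(round(mcnemar_exact(3, 15), 4))  # -> 0.0075
```

Because both retrievers are scored on the same question set, this paired test is more appropriate than comparing aggregate accuracies as if they were independent samples.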
Circularity Check
No circularity; empirical benchmark evaluation is self-contained
full rationale
The paper introduces the SearchFireSafety benchmark instantiated on fire-safety regulations and reports direct experimental results across LLMs for graph-guided retrieval performance and hallucination/refusal behavior under partial context. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any claim to its own inputs by construction. The core findings rest on described benchmark construction and model evaluations rather than self-definitional steps or renamed known results. The representativeness of the fire-safety domain is an explicit modeling choice, not a derived quantity that collapses into prior self-citations or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fire-safety regulations are a representative case for statute-centric legal QA.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA... graph-guided retrieval substantially improves performance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.