Beyond Case Law: Evaluating Structure-Aware Retrieval and Safety in Statute-Centric Legal QA
Pith reviewed 2026-05-16 11:24 UTC · model grok-4.3 · Recognition: 1 Lean theorem link
The pith
Graph-guided retrieval improves performance on statute-centric legal questions, but domain-adapted models hallucinate more when statutory evidence is missing.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that statute-centric legal QA involves hierarchically fragmented evidence, leading to a retrieval gap in standard methods and hallucinations in incomplete contexts. By creating SearchFireSafety with real-world citation-aware questions and synthetic partial-context tests, it demonstrates that graph-guided retrieval substantially boosts model performance across LLMs, while revealing that domain-adapted models exhibit higher hallucination rates when key statutory evidence is absent.
What carries the argument
SearchFireSafety benchmark, a dual-source evaluation framework that combines real regulatory questions requiring hierarchical citation retrieval with synthetic scenarios testing refusal under missing context, using graph-guided retrieval as the improved method.
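The retrieval idea described above can be pictured as a bounded expansion over statutory cross-reference edges: starting from the sections a baseline retriever surfaces, follow citation and hierarchy links to pull in the fragments a flat retriever would miss. The sketch below is illustrative only, not the authors' implementation; the `edges` adjacency map and the statute node names are hypothetical.

```python
def graph_guided_retrieve(edges, seed_ids, max_hops=2):
    """Expand a seed retrieval set by following statutory cross-reference
    and hierarchy edges for up to `max_hops` hops (illustrative sketch)."""
    retrieved = set(seed_ids)
    frontier = set(seed_ids)
    for _ in range(max_hops):
        next_frontier = set()
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor not in retrieved:
                    retrieved.add(neighbor)
                    next_frontier.add(neighbor)
        frontier = next_frontier
    return retrieved

# Toy statute graph: an article cites an enforcement-decree table,
# which in turn cites an explanatory note (hypothetical node names).
edges = {"Art. 10": ["Decree Table 7"], "Decree Table 7": ["Note 1"]}
print(sorted(graph_guided_retrieve(edges, {"Art. 10"})))
# -> ['Art. 10', 'Decree Table 7', 'Note 1']
```

A flat retriever that matched only "Art. 10" would miss the annexed table and its note, which is exactly the hierarchically fragmented evidence the benchmark is built around.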
If this is right
- Models using graph-guided retrieval can better handle distributed statutory evidence in regulatory domains.
- Domain adaptation for legal tasks increases the risk of unsafe hallucinations in incomplete contexts.
- Benchmarks for legal QA should incorporate both hierarchical retrieval evaluation and safety testing for abstention.
- Regulatory AI systems may require specific safeguards when evidence from linked statutes is partial.
- The trade-off suggests that retrieval improvements alone do not ensure safe behavior in statute-based reasoning.
Where Pith is reading between the lines
- Similar hierarchical structures in other regulatory areas like tax or environmental law could benefit from the same graph approach.
- The findings imply that training data for domain-adapted models should include more examples of abstaining on incomplete info.
- Extending the benchmark to multiple jurisdictions might reveal if the safety trade-off is universal.
- Developers of legal AI tools should prioritize refusal training alongside retrieval enhancements.
Load-bearing premise
The assumption that fire-safety regulations are representative of broader statute-centric legal QA challenges and that the synthetic partial-context scenarios accurately simulate real-world cases of missing evidence.
What would settle it
Testing the same models on a different regulatory domain like building codes or tax statutes and measuring if the hallucination increase in domain-adapted models persists when evidence is withheld.
Figures
Original abstract
Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-source evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA instantiated on fire-safety regulations. It combines real-world citation-aware questions with synthetic partial-context scenarios to evaluate hierarchical retrieval (via graph-guided methods) and model safety (hallucination and refusal) across multiple LLMs. The central empirical claims are that graph-guided retrieval yields substantial performance gains while domain-adapted models exhibit a higher tendency to hallucinate when key statutory evidence is absent.
Significance. If the empirical results hold under more rigorous validation, the work usefully shifts legal QA evaluation away from case-law dominance toward regulatory statutes and supplies a dual-source framework that jointly tests retrieval and safety. The reported trade-off between adaptation and hallucination risk is a concrete, falsifiable observation that could guide future benchmark design and model development in high-stakes regulatory domains.
major comments (3)
- [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).
- [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.
- [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.
minor comments (2)
- [§4] Notation for the dual-source framework (real vs. synthetic) is introduced without a clear diagram or table summarizing the two evaluation tracks.
- [§5] Model selection criteria and exact prompting templates used for the hallucination tests are not listed in the main text or appendix.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving the manuscript's rigor and clarity. We address each major comment point by point below, with planned revisions indicated.
Point-by-point responses
Referee: [Abstract, §3] Abstract and §3 (Benchmark Construction): the claim that fire-safety regulations constitute a 'representative case' for general statute-centric QA is load-bearing for the generalizability of the safety trade-off, yet the manuscript provides no cross-domain replication, citation-pattern statistics, or external validation against practitioner queries from other regulatory areas (e.g., tax or environmental statutes).
Authors: We chose fire-safety regulations as a representative case because the domain exhibits the hierarchical linking, cross-references, and distributed evidence typical of regulatory statutes. We agree that broader validation would strengthen the generalizability claims. In revision, we will add citation-pattern statistics from the corpus (e.g., average hierarchy depth and cross-reference density) and a discussion of structural commonalities with domains such as tax and environmental law. We will also explicitly note the absence of cross-domain replication as a limitation and suggest it for future work. This strengthens the foundation without new data collection.
revision: partial
Referee: [§4, §5] §4 (Evaluation Framework) and §5 (Experiments): the synthetic partial-context scenarios are central to the hallucination/refusal results, but the paper supplies insufficient detail on how missing-evidence contexts are generated, how refusal is scored, and what controls prevent leakage of full statutory text; without these, it is impossible to determine whether the observed domain-adaptation penalty is an artifact of the synthetic construction.
Authors: We agree that reproducibility requires these details. The revised manuscript will expand §4 to describe: (1) generation of partial contexts by removing graph-linked statutory sections corresponding to question citations; (2) refusal scoring as a binary label based on explicit abstention language versus any substantive response; and (3) leakage controls via automated checks confirming that no complete statutory text appears in synthetic inputs. These clarifications will show that the domain-adaptation penalty is not an artifact of the construction.
revision: yes
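Steps (1) and (2) above can be made concrete with a short sketch. The regex patterns, document IDs, and helper names below are hypothetical, and the paper's actual scoring rubric may differ; this only illustrates the shape of a binary refusal label and a withheld-evidence context.

```python
import re

# Hypothetical abstention phrases; a real rubric would be validated.
REFUSAL_PATTERNS = [
    r"cannot (?:be )?determin",
    r"insufficient (?:context|information)",
    r"not (?:provided|specified) in the (?:given )?(?:context|documents)",
]

def is_refusal(answer: str) -> bool:
    """Binary refusal label: explicit abstention language vs. any
    substantive response."""
    low = answer.lower()
    return any(re.search(p, low) for p in REFUSAL_PATTERNS)

def make_partial_context(docs: dict, gold_citations: set, drop: set) -> dict:
    """Build a synthetic missing-evidence context by withholding the
    graph-linked sections named in `drop` from the gold evidence set."""
    assert drop <= gold_citations, "can only withhold gold evidence"
    return {cid: text for cid, text in docs.items() if cid not in drop}

docs = {"Art. 10": "...", "Decree Table 7": "...", "Note 1": "..."}
partial = make_partial_context(
    docs, {"Art. 10", "Decree Table 7", "Note 1"}, drop={"Note 1"})
print(sorted(partial))  # -> ['Art. 10', 'Decree Table 7']
print(is_refusal("The answer cannot be determined from the given context."))
```

Under this scoring, a domain-adapted model that answers substantively from the partial context would be counted as a hallucination rather than a refusal.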
Referee: [§5] §5 (Results): the headline performance gains from graph-guided retrieval are reported without baseline retriever ablations (e.g., standard BM25 or dense retrievers with hierarchical chunking) or statistical significance tests across the multiple LLMs, weakening the claim that the improvement is specifically attributable to structure awareness.
Authors: We will add the requested ablations and tests. The revised §5 will include direct comparisons to BM25 and dense retrievers on hierarchically chunked documents, plus statistical significance testing (e.g., paired t-tests or McNemar's test) across LLMs. These additions will more rigorously attribute the gains to structure-aware graph guidance.
revision: yes
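McNemar's test, one of the tests mentioned above, compares two systems evaluated on the same items using only the discordant pairs (items exactly one system gets right). A minimal exact-binomial version follows; the discordant counts are hypothetical, not results from the paper.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on discordant pairs:
    b = items only the baseline answered correctly,
    c = items only the graph-guided system answered correctly."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Two-sided exact binomial tail under H0: P(win) = 0.5
    p = sum(comb(n, i) for i in range(k + 1)) * 2 / 2 ** n
    return min(1.0, p)

# Hypothetical counts: graph-guided retrieval wins 15 items the
# baseline missed and loses only 3.
print(round(mcnemar_exact(3, 15), 4))  # -> 0.0075
```

Because both retrievers are scored on the same question set, this paired test is more appropriate than comparing aggregate accuracies as if they were independent samples.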
Circularity Check
No circularity; empirical benchmark evaluation is self-contained
full rationale
The paper introduces the SearchFireSafety benchmark instantiated on fire-safety regulations and reports direct experimental results across LLMs for graph-guided retrieval performance and hallucination/refusal behavior under partial context. No equations, derivations, fitted parameters, or self-citation chains are present that reduce any claim to its own inputs by construction. The core findings rest on described benchmark construction and model evaluations rather than self-definitional steps or renamed known results. The representativeness of the fire-safety domain is an explicit modeling choice, not a derived quantity that collapses into prior self-citations or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Fire-safety regulations are a representative case for statute-centric legal QA.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tagged: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Quoted passage: "We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA... graph-guided retrieval substantially improves performance"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.