Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering
Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3
The pith
Reliable LLM legal QA requires treating temporal validity as a hard constraint rather than an optional retrieval step.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Post-cutoff staleness and recency bias are systematic failure modes in LLM statutory question answering; retrieval-augmented generation that enforces temporal validity through fact-date extraction and version filtering substantially reduces both errors, whereas vanilla inference degrades sharply and web search exhibits marked recency bias on historically anchored tasks.
What carries the argument
The 312-question benchmark spanning Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions, together with RAG pipelines that extract the fact date and filter retrieved provisions to the temporally valid version.
If this is right
- Legal QA systems must treat the correct statutory version as a hard filter rather than a soft preference.
- Web search alone is unreliable for historical legal questions because it tends to surface newer provisions.
- Date-aware version filtering in retrieval provides consistent gains across model families for both post-cutoff and pre-amendment queries.
- The three question categories reveal distinct failure patterns that require separate evaluation.
Where Pith is reading between the lines
- The same date-filtering requirement likely applies to other domains with versioned rules such as tax codes or regulatory compliance.
- Models could be trained or prompted to output an explicit fact date before answering, reducing dependence on external retrieval.
- Expanding the benchmark to include conflicting amendments across multiple provisions would test whether current filtering scales to more complex cases.
Load-bearing premise
The 312 expert-validated questions are representative of real statutory QA tasks and the LLM-as-a-judge, validated against human experts, accurately measures answer correctness across the tested models and settings.
What would settle it
A drop in accuracy when the same models and RAG pipelines are evaluated on a new set of time-sensitive questions drawn from a different jurisdiction or from statutes with more frequent amendments.
Figures
read the original abstract
Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via a fact date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLMs suffer from two temporal failure modes in statutory QA—post-cutoff staleness and recency bias—demonstrated via a new benchmark of 312 expert-validated German statutory QA pairs across three categories. Evaluations of five LLMs (OpenAI, Anthropic, DeepSeek) in vanilla, web-search, and two RAG-with-temporal-filtering settings, using an LLM-as-a-judge validated against human experts, show severe vanilla degradation on post-cutoff questions, substantial RAG gains, and unstable web-search results with recency bias; the authors conclude that reliable legal QA requires treating temporal validity as a hard constraint.
Significance. If the benchmark is representative and the judge faithful, the work supplies concrete empirical evidence of temporal limitations in LLM legal applications and identifies a practical mitigation (RAG with fact-date extraction and version filtering) that outperforms vanilla and web-search baselines. The expert-validated benchmark and human-checked judge are methodological strengths that could inform future work on dynamic knowledge in specialized domains.
major comments (2)
- [Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.
- [Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.
minor comments (1)
- [Abstract] Abstract: quantitative metrics, confidence intervals, and exact degradation/gain figures are omitted, reducing the standalone informativeness of the summary.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive major comments. We address each point below and will make revisions to improve the clarity and rigor of the manuscript.
read point-by-point responses
-
Referee: [Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.
Authors: We agree that additional details on benchmark construction are needed to strengthen claims of broader applicability. The 312 questions were curated by legal experts specifically to isolate temporal failure modes (post-cutoff staleness and recency bias) in German statutory law, with selection focused on amendments near common LLM cutoffs and historical provisions. We will revise the benchmark section to explicitly describe the sampling frame (expert-driven targeting of amendment dates), exclusion criteria (e.g., removal of ambiguous or non-statutory items), inter-annotator agreement statistics from the expert validation, and coverage metrics across jurisdictions, amendment types, and fact patterns. These additions will better support the generalizability of the observed performance deltas. revision: yes
-
Referee: [Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.
Authors: We acknowledge that the manuscript asserts validation of the LLM-as-a-judge against human experts without supplying the requested quantitative details. We will revise the evaluation section to report the validation-set size, agreement rates (both overall and broken down by model and inference setting), and statistical measures such as percentage agreement or Cohen's kappa. This will directly substantiate the reliability of the correctness labels used for all performance comparisons. revision: yes
Circularity Check
No significant circularity; purely empirical benchmark evaluation
full rationale
The paper presents an empirical study: creation of a 312-question benchmark, evaluation of five LLMs under four settings, and use of an LLM-as-a-judge validated against human experts. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims reduce to direct measurements on the benchmark rather than any self-referential construction. This matches the default expectation of no circularity for empirical papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM-as-a-judge outputs can be validated against human expert ratings for correctness on statutory QA tasks
Reference graph
Works this paper leans on
-
[1]
Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2015–2027. doi:10.18653/v1/2024.eacl-long.122
-
[2]
buzer.de. 2026. Online Database of German Laws. https://www.buzer.de. Accessed December 16, 2025
work page 2026
-
[3]
Marlowe, Carina Suzana Negreanu, Kitty Boxall, and Diana Mincu
Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Ar- ijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Han- nah R. Marlowe, Carina Suzana Negreanu, Kitty Boxall, and Diana Mincu. 2025. LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation. InProceedings of the Natural Legal Language Processing Worksho...
-
[4]
2024.Auswertung der Umfrage zur KI-Nutzung der bayerischen Anwaltschaft
Bayerischer Anwaltverband e.V. 2024.Auswertung der Umfrage zur KI-Nutzung der bayerischen Anwaltschaft. Technical Report. Bayerischer Anwaltverband e.V. https://www.bayerischer-anwaltverband.de/site/assets/files/1765/umfrage_ zur_ki_nutzung_der_bay_anwaltschaft.pdf
work page 2024
-
[5]
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. https://arxiv.org/abs/2505.12864
-
[6]
Hanpei Fang, Sijie Tao, Nuo Chen, Kai-Xin Chang, and Tetsuya Sakai. 2025. Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking. InProceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. Association for Computing Machinery, 85...
-
[7]
Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom
-
[8]
InAdvances in Neural Information Processing Systems, Vol
Mind the Gap: Assessing Temporal Generalization in Neural Language Models. InAdvances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 29348–29363. https://proceedings.neurips.cc/paper_files/paper/ 2021/file/f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf
work page 2021
-
[9]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 9459–947...
work page 2020
-
[10]
Max Prior, Adrian Hof, Niklas Wais, and Matthias Grabmair. 2025. Risks and Limits of Automatic Consolidation of Statutes. InProceedings of the Natural Legal Language Processing Workshop 2025. Association for Computational Linguistics, 396–407. doi:10.18653/v1/2025.nllp-1.29
-
[11]
2025.Generative AI in Professional Services Report
Thomson Reuters. 2025.Generative AI in Professional Services Report. Technical Report. https://www.thomsonreuters.com/content/dam/ewp- m/documents/thomsonreuters/en/pdf/reports/2025-generative-ai-in- professional-services-report-tr5433489-rgb.pdf
work page 2025
-
[12]
Manning, Peter Hender- son, and Daniel E
Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Hender- son, and Daniel E. Ho. 2025. AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County. https://arxiv.org/abs/2503.03888
-
[13]
Santosh T.y.s.s and Tuan-Quang Vuong. 2025. LexTempus: Enhancing Temporal Generalizability of Legal Language Models Through Dynamic Mixture of Experts. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6608–6624. doi:10.18653/v1/2025.acl-long.329
-
[14]
Santosh T.y.s.s, Tuan-Quang Vuong, and Matthias Grabmair. 2024. ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classifica- tion Tasks. InProceedings of the 62nd Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers). 3022–3039. doi:10.18653/v1/2024.acl- long.166
-
[15]
Juraj Vladika, Mahdi Dhaini, and Florian Matthes. 2025. Facts Fade Fast: Evaluat- ing Memorization of Outdated Medical Knowledge in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025. 9161–9174. doi:10.18653/v1/2025.findings-emnlp.487
-
[16]
Bernard L Welch. 1947. The generalization of ’Student’s’ problem when several different population variances are involved.Biometrika34, 1/2 (1947), 28–35. doi:10.1093/biomet/34.1-2.28
-
[17]
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2025. Know Your Limits: A Survey of Abstention in Large Language Models.Transactions of the Association for Computational Linguistics 13 (2025), 529–556. doi:10.1162/tacl_a_00754
- [18]
-
[19]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT- Bench and Chatbot Arena. doi:10.48550/arXiv.2306.05685
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.05685 2023
-
[20]
Manning, Peter Henderson, and Daniel E
Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christo- pher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning- Focused Legal Retrieval Benchmark. InProceedings of the 2025 Symposium on Computer Science and Law. Association for Computing Machinery, 169–193. doi:10.1145/3709025.3712219
-
[21]
Chenghao Zhu, Nuo Chen, Yufei Gao, Yunyi Zhang, Prayag Tiwari, and Benyou Wang. 2025. Is Your LLM Outdated? A Deep Look at Temporal Generalization. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.