
arxiv: 2605.23497 · v1 · pith:XVPJB77Tnew · submitted 2026-05-22 · 💻 cs.CL

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3

classification 💻 cs.CL
keywords legal question answering · temporal failure modes · statutory law · retrieval augmented generation · LLM evaluation · German law · version filtering

The pith

Reliable LLM legal QA requires treating temporal validity as a hard constraint rather than an optional retrieval step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how large language models handle statutory questions whose correct answer depends on the law as it stood on a specific past date. It identifies two clear failure modes: models apply rules superseded after their training cutoff, and they favor newer provisions even when older ones govern the facts. A new benchmark of 312 expert-validated German questions across post-cutoff, pre-amendment, and multi-provision categories shows that standard prompting collapses on these cases. Two retrieval-augmented methods that first extract the fact date and then restrict results to the matching statutory version restore performance across five tested models, while web search produces unstable gains and strong recency bias.

Core claim

Post-cutoff staleness and recency bias are systematic failure modes in LLM statutory question answering; retrieval-augmented generation that enforces temporal validity through fact-date extraction and version filtering substantially reduces both errors, whereas vanilla inference degrades sharply and web search exhibits marked recency bias on historically anchored tasks.

What carries the argument

The 312-question benchmark spanning Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions, together with RAG pipelines that extract the fact date and filter retrieved provisions to the temporally valid version.
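The version-filtering step can be illustrated with a minimal sketch. It assumes each retrieved provision version carries an in-force interval; the `ProvisionVersion` schema, the field names, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ProvisionVersion:
    """One historical version of a statutory provision (hypothetical schema)."""
    provision_id: str
    text: str
    valid_from: date           # first day this version was in force
    valid_to: Optional[date]   # last day in force; None means still in force


def in_force(version: ProvisionVersion, fact_date: date) -> bool:
    """Return True if the version governed on fact_date (bounds inclusive)."""
    if fact_date < version.valid_from:
        return False
    return version.valid_to is None or fact_date <= version.valid_to


def filter_to_fact_date(candidates, fact_date):
    """Hard filter: drop every retrieved version not in force on fact_date."""
    return [v for v in candidates if in_force(v, fact_date)]


# Illustrative data only: "§ X" and both wordings are placeholders, not real law.
versions = [
    ProvisionVersion("§ X", "old wording", date(2015, 1, 1), date(2021, 12, 31)),
    ProvisionVersion("§ X", "new wording", date(2022, 1, 1), None),
]
governing = filter_to_fact_date(versions, date(2019, 6, 30))
print([v.text for v in governing])  # ['old wording']
```

Only the surviving versions would be passed to the model as context, so a newer wording cannot displace the one that governs the facts.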

If this is right

  • Legal QA systems must treat the correct statutory version as a hard filter rather than a soft preference.
  • Web search alone is unreliable for historical legal questions because it tends to surface newer provisions.
  • Date-aware version filtering in retrieval provides consistent gains across model families for both post-cutoff and pre-amendment queries.
  • The three question categories reveal distinct failure patterns that require separate evaluation.
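The difference between a hard filter and a soft preference can be seen in a toy ranking. This is a sketch under stated assumptions: the hits, similarity scores, and the `penalty` value are invented for illustration, and neither function is claimed to be the paper's retrieval code.

```python
from datetime import date

# Hypothetical retrieval hits: (label, similarity, valid_from, valid_to).
hits = [
    ("2022 version", 0.91, date(2022, 1, 1), None),
    ("2015 version", 0.88, date(2015, 1, 1), date(2021, 12, 31)),
]


def is_valid(hit, fact_date):
    """True if the hit's in-force interval contains fact_date."""
    _, _, start, end = hit
    return start <= fact_date and (end is None or fact_date <= end)


def soft_rank(hits, fact_date, penalty=0.02):
    """Soft preference: out-of-date hits lose a small score penalty but stay."""
    def score(hit):
        return hit[1] if is_valid(hit, fact_date) else hit[1] - penalty
    return [h[0] for h in sorted(hits, key=score, reverse=True)]


def hard_rank(hits, fact_date):
    """Hard filter: out-of-date hits are removed before ranking."""
    valid = [h for h in hits if is_valid(h, fact_date)]
    return [h[0] for h in sorted(valid, key=lambda h: h[1], reverse=True)]


fact_date = date(2019, 6, 30)
print(soft_rank(hits, fact_date))  # ['2022 version', '2015 version']
print(hard_rank(hits, fact_date))  # ['2015 version']
```

With a soft penalty the newer, slightly more similar version still ranks first for a 2019 fact pattern; the hard filter removes it outright.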

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same date-filtering requirement likely applies to other domains with versioned rules such as tax codes or regulatory compliance.
  • Models could be trained or prompted to output an explicit fact date before answering, reducing dependence on external retrieval.
  • Expanding the benchmark to include conflicting amendments across multiple provisions would test whether current filtering scales to more complex cases.
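An explicit fact-date step could look like the following sketch. It is a deliberately simple stand-in that assumes dates appear in the German DD.MM.YYYY format; the paper's pipelines may well use an LLM for this step, and the function name and sample question are hypothetical.

```python
import re
from datetime import date

# Matches dates written as DD.MM.YYYY, the usual German notation.
GERMAN_DATE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")


def extract_fact_dates(question: str) -> list[date]:
    """Return every well-formed calendar date mentioned in the question."""
    found = []
    for day, month, year in GERMAN_DATE.findall(question):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Skip impossible dates such as 31.02.2020.
            continue
    return found


question = "On 15.03.2019 the landlord gave notice. Was the notice effective?"
print(extract_fact_dates(question))  # [datetime.date(2019, 3, 15)]
```

A real system would also need to handle relative expressions ("last spring") and questions with several relevant dates, which a regular expression cannot resolve.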

Load-bearing premise

The 312 expert-validated questions are representative of real statutory QA tasks and the LLM-as-a-judge, validated against human experts, accurately measures answer correctness across the tested models and settings.

What would settle it

A drop in accuracy when the same models and RAG pipelines are evaluated on a new set of time-sensitive questions drawn from a different jurisdiction or from statutes with more frequent amendments.

Figures

Figures reproduced from arXiv: 2605.23497 by Andreas Schultz, Matthias Grabmair, Max Prior.

Figure 1: Overview of Pipeline for Post-Cutoff Amendment
Figure 2: Overview of Pipeline for Multi-Provision Pre
Figure 3: RAG-kNN and RAG-ToC. We evaluate model answers following an LLM-as-a-judge approach, where a separate LLM scores each candidate answer against a fixed rubric and returns structured ratings. This approach follows prior work showing that rubric-guided LLM judges can align well with human evaluations while enabling scalable, open-ended assessment [3, 18]. We used Gemini 3 Flash Preview as the judge LLM which…
Figure 4: Post-Cutoff Questions: Comparison of metrics across models and methods.
Figure 5: Pre-Amendment Questions: Single vs. Multi-Provision (Outcome Correctness).
Figure 6: Recency Bias: Outcome Correctness split by time interval (Interval 1: Older, Interval 2: Recent).
Abstract

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via a fact date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs suffer from two temporal failure modes in statutory QA—post-cutoff staleness and recency bias—demonstrated via a new benchmark of 312 expert-validated German statutory QA pairs across three categories. Evaluations of five LLMs (OpenAI, Anthropic, DeepSeek) in vanilla, web-search, and two RAG-with-temporal-filtering settings, using an LLM-as-a-judge validated against human experts, show severe vanilla degradation on post-cutoff questions, substantial RAG gains, and unstable web-search results with recency bias; the authors conclude that reliable legal QA requires treating temporal validity as a hard constraint.

Significance. If the benchmark is representative and the judge faithful, the work supplies concrete empirical evidence of temporal limitations in LLM legal applications and identifies a practical mitigation (RAG with fact-date extraction and version filtering) that outperforms vanilla and web-search baselines. The expert-validated benchmark and human-checked judge are methodological strengths that could inform future work on dynamic knowledge in specialized domains.

major comments (2)
  1. [Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.
  2. [Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.
minor comments (1)
  1. [Abstract] Abstract: quantitative metrics, confidence intervals, and exact degradation/gain figures are omitted, reducing the standalone informativeness of the summary.
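The confidence intervals requested in the minor comment could be reported per setting with a Wilson score interval. The sketch below is one common choice, not the paper's method, and the counts in the example are hypothetical rather than results from the benchmark.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95%."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Hypothetical counts for one model in one setting.
low, high = wilson_interval(correct=80, total=100)
print(f"accuracy 0.80, 95% CI [{low:.3f}, {high:.3f}]")
```

The Wilson interval behaves better than the normal approximation when a category is small or accuracy is close to 0 or 1, which is plausible for the degraded vanilla setting.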

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive major comments. We address each point below and will make revisions to improve the clarity and rigor of the manuscript.

  1. Referee: [Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.

    Authors: We agree that additional details on benchmark construction are needed to strengthen claims of broader applicability. The 312 questions were curated by legal experts specifically to isolate temporal failure modes (post-cutoff staleness and recency bias) in German statutory law, with selection focused on amendments near common LLM cutoffs and historical provisions. We will revise the benchmark section to explicitly describe the sampling frame (expert-driven targeting of amendment dates), exclusion criteria (e.g., removal of ambiguous or non-statutory items), inter-annotator agreement statistics from the expert validation, and coverage metrics across jurisdictions, amendment types, and fact patterns. These additions will better support the generalizability of the observed performance deltas. revision: yes

  2. Referee: [Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.

    Authors: We acknowledge that the manuscript asserts validation of the LLM-as-a-judge against human experts without supplying the requested quantitative details. We will revise the evaluation section to report the validation-set size, agreement rates (both overall and broken down by model and inference setting), and statistical measures such as percentage agreement or Cohen's kappa. This will directly substantiate the reliability of the correctness labels used for all performance comparisons. revision: yes
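The agreement statistics named in the response can be computed as in the sketch below. It implements the standard two-rater Cohen's kappa; the label vectors are invented for illustration and do not come from the paper's validation set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical correctness labels (1 = correct, 0 = incorrect).
judge = [1, 1, 0, 0, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(judge, human), 3))  # 0.467
```

Here raw agreement is 0.75 but kappa is lower because both raters label most answers correct, which is why reporting kappa alongside percentage agreement is informative.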

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark evaluation


The paper presents an empirical study: creation of a 312-question benchmark, evaluation of five LLMs under four settings, and use of an LLM-as-a-judge validated against human experts. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims reduce to direct measurements on the benchmark rather than any self-referential construction. This matches the default expectation of no circularity for empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about expert annotation quality and the validity of LLM-as-judge for legal answer evaluation; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: LLM-as-a-judge outputs can be validated against human expert ratings for correctness on statutory QA tasks.
    Used to scale evaluation of model answers across the 312 questions.



Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2015–2027. doi:10.18653/v1/2024.eacl-long.122

  2. [2]

    buzer.de. 2026. Online Database of German Laws. https://www.buzer.de. Accessed December 16, 2025

  3. [3]

    Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R. Marlowe, Carina Suzana Negreanu, Kitty Boxall, and Diana Mincu. 2025. LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation. In Proceedings of the Natural Legal Language Processing Worksho...

  4. [4]

    Bayerischer Anwaltverband e.V. 2024. Auswertung der Umfrage zur KI-Nutzung der bayerischen Anwaltschaft. Technical Report. Bayerischer Anwaltverband e.V. https://www.bayerischer-anwaltverband.de/site/assets/files/1765/umfrage_zur_ki_nutzung_der_bay_anwaltschaft.pdf

  5. [5]

    Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. https://arxiv.org/abs/2505.12864

  6. [6]

    Hanpei Fang, Sijie Tao, Nuo Chen, Kai-Xin Chang, and Tetsuya Sakai. 2025. Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. Association for Computing Machinery, 85...

  7. [7]

    Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. Mind the Gap: Assessing Temporal Generalization in Neural Language Models. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 29348–29363. https://proceedings.neurips.cc/paper_files/paper/2021/file/f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf

  8. [8]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 9459–947...

  9. [9]

    Max Prior, Adrian Hof, Niklas Wais, and Matthias Grabmair. 2025. Risks and Limits of Automatic Consolidation of Statutes. In Proceedings of the Natural Legal Language Processing Workshop 2025. Association for Computational Linguistics, 396–407. doi:10.18653/v1/2025.nllp-1.29

  10. [10]

    Thomson Reuters. 2025. Generative AI in Professional Services Report. Technical Report. https://www.thomsonreuters.com/content/dam/ewp-m/documents/thomsonreuters/en/pdf/reports/2025-generative-ai-in-professional-services-report-tr5433489-rgb.pdf

  11. [11]

    Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County. https://arxiv.org/abs/2503.03888

  12. [12]

    Santosh T.y.s.s and Tuan-Quang Vuong. 2025. LexTempus: Enhancing Temporal Generalizability of Legal Language Models Through Dynamic Mixture of Experts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6608–6624. doi:10.18653/v1/2025.acl-long.329

  13. [13]

    Santosh T.y.s.s, Tuan-Quang Vuong, and Matthias Grabmair. 2024. ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classification Tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3022–3039. doi:10.18653/v1/2024.acl-long.166

  14. [14]

    Juraj Vladika, Mahdi Dhaini, and Florian Matthes. 2025. Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025. 9161–9174. doi:10.18653/v1/2025.findings-emnlp.487

  15. [15]

    Bernard L Welch. 1947. The generalization of 'Student's' problem when several different population variances are involved. Biometrika 34, 1/2 (1947), 28–35. doi:10.1093/biomet/34.1-2.28

  16. [16]

    Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2025. Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics 13 (2025), 529–556. doi:10.1162/tacl_a_00754

  17. [17]

    Li Zhang, Jaromír Savelka, and Kevin Ashley. 2025. Do LLMs Truly Understand When a Precedent Is Overruled? https://arxiv.org/abs/2510.20941

  18. [18]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. doi:10.48550/arXiv.2306.05685

  19. [19]

    Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning-Focused Legal Retrieval Benchmark. In Proceedings of the 2025 Symposium on Computer Science and Law. Association for Computing Machinery, 169–193. doi:10.1145/3709025.3712219

  20. [20]

    Chenghao Zhu, Nuo Chen, Yufei Gao, Yunyi Zhang, Prayag Tiwari, and Benyou Wang. 2025. Is Your LLM Outdated? A Deep Look at Temporal Generalization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational ...