Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

Andreas Schultz; Matthias Grabmair; Max Prior

arxiv: 2605.23497 · v1 · pith:XVPJB77Tnew · submitted 2026-05-22 · 💻 cs.CL

Pith reviewed 2026-05-25 04:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords: legal question answering · temporal failure modes · statutory law · retrieval augmented generation · LLM evaluation · German law · version filtering

The pith

Reliable LLM legal QA requires treating temporal validity as a hard constraint rather than an optional retrieval step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how large language models handle statutory questions whose correct answer depends on the law as it stood on a specific past date. It identifies two clear failure modes: models apply rules superseded after their training cutoff, and they favor newer provisions even when older ones govern the facts. A new benchmark of 312 expert-validated German questions across post-cutoff, pre-amendment, and multi-provision categories shows that standard prompting collapses on these cases. Two retrieval-augmented methods that first extract the fact date and then restrict results to the matching statutory version restore performance across five tested models, while web search produces unstable gains and strong recency bias.

Core claim

Post-cutoff staleness and recency bias are systematic failure modes in LLM statutory question answering; retrieval-augmented generation that enforces temporal validity through fact-date extraction and version filtering substantially reduces both errors, whereas vanilla inference degrades sharply and web search exhibits marked recency bias on historically anchored tasks.

What carries the argument

The 312-question benchmark spanning Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions, together with RAG pipelines that extract the fact date and filter retrieved provisions to the temporally valid version.
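The version-filtering step can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `ProvisionVersion` record, its `valid_from` and `valid_to` fields, the `filter_by_fact_date` helper, and the fictional statute "ExampleG" do not come from the paper, and the authors' RAG-kNN and RAG-ToC pipelines may be built quite differently. The sketch only shows what it means to treat the fact date as a hard filter over retrieved versions rather than as a ranking signal.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class ProvisionVersion:
    """One historical wording of a provision (hypothetical schema)."""
    provision_id: str
    valid_from: date            # first day this wording was in force
    valid_to: Optional[date]    # last day in force; None means still in force
    text: str


def in_force(version: ProvisionVersion, fact_date: date) -> bool:
    """True if this wording governed on fact_date (inclusive bounds)."""
    if fact_date < version.valid_from:
        return False
    return version.valid_to is None or fact_date <= version.valid_to


def filter_by_fact_date(candidates: List[ProvisionVersion],
                        fact_date: date) -> List[ProvisionVersion]:
    """Hard filter: drop every retrieved version not in force on fact_date."""
    return [v for v in candidates if in_force(v, fact_date)]


# Fictional statute with one amendment taking effect on 2023-07-01.
retrieved = [
    ProvisionVersion("§ 1 ExampleG", date(2015, 1, 1), date(2023, 6, 30), "old wording"),
    ProvisionVersion("§ 1 ExampleG", date(2023, 7, 1), None, "new wording"),
]

# A fact pattern dated 2020 must be answered under the old wording.
for v in filter_by_fact_date(retrieved, date(2020, 3, 15)):
    print(v.provision_id, "->", v.text)
```

Because the filter removes out-of-force versions before generation, a model with a preference for newer text never sees the newer wording for a 2020 fact pattern. An empty result is also informative: it signals that the corpus has no version covering the fact date.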

If this is right

Legal QA systems must treat the correct statutory version as a hard filter rather than a soft preference.
Web search alone is unreliable for historical legal questions because it tends to surface newer provisions.
Date-aware version filtering in retrieval provides consistent gains across model families for both post-cutoff and pre-amendment queries.
The three question categories reveal distinct failure patterns that require separate evaluation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same date-filtering requirement likely applies to other domains with versioned rules such as tax codes or regulatory compliance.
Models could be trained or prompted to output an explicit fact date before answering, reducing dependence on external retrieval.
Expanding the benchmark to include conflicting amendments across multiple provisions would test whether current filtering scales to more complex cases.
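As a rough illustration of the second extension, the sketch below parses an explicit fact date from a model response. The `FACT_DATE: YYYY-MM-DD` line format and the `parse_fact_date` helper are hypothetical conventions invented for this example; the paper does not prescribe an output format.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical convention: the model is asked to emit one line such as
# "FACT_DATE: 2019-04-02" before its answer.
FACT_DATE_LINE = re.compile(r"^FACT_DATE:\s*(\d{4})-(\d{2})-(\d{2})\s*$", re.MULTILINE)


def parse_fact_date(model_output: str) -> Optional[date]:
    """Return the declared fact date, or None if absent or not a real date."""
    match = FACT_DATE_LINE.search(model_output)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Well-formed pattern but impossible calendar date, e.g. month 13.
        return None


response = "FACT_DATE: 2019-04-02\nANSWER: The earlier wording applies."
print(parse_fact_date(response))
```

Returning `None` instead of guessing lets a caller refuse to answer, or ask for clarification, when no usable fact date was produced.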

Load-bearing premise

The 312 expert-validated questions are representative of real statutory QA tasks and the LLM-as-a-judge, validated against human experts, accurately measures answer correctness across the tested models and settings.

What would settle it

A drop in accuracy when the same models and RAG pipelines are evaluated on a new set of time-sensitive questions drawn from a different jurisdiction or from statutes with more frequent amendments.

Figures

Figures reproduced from arXiv: 2605.23497 by Andreas Schultz, Matthias Grabmair, Max Prior.

**Figure 2.** Overview of Pipeline for Multi-Provision Pre

**Figure 3.** RAG-kNN and RAG-ToC

We evaluate model answers following an LLM-as-a-judge approach, where a separate LLM scores each candidate answer against a fixed rubric and returns structured ratings. This approach follows prior work showing that rubric-guided LLM judges can align well with human evaluations while enabling scalable, open-ended assessment [3, 18]. We used Gemini 3 Flash Preview as the judge LLM which…

**Figure 4.** Post-Cutoff Questions: Comparison of metrics across models and methods.

**Figure 5.** Pre-Amendment Questions: Single vs. Multi-Provision (Outcome Correctness).

**Figure 6.** Recency Bias: Outcome Correctness split by time interval (Interval 1: Older, Interval 2: Recent).

Original abstract

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs by OpenAI, Anthropic and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via a fact date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a clean empirical comparison showing temporal filtering in RAG fixes most of the staleness and recency problems on their 312-question German statutory set, but the broad claim about hard constraints rests on unproven representativeness.

The main finding is that vanilla LLMs drop sharply on statutory questions that require the right version of the law. Retrieval that explicitly filters by fact date and version recovers much of that performance, while web search adds recency bias instead. The authors built a 312-pair benchmark with three clear categories (post-cutoff amendments, pre-amendment questions, and multi-provision pre-amendment questions) and ran five models across vanilla, web search, and two RAG setups. The LLM judge was checked against human experts, which is a reasonable step. The setup is new enough, and the head-to-head comparison of mitigation strategies direct enough, that the results on degradation and RAG gains look usable as evidence for this specific setting. The paper does a solid job of keeping the evaluation tied to time-anchored questions rather than generic retrieval tests.

The soft spot is the leap to “reliable legal QA requires treating temporal validity as a hard constraint.” That follows from the deltas on these 312 pairs, but the abstract gives no sampling frame, no inter-annotator numbers, and no breakdown of how the questions were chosen across jurisdictions or amendment types. If the set is narrow or skewed toward certain fact patterns, the observed gaps could be larger or smaller in practice. The judge validation is mentioned without the actual agreement figures or validation-set size, so some label noise remains possible. That is minor on its own, but it matters for how far the conclusion travels.

This is worth a serious referee for groups building legal or regulatory QA tools, because the problem is practical and the comparison is empirical. I would rate it a maybe for a reading group, would not cite it in my own work soon, and would accept it for peer review.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLMs suffer from two temporal failure modes in statutory QA—post-cutoff staleness and recency bias—demonstrated via a new benchmark of 312 expert-validated German statutory QA pairs across three categories. Evaluations of five LLMs (OpenAI, Anthropic, DeepSeek) in vanilla, web-search, and two RAG-with-temporal-filtering settings, using an LLM-as-a-judge validated against human experts, show severe vanilla degradation on post-cutoff questions, substantial RAG gains, and unstable web-search results with recency bias; the authors conclude that reliable legal QA requires treating temporal validity as a hard constraint.

Significance. If the benchmark is representative and the judge faithful, the work supplies concrete empirical evidence of temporal limitations in LLM legal applications and identifies a practical mitigation (RAG with fact-date extraction and version filtering) that outperforms vanilla and web-search baselines. The expert-validated benchmark and human-checked judge are methodological strengths that could inform future work on dynamic knowledge in specialized domains.

major comments (2)

[Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.
[Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.

minor comments (1)

[Abstract] Abstract: quantitative metrics, confidence intervals, and exact degradation/gain figures are omitted, reducing the standalone informativeness of the summary.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive major comments. We address each point below and will make revisions to improve the clarity and rigor of the manuscript.

Referee: [Benchmark construction] Benchmark section: the 312-question set is presented without a sampling frame, exclusion criteria, inter-annotator statistics, or coverage metrics across jurisdictions, amendment types, and fact patterns; this directly undermines the claim that observed deltas generalize beyond the tested slice and support the hard-constraint conclusion.

Authors: We agree that additional details on benchmark construction are needed to strengthen claims of broader applicability. The 312 questions were curated by legal experts specifically to isolate temporal failure modes (post-cutoff staleness and recency bias) in German statutory law, with selection focused on amendments near common LLM cutoffs and historical provisions. We will revise the benchmark section to explicitly describe the sampling frame (expert-driven targeting of amendment dates), exclusion criteria (e.g., removal of ambiguous or non-statutory items), inter-annotator agreement statistics from the expert validation, and coverage metrics across jurisdictions, amendment types, and fact patterns. These additions will better support the generalizability of the observed performance deltas.

Revision: yes

Referee: [Evaluation methodology] Evaluation section: the LLM-as-a-judge validation against human experts is asserted but supplies no validation-set size, agreement rates per model/setting, or statistical details; because correctness labels drive all reported performance differences, this gap is load-bearing for the central empirical claims.

Authors: We acknowledge that the manuscript asserts validation of the LLM-as-a-judge against human experts without supplying the requested quantitative details. We will revise the evaluation section to report the validation-set size, agreement rates (both overall and broken down by model and inference setting), and statistical measures such as percentage agreement or Cohen's kappa. This will directly substantiate the reliability of the correctness labels used for all performance comparisons.

Revision: yes
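
For readers unfamiliar with the two agreement measures named in this response, here is a small self-contained sketch of how they could be computed. The label lists are made-up toy data, not results from the paper, and the helper names `percent_agreement` and `cohens_kappa` are chosen for this example only.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Share of items on which the two raters give the same label."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        return float("nan")  # undefined when both raters use a single label
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data: 1 = answer judged correct, 0 = judged incorrect.
human = [1, 1, 1, 1, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 0, 0, 0, 1]

print(percent_agreement(human, judge))  # 0.75
print(cohens_kappa(human, judge))       # 0.5
```

In this toy case raw agreement is 0.75 but kappa is only 0.5, because half of the agreement would be expected by chance with balanced labels. That gap is why reporting kappa alongside percentage agreement is informative.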

Circularity Check

0 steps flagged

No significant circularity; purely empirical benchmark evaluation

The paper presents an empirical study: creation of a 312-question benchmark, evaluation of five LLMs under four settings, and use of an LLM-as-a-judge validated against human experts. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims reduce to direct measurements on the benchmark rather than any self-referential construction. This matches the default expectation of no circularity for empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on standard assumptions about expert annotation quality and the validity of LLM-as-judge for legal answer evaluation; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: LLM-as-a-judge outputs can be validated against human expert ratings for correctness on statutory QA tasks.
Used to scale evaluation of model answers across the 312 questions.

Reference graph

Works this paper leans on

[1] Marius Büttner and Ivan Habernal. 2024. Answering legal questions from laymen in German civil law system. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2015–2027. doi:10.18653/v1/2024.eacl-long.122
[2] buzer.de. 2026. Online Database of German Laws. https://www.buzer.de. Accessed December 16, 2025
[3] Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R. Marlowe, Carina Suzana Negreanu, Kitty Boxall, and Diana Mincu. 2025. LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation. In Proceedings of the Natural Legal Language Processing Worksho... doi:10.18653/v1/2025.nllp-1.23
[4] Bayerischer Anwaltverband e.V. 2024. Auswertung der Umfrage zur KI-Nutzung der bayerischen Anwaltschaft. Technical Report. Bayerischer Anwaltverband e.V. https://www.bayerischer-anwaltverband.de/site/assets/files/1765/umfrage_zur_ki_nutzung_der_bay_anwaltschaft.pdf
[5] Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Etienne Salimbeni, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, and Joel Niklaus. 2025. LEXam: Benchmarking Legal Reasoning on 340 Law Exams. https://arxiv.org/abs/2505.12864
[6] Hanpei Fang, Sijie Tao, Nuo Chen, Kai-Xin Chang, and Tetsuya Sakai. 2025. Do Large Language Models Favor Recent Content? A Study on Recency Bias in LLM-Based Reranking. In Proceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. Association for Computing Machinery, 85...
[7] Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and Phil Blunsom. 2021. Mind the Gap: Assessing Temporal Generalization in Neural Language Models. In Advances in Neural Information Processing Systems, Vol. 34. Curran Associates, Inc., 29348–29363. https://proceedings.neurips.cc/paper_files/paper/2021/file/f5bf0ba0a17ef18f9607774722f5698c-Paper.pdf
[8] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 9459–947...
[9] Max Prior, Adrian Hof, Niklas Wais, and Matthias Grabmair. 2025. Risks and Limits of Automatic Consolidation of Statutes. In Proceedings of the Natural Legal Language Processing Workshop 2025. Association for Computational Linguistics, 396–407. doi:10.18653/v1/2025.nllp-1.29
[10] Thomson Reuters. 2025. Generative AI in Professional Services Report. Technical Report. https://www.thomsonreuters.com/content/dam/ewp-m/documents/thomsonreuters/en/pdf/reports/2025-generative-ai-in-professional-services-report-tr5433489-rgb.pdf
[11] Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County. https://arxiv.org/abs/2503.03888
[12] Santosh T.y.s.s and Tuan-Quang Vuong. 2025. LexTempus: Enhancing Temporal Generalizability of Legal Language Models Through Dynamic Mixture of Experts. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6608–6624. doi:10.18653/v1/2025.acl-long.329
[13] Santosh T.y.s.s, Tuan-Quang Vuong, and Matthias Grabmair. 2024. ChronosLex: Time-aware Incremental Training for Temporal Generalization of Legal Classification Tasks. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 3022–3039. doi:10.18653/v1/2024.acl-long.166
[14] Juraj Vladika, Mahdi Dhaini, and Florian Matthes. 2025. Facts Fade Fast: Evaluating Memorization of Outdated Medical Knowledge in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025. 9161–9174. doi:10.18653/v1/2025.findings-emnlp.487
[15] Bernard L Welch. 1947. The generalization of ’Student’s’ problem when several different population variances are involved. Biometrika 34, 1/2 (1947), 28–35. doi:10.1093/biomet/34.1-2.28
[16] Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, and Lucy Lu Wang. 2025. Know Your Limits: A Survey of Abstention in Large Language Models. Transactions of the Association for Computational Linguistics 13 (2025), 529–556. doi:10.1162/tacl_a_00754
[17] Li Zhang, Jaromír Savelka, and Kevin Ashley. 2025. Do LLMs Truly Understand When a Precedent Is Overruled? (2025). https://arxiv.org/abs/2510.20941
[18] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. doi:10.48550/arXiv.2306.05685
[19] Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, and Daniel E. Ho. 2025. A Reasoning-Focused Legal Retrieval Benchmark. In Proceedings of the 2025 Symposium on Computer Science and Law. Association for Computing Machinery, 169–193. doi:10.1145/3709025.3712219
[20] Chenghao Zhu, Nuo Chen, Yufei Gao, Yunyi Zhang, Prayag Tiwari, and Benyou Wang. 2025. Is Your LLM Outdated? A Deep Look at Temporal Generalization. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational ... doi:10.18653/v1/2025.naacl-long.381
