pith. machine review for the scientific record.

arxiv: 2605.10186 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.AI

Recognition: no theorem link

LegalCiteBench: Evaluating Citation Reliability in Legal Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: legal language models · citation reliability · LLM benchmarks · closed-book retrieval · legal AI evaluation · misleading answers · case citation

The pith

Legal language models frequently generate incorrect citations when not given external sources.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LegalCiteBench to test how reliably LLMs can recall and generate legal case citations without access to databases or search tools. Using approximately 24,000 test instances drawn from 1,000 real U.S. court opinions, it evaluates 21 models across five citation tasks, including retrieval and completion. Even the strongest models score below 7 out of 100 on exact citation recovery. Most models instead supply plausible but wrong authorities, with misleading answer rates above 94 percent on retrieval-heavy tasks for 20 of 21 models. This matters because incorrect citations in legal work can cause professional harm when models operate without external grounding.

Core claim

In a closed-book setting without external grounding, large language models achieve exact citation recovery scores below 7 out of 100 on retrieval and completion tasks within LegalCiteBench, while exhibiting misleading answer rates above 94 percent for 20 of 21 evaluated models.
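
To make the headline numbers concrete, here is a minimal sketch of how an exact-recovery score and a Misleading Answer Rate (MAR) of this kind could be computed. The paper's actual scoring formulas, normalization rules, and abstention detection are not reproduced here, so the function names, normalization, and abstention check below are illustrative assumptions.

```python
# Minimal sketch; the benchmark's real scoring rules are assumptions here.
# An answer counts as an exact recovery only if it matches the gold citation
# after light whitespace normalization; any other non-abstaining answer counts
# toward the Misleading Answer Rate (MAR).

def normalize(citation: str) -> str:
    """Collapse whitespace only; the real protocol may be stricter or looser."""
    return " ".join(citation.split())

def score_run(predictions: list[str], golds: list[str]) -> dict[str, float]:
    exact = misleading = answered = 0
    for pred, gold in zip(predictions, golds):
        if not pred.strip() or "cannot" in pred.lower():
            continue  # treated as an abstention in this sketch
        answered += 1
        if normalize(pred) == normalize(gold):
            exact += 1
        else:
            misleading += 1  # a concrete but wrong or low-overlap authority
    n = len(golds)
    return {
        "exact_recovery_per_100": 100.0 * exact / n if n else 0.0,
        "misleading_answer_rate": 100.0 * misleading / answered if answered else 0.0,
    }

# Toy example: one exact match, three confidently wrong answers.
print(score_run(
    ["410 U.S. 113", "347 US 483", "5 F.3d 1412", "123 F.2d 456"],
    ["410 U.S. 113", "347 U.S. 483", "5 F.3d 1413", "987 F.2d 654"],
))
```

Under this toy scoring, a model that almost always answers but rarely matches the gold citation exactly receives a low recovery score and a high MAR at the same time, which is the pattern the core claim describes.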

What carries the argument

LegalCiteBench, a collection of approximately 24,000 evaluation instances spanning citation retrieval, citation completion, citation error detection, case matching, and case verification and correction, derived from 1,000 real judicial opinions.
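
To give a sense of what one of these evaluation instances might contain, the following is a hypothetical schema covering the five task types. LegalCiteBench's actual field names, prompt wording, and identifiers are not specified in this summary, so everything below is an illustrative assumption.

```python
# Hypothetical instance schema; the benchmark's real fields and prompts are assumptions.
from dataclasses import dataclass
from typing import Literal

Task = Literal[
    "citation_retrieval",        # recall the citation for a described case or holding
    "citation_completion",       # fill in a partially masked citation
    "citation_error_detection",  # decide whether a given citation is erroneous
    "case_matching",             # match a proposition to the correct authority
    "case_verification",         # verify a cited case and correct it if wrong
]

@dataclass
class CitationInstance:
    task: Task
    source_opinion_id: str  # which of the ~1,000 source opinions the item came from
    prompt: str             # closed-book question shown to the model
    gold_citation: str      # reference answer extracted from the opinion

example = CitationInstance(
    task="citation_completion",
    source_opinion_id="cap-000123",  # hypothetical identifier
    prompt="Complete the citation: Brown v. Board of Education, ___ U.S. 483 (1954).",
    gold_citation="347 U.S. 483",
)
```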

If this is right

  • Legal drafting and research workflows that use LLMs for citations must incorporate external retrieval or verification to prevent reliance on fabricated authorities.
  • Explicit instructions for models to abstain when uncertain reduce some confident errors but leave citation accuracy largely unchanged (a prompt sketch follows this list).
  • Neither larger model scale nor legal-domain pretraining resolves the core difficulty of closed-book citation recovery.
  • Diagnostic tools focused on authority generation can help identify when models should defer to search systems rather than generate citations.
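
The abstention result mentioned above comes from a prompt-only experiment. A setup of that kind could look like the sketch below; the instruction wording and the helper function are hypothetical reconstructions, not the paper's actual template.

```python
# Hypothetical abstention instruction; the paper's real prompt template is not reproduced here.
ABSTENTION_PROMPT = (
    "Answer only if you are confident the citation is correct. "
    "If you are not certain of the exact reporter, volume, and page, "
    "reply exactly with: I cannot verify this citation."
)

def build_query(question: str, abstain: bool = True) -> str:
    """Prepend the abstention instruction to a closed-book citation question."""
    prefix = ABSTENTION_PROMPT + "\n\n" if abstain else ""
    return prefix + question

print(build_query("Provide the full citation for Miranda v. Arizona."))
```

The paper's reported finding is that instructions like this shift some confident fabrications into abstentions without making the remaining answered citations more accurate.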

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Legal AI systems may need mandatory integration with verified case databases to reach usable reliability levels in practice.
  • The pattern of high misleading answer rates could extend to other knowledge-intensive professional domains where precise recall matters.
  • Future work could test whether combining LegalCiteBench with retrieval-augmented setups measurably improves outcomes on the same tasks.

Load-bearing premise

The 24,000 instances constructed from 1,000 judicial opinions accurately capture the distribution and ambiguity of real-world legal citation tasks without introducing selection bias or overly simplified matching criteria.
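
The concern about "overly simplified matching criteria" can be made concrete: whether "347 US 483" and "347 U.S. 483" count as the same authority depends entirely on the normalization applied before comparison. The sketch below contrasts a strict exact match with a lightly normalized one; both variants are hypothetical, since the benchmark's actual matching rules are only summarized at a high level here.

```python
# Hypothetical contrast between strict and lightly normalized citation matching;
# LegalCiteBench's actual comparison rules are not reproduced here.
import re

def strict_match(pred: str, gold: str) -> bool:
    """Exact string equality after trimming outer whitespace."""
    return pred.strip() == gold.strip()

def normalized_match(pred: str, gold: str) -> bool:
    """Ignore periods and letter case, collapse internal whitespace."""
    def norm(c: str) -> str:
        return re.sub(r"\s+", " ", c.replace(".", "").lower()).strip()
    return norm(pred) == norm(gold)

pred, gold = "347 US 483", "347 U.S. 483"
print(strict_match(pred, gold))      # False: scored as a wrong authority
print(normalized_match(pred, gold))  # True: same case, different citation style
```

Which of these two behaviors the benchmark implements is exactly what the referee's second major comment asks the authors to state explicitly.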

What would settle it

A direct comparison of model scores on LegalCiteBench against performance by practicing lawyers or law students on the identical tasks would show whether the reported failures reflect a genuine gap in model capability.

Figures

Figures reproduced from arXiv: 2605.10186 by Hang Yin, Shunfan Zhou, Sijia Chen.

Figure 1. Overview of the LegalCiteBench construction and evaluation pipeline.
Figure 2. Category-level performance summary across 21 evaluated models.
Figure 3. Misleading Answer Rate (MAR) across evaluated models.
Original abstract

Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LegalCiteBench, a benchmark of ~24K instances derived from 1,000 U.S. judicial opinions, to evaluate LLMs on five closed-book citation tasks: retrieval, completion, error detection, case matching, and verification/correction. Evaluation across 21 models shows even the strongest achieve <7/100 on retrieval and completion, with Misleading Answer Rates (MAR) >94% for 20/21 models on retrieval tasks; scale and legal pretraining yield limited gains, and abstention prompts reduce some fabrication but not correctness.

Significance. If the benchmark's construction and matching rules accurately reflect genuine closed-book citation failures rather than artifacts, the results highlight a serious limitation in current LLMs for legal authority generation, with direct implications for professional use where fabricated citations pose harm. The work provides a diagnostic framework that could guide improvements in training, retrieval-augmented generation, and uncertainty handling for legal-domain models.

major comments (2)
  1. [Benchmark construction] Benchmark construction section: the process for selecting the 1,000 opinions and deriving the 24K instances must specify sampling strategy (e.g., random vs. stratified by jurisdiction, era, or citation density) and how multi-citation or ambiguous opinions are handled, as unrepresentative selection could make the low scores and high MAR reflect benchmark artifacts rather than model limitations.
  2. [Evaluation protocol] Evaluation protocol and metrics section: the criteria for scoring a citation as 'correct' or 'misleading' (including the overlap threshold for case matching and handling of low-overlap authorities) must explicitly state whether it normalizes for Bluebook variants, parallel citations, reporter differences, case name aliases, or jurisdiction-specific styles; without such normalization, models surfacing valid but non-identical authorities will be penalized, directly undermining the central claim that scores <7/100 and MAR >94% indicate absence of knowledge rather than metric strictness.
minor comments (2)
  1. [Experimental setup] The prompt templates for each of the five tasks should be provided in an appendix or table to allow exact reproduction of the abstention experiment and other results.
  2. [Results] Clarify in the results section whether the reported scores are macro-averaged across tasks or weighted by instance counts, and include per-task breakdowns for all 21 models rather than aggregated MAR figures (a small averaging example follows).
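
The macro-versus-weighted distinction in the second minor comment is easy to state in code: a macro average weights each task equally, while an instance-weighted average lets the largest tasks dominate. The scores and instance counts below are invented purely to show how the two can diverge.

```python
# Invented per-task scores and instance counts, for illustration only.
per_task_scores = {"retrieval": 4.0, "completion": 6.5, "error_detection": 62.0,
                   "case_matching": 41.0, "verification": 38.0}
instances_per_task = {"retrieval": 8000, "completion": 8000, "error_detection": 3000,
                      "case_matching": 2500, "verification": 2500}

macro = sum(per_task_scores.values()) / len(per_task_scores)
weighted = (sum(per_task_scores[t] * instances_per_task[t] for t in per_task_scores)
            / sum(instances_per_task.values()))

print(f"macro-averaged score:    {macro:.1f}")    # every task counts equally
print(f"instance-weighted score: {weighted:.1f}")  # retrieval-heavy tasks dominate
```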

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on LegalCiteBench. The comments highlight important areas for improving clarity in benchmark construction and evaluation details. We address each major comment below and will revise the manuscript to add the requested specifications while preserving the core claims and methodology.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction section: the process for selecting the 1,000 opinions and deriving the 24K instances must specify sampling strategy (e.g., random vs. stratified by jurisdiction, era, or citation density) and how multi-citation or ambiguous opinions are handled, as unrepresentative selection could make the low scores and high MAR reflect benchmark artifacts rather than model limitations.

    Authors: We agree that explicit details on sampling are needed to demonstrate representativeness. The current manuscript identifies the source corpus but does not elaborate on selection. In the revised version, we will expand the Benchmark Construction section to describe the process: opinions were drawn from the Case Law Access Project to ensure coverage across federal and state jurisdictions and multiple decades, with instances generated by extracting every citation present in each opinion. Multi-citation opinions contributed multiple independent instances, and opinions with incomplete or ambiguous metadata were excluded after initial filtering to maintain data quality. These additions will allow readers to assess whether the low performance reflects model limitations rather than sampling artifacts. revision: yes

  2. Referee: [Evaluation protocol] Evaluation protocol and metrics section: the criteria for scoring a citation as 'correct' or 'misleading' (including the overlap threshold for case matching and handling of low-overlap authorities) must explicitly state whether it normalizes for Bluebook variants, parallel citations, reporter differences, case name aliases, or jurisdiction-specific styles; without such normalization, models surfacing valid but non-identical authorities will be penalized, directly undermining the central claim that scores <7/100 and MAR >94% indicate absence of knowledge rather than metric strictness.

    Authors: We will revise the Evaluation Protocol and Metrics section to fully specify the scoring rules, including the exact overlap threshold applied for case matching and the treatment of low-overlap outputs as misleading. Our protocol performs direct matching against the precise citation strings and case identifiers appearing in the source opinions and does not apply broad normalization for Bluebook variants, parallel citations, or aliases. This design choice is deliberate: legal practice requires accurate, usable citations, and the benchmark aims to measure whether models can produce them from internal knowledge alone. We will also add a paragraph discussing the implications of this strictness, noting that while some semantically related authorities might be scored as incorrect, the high misleading answer rates still indicate a fundamental limitation in reliable authority generation. This clarification will strengthen rather than undermine the central claims. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

Full rationale

The paper constructs LegalCiteBench from 1,000 external judicial opinions in the Case Law Access Project and reports observed performance of 21 LLMs on five citation tasks using metrics such as exact recovery scores and Misleading Answer Rates. No equations, derivations, fitted parameters, predictions, or self-citations appear that reduce any claim to its inputs by construction. All central results are direct empirical measurements against an independently sourced corpus, making the work self-contained without any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim that models exhibit high rates of incorrect citation generation rests on the assumption that the constructed test instances provide an objective and representative ground truth for citation correctness.

axioms (1)
  • domain assumption: The 1,000 selected U.S. judicial opinions contain citation patterns that are representative of common-law citation usage.
    Benchmark construction begins from these opinions; if they are atypical, the measured failure rates may not generalize.

pith-pipeline@v0.9.0 · 5583 in / 1276 out tokens · 104597 ms · 2026-05-12T03:56:35.848871+00:00 · methodology

discussion (0)

