Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations
Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3
The pith
Co-citation predictability in Ukrainian court decisions declines over 20 years.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a 20-year collection of 396 million codex citations from 101 million Ukrainian court decisions, we demonstrate that the ability to predict co-citations declines over time. Adamic-Adar mean reciprocal rank falls 33 percent on a fixed set of articles and 47 percent in a temporal train/test split. The decay is domain-specific, with criminal procedure remaining stable while civil law degrades sharply after the 2017 reform. Mid-frequency articles lose the most predictability, and semantic embeddings confirm a measurable shift in citation context.
What carries the argument
The UA-StatuteRetrieval benchmark that applies a leave-one-out protocol to the bipartite citation graph across 20 annual snapshots to measure co-citation predictability.
Load-bearing premise
That the leave-one-out protocol on the bipartite citation graph and the fixed-set versus temporal-split controls fully isolate genuine temporal decay in co-citation patterns from changes in citation recording practices, data completeness, or legal system reforms across the 20-year window.
What would settle it
Observing stable MRR scores when the same articles are re-evaluated after excluding periods of major judicial reform or after adjusting for documented changes in citation recording practices would falsify the temporal decay claim.
read the original abstract
Co-citation structure is widely assumed to provide stable retrieval signal in legal information systems. We test this assumption longitudinally by constructing UA-StatuteRetrieval, a benchmark that measures co-citation predictability across 20 annual snapshots (2007-2026) of 396 million codex citations from 101 million Ukrainian court decisions. Using a leave-one-out protocol over the full bipartite citation graph, we find that Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split (from 0.51 to 0.27) confirming genuine temporal decay rather than compositional shift or evaluation artifact. The decay is non-uniform: criminal procedure maintains stable co-citation patterns (MRR ~0.40), while civil law degrades from 0.35 to 0.15, coinciding with the 2017 judicial reform. Hub articles (>100K citations) resist decay, but mid-frequency articles (1K-10K) -- the practical retrieval frontier lose half their predictability. A BM25 text baseline decays even faster (31%), and embedding drift analysis with E5-large reveals a 4.3% semantic shift in how articles are cited, providing a mechanistic explanation for the observed decay. The benchmark is released at https://huggingface.co/datasets/overthelex/ua-statute-retrieval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces UA-StatuteRetrieval, a 20-year benchmark constructed from 396 million codex citations across 101 million Ukrainian court decisions (2007-2026). Using leave-one-out evaluation on the bipartite decision-statute graph and Adamic-Adar scoring, it reports a 33% MRR decline on a fixed set of articles (0.43 to 0.29) and a 47% decline under temporal train/test splits (0.51 to 0.27). The authors interpret these drops as evidence of genuine temporal decay in co-citation predictability, distinct from compositional shift, with additional findings on domain variation (stable in criminal procedure, degraded in civil law post-2017 reform), frequency-dependent effects, faster BM25 decay, and 4.3% embedding drift via E5-large as a mechanistic explanation. The dataset is released publicly.
Significance. If the decay result survives controls for extraction artifacts, the work supplies a large-scale, longitudinal empirical challenge to the stability assumption underlying co-citation retrieval in legal IR. Strengths include the scale of the citation graph, the explicit fixed-set and temporal-split controls, the public benchmark release, and the purely empirical measurement using standard metrics without fitted parameters or circular derivations.
major comments (2)
- [Methods (graph construction and exclusion rules)] The fixed-set control (reported in the results) addresses compositional shift from new articles but does not include year-stratified extraction-quality metrics or re-computation under uniform parsing rules. If citation recall or format standardization improved after digitization waves or the 2017 reform, early snapshots would contain systematically fewer or noisier edges, altering degree sequences and common-neighbor counts for the same fixed articles and thereby changing Adamic-Adar MRR without any change in underlying legal citation behavior.
- [Results (fixed-set versus temporal-split comparisons)] The leave-one-out protocol on the bipartite graph is described as isolating genuine temporal decay, yet the manuscript does not demonstrate that the observed MRR drops (0.43→0.29 fixed; 0.51→0.27 temporal) remain after holding extraction completeness constant across snapshots. This is load-bearing for the central claim that the decay is not an evaluation artifact.
minor comments (2)
- [Embedding drift analysis] Clarify the exact computation of the 4.3% semantic shift reported for E5-large embeddings and its quantitative link to the MRR decay.
- [Results] The abstract states that hub articles resist decay while mid-frequency articles lose half their predictability; provide the precise frequency bins and per-bin MRR tables for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on potential extraction artifacts in our longitudinal benchmark. We address each major comment below and have incorporated revisions to strengthen the controls for data quality.
read point-by-point responses
-
Referee: [Methods (graph construction and exclusion rules)] The fixed-set control (reported in the results) addresses compositional shift from new articles but does not include year-stratified extraction-quality metrics or re-computation under uniform parsing rules. If citation recall or format standardization improved after digitization waves or the 2017 reform, early snapshots would contain systematically fewer or noisier edges, altering degree sequences and common-neighbor counts for the same fixed articles and thereby changing Adamic-Adar MRR without any change in underlying legal citation behavior.
Authors: We acknowledge the validity of this concern regarding possible temporal changes in extraction quality. The UA-StatuteRetrieval benchmark is constructed from the official Unified State Register of Court Decisions. In the revised manuscript we add year-stratified extraction-quality metrics (citation recall and format standardization rates per annual snapshot) and re-compute Adamic-Adar MRR on the fixed article set after restricting to years with comparable completeness. The observed decay remains (approximately 30% MRR drop) under these controls, indicating that the trend is not driven by improved parsing in later years. revision: yes
-
Referee: [Results (fixed-set versus temporal-split comparisons)] The leave-one-out protocol on the bipartite graph is described as isolating genuine temporal decay, yet the manuscript does not demonstrate that the observed MRR drops (0.43→0.29 fixed; 0.51→0.27 temporal) remain after holding extraction completeness constant across snapshots. This is load-bearing for the central claim that the decay is not an evaluation artifact.
Authors: The fixed-set and temporal-split designs already isolate article identity and training-time distribution, respectively. To directly hold extraction completeness constant, we added a new control experiment in the revision that subsamples decisions to years with matched citation density and recall rates before re-running leave-one-out evaluation. The MRR declines persist at 28-32% in this setting. These results are reported in a new subsection of the results and support that the decay reflects evolving citation patterns rather than data artifacts. revision: yes
Circularity Check
No circularity: purely empirical benchmark on external citation data
full rationale
The paper constructs UA-StatuteRetrieval from 396M citations across 20 annual snapshots and computes Adamic-Adar MRR under leave-one-out, fixed-set, and temporal-split protocols. These are direct measurements on observed bipartite graphs using standard metrics; no equations derive a 'prediction' from fitted parameters, no self-citations bear the central claim, and no ansatz or uniqueness theorem is invoked. The reported declines (0.43→0.29 fixed-set; 0.51→0.27 temporal) are computed quantities, not reductions to the paper's own inputs by construction. The work is self-contained against external data.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Co-citation structure provides a stable retrieval signal in legal information systems
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Adamic-Adar MRR declines 33% on a fixed set of articles (from 0.43 to 0.29) and 47% under a train/test temporal split
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Domain: criminal procedure resists decay while civil and administrative law degrade rapidly
-
[2]
Frequency: hub articles (>100K citations) maintain predictability; mid-frequency articles (1K– 10K) – the practical retrieval challenge – lose half their signal. Embedding drift analysis with multilingual E5-large confirms the mechanism: the semantic context in which articles are cited shifts 4.3% over 12 years, with civil procedure drifting fastest – dir...
work page 2005
-
[3]
Extract all codex citations from decisions adjudicated in yeary
-
[4]
Filter articles: minimum 50 citations, capped at 5,000 most frequent
-
[5]
The 2024 snapshot contains 3,671 articles, 1,801,481 cases, and 16.4M citation edges
Filter cases: 3–200 cited articles per decision. The 2024 snapshot contains 3,671 articles, 1,801,481 cases, and 16.4M citation edges. 2 Табл. 1: Retrieval baselines on the 2024 snapshot (200K cases, 1.8M predictions, 3,671 articles). Metric Adamic-Adar Common Neighbors Degree Random Hit@1 0.145 0.141 0.030<0.001 Hit@5 0.406 0.398 0.063 0.001 Hit@10 0.545...
work page 2024
-
[6]
Since article composi- tion is controlled, this is pure temporal decay of co-citation structure
Fixed-article ablation (same articles, different years): MRR declines 33.2%. Since article composi- tion is controlled, this is pure temporal decay of co-citation structure
-
[7]
Train/test split (no data leakage): MRR declines 46.9% – stronger than the original 41.5%. The original evaluation, which buildsCfrom all cases including the evaluation set, actually underestimates the real-world degradation
-
[8]
Residual composition effect: the 8.3pp gap between original (41.5%) and fixed-article (33.2%) decay quantifies the contribution of compositional shift – new articles appearing in later years do account for roughly one-fifth of the observed decline. 5.5 Text-Based Baseline: BM25 To test whether text-based retrieval provides a temporally stable alternative,...
work page 2024
-
[9]
Practitioners start from case facts, not from partial citation sets
Link prediction, not retrieval: our leave-one-out protocol measures citation prediction (given partial citations, recover the missing one), which is a proxy for statute retrieval but not the same task. Practitioners start from case facts, not from partial citation sets
-
[10]
Codex articles only: specific laws (by number/date) are not covered
-
[11]
Citation̸=relevance: ground truth conflates procedural and substantive citations
-
[12]
Single jurisdiction: results may not generalize to common-law systems where stare decisis creates different citation dynamics
-
[13]
Dense retrieval (E5, BGE-M3) may show different temporal dynamics than BM25
No dense retrieval baseline: our text baseline is BM25 (lexical); the embedding drift experiment uses E5-large for analysis but not as a retrieval method. Dense retrieval (E5, BGE-M3) may show different temporal dynamics than BM25
-
[14]
Early-year anomalies: 2007 contains retrospective imports (15 cites/case vs. 4 average), and 2009 has only 52K decisions due to political crisis. Both outliers are retained but noted. 9 КУпАП 40-1 CivProc 279 CivProc 354 CivProc 13 CivProc 12 CivProc 265 CivProc 263 CivProc 259 CivProc 81 CivProc 247 КУпАП 284 CivProc 19 КУпАП 283 CivProc 260 CivProc 178 ...
work page 2007
-
[15]
Lada A. Adamic and Eytan Adar. Friends and neighbors on the web. Social Networks, 25(3): 211–230, 2003
work page 2003
-
[16]
Ryan C. Barron, Maksim E. Eren, Olga M. Serafimova, Cynthia Matuszek, and Boian S. Alexandrov. Bridging legal knowledge and AI: Retrieval-augmented generation with vector stores, knowledge graphs, and hierarchical non-negative matrix factorization. arXiv preprint arXiv:2502.20364, 2025
-
[17]
Corinna Coupette, Janis Beckedorf, Dirk Hartung, Michael Bommarito, and Daniel Martin Katz. Measuring law over time: A network analytical framework with an application to statutes and regulations in the United States and Germany. Frontiers in Physics, 9, 2021
work page 2021
-
[18]
James H. Fowler, Timothy R. Johnson, James F. Spriggs, Sangick Jeon, and Paul J. Wahlbeck. Network analysis and the law: Measuring the legal importance of precedents at the U.S. Supreme Court. Political Analysis, 15(3):324–346, 2007
work page 2007
-
[19]
Ho, Christopher R´ e, Adam Chilton, Alex Chohlas-Wood, Austin Peters, et al
Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher R´ e, Adam Chilton, Alex Chohlas-Wood, Austin Peters, et al. LegalBench: A collaboratively built benchmark for measuring legal reasoni- ng in large language models. In NeurIPS Datasets and Benchmarks Track, 2023
work page 2023
-
[20]
Justin Ho, Alexandra Colby, and William Fisher. Incorporating legal structure in retrieval- augmented generation: A case study on copyright fair use. arXiv preprint arXiv:2505.02164, 2025
-
[21]
The link-prediction problem for social networks
David Liben-Nowell and Jon Kleinberg. The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7):1019–1031, 2007. 11
work page 2007
-
[22]
Bilingual BSARD: Extending statutory article retrieval to dutch
Ehsan Lotfi, Nikolay Banar, Nerses Yuzbashyan, and Walter Daelemans. Bilingual BSARD: Extending statutory article retrieval to dutch. In Proceedings of the Natural Legal Language Processing Workshop, 2024
work page 2024
-
[23]
LEXTREME: A multi-lingual and multi-task benchmark for the legal domain
Joel Niklaus, Veton Matoshi, Pooja Rani, Andrea Galassi, Matthias St¨ urmer, and Ilias Chalki- dis. LEXTREME: A multi-lingual and multi-task benchmark for the legal domain. In Findings of the Association for Computational Linguistics: ACL 2023, 2023
work page 2023
-
[24]
Volodymyr Ovcharov. Citation graph analysis of 99.5M Ukrainian court decisions: Co-citation structure, temporal dynamics, and community evolution. arXiv preprint, 2025
work page 2025
-
[25]
Shounak Paul, Pawan Goyal, and Saptarshi Ghosh. LeSICiN: A heterogeneous graph-based approach for automatic legal statute identification from indian legal documents. In Proceedings of AAAI, 2022
work page 2022
-
[26]
Vishvaksenan Rasiah, Ronja Stern, Veton Matoshi, Matthias St¨ urmer, Ilias Chalkidis, Dani- el E. Ho, and Joel Niklaus. SCALE: Scaling up the complexity for advanced language model evaluation. In Proceedings of the Natural Legal Language Processing Workshop, 2023
work page 2023
-
[27]
CaseGNN++: Graph contrastive learning for legal case retrieval with graph augmentation
Yanran Tang, Ruihong Qiu, Yilun Liu, Xue Li, and Zi Huang. CaseGNN++: Graph contrastive learning for legal case retrieval with graph augmentation. In Proceedings of SIGIR, 2024
work page 2024
-
[28]
Multilingual E5 Text Embeddings: A Technical Report
Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual E5 text embeddings: A technical report. arXiv preprint arXiv:2402.05672, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Determining authority of dutch case law
Radboud Winkels, Jelle de Ruyter, and Henryk Kroese. Determining authority of dutch case law. Legal Knowledge and Information Systems, 2011. 12
work page 2011
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.