
arxiv: 2604.17674 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Towards Intelligent Legal Document Analysis: CNN-Driven Classification of Case Law Texts

Pith reviewed 2026-05-10 05:51 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords legal document classification · convolutional neural networks · case law analysis · FastText embeddings · text classification · citation treatment · lightweight models

The pith

A lightweight CNN classifies legal case texts at 97.26 percent accuracy while using 5.1 million parameters and running over 13 times faster than BERT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Legal case documents contain formal language and specialized terms that slow down manual review. The paper demonstrates that lemmatisation followed by FastText embeddings and a multi-kernel one-dimensional convolutional neural network can sort citation treatments in these texts more accurately and far more efficiently than larger models. On a corpus of 25,000 annotated documents the system records 97.26 percent accuracy and 96.82 percent macro F1-score. It also shows the highest AUC-ROC among tested baselines while keeping inference latency at 0.31 milliseconds per document.
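A quick arithmetic check on the headline figures, assuming the 75/25 split is exact and reading "13 times faster" as meaning that BERT's per-document latency is at least 13 times the CNN's:

```latex
\begin{align*}
N_{\text{test}} &= 0.25 \times 25{,}000 = 6{,}250, &
N_{\text{train}} &= 0.75 \times 25{,}000 = 18{,}750, \\
N_{\text{correct}} &\approx 0.9726 \times 6{,}250 \approx 6{,}079, &
N_{\text{errors}} &\approx 6{,}250 - 6{,}079 = 171, \\
t_{\text{BERT}} &> 13 \times 0.31\ \text{ms} = 4.03\ \text{ms}, &
\text{throughput}_{\text{CNN}} &\approx \frac{1}{0.31\ \text{ms}} \approx 3{,}226\ \text{documents/s}.
\end{align*}
```

The 6,250 figure agrees with the test-set size quoted in the Figure 3 caption. The error count and throughput are derived here rather than reported by the paper, and the throughput assumes sequential single-document inference on whatever hardware produced the 0.31 ms measurement.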

Core claim

The authors establish that a convolutional architecture with subword-aware FastText embeddings and targeted preprocessing outperforms fine-tuned BERT, LSTM, random-embedding CNN, and TF-IDF KNN on citation-treatment classification of 25,000 legal documents, delivering 97.26 percent accuracy, 96.82 percent macro F1-score, and 97.83 percent AUC-ROC with only 5.1 million parameters and 0.31 ms inference time per document.
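Accuracy and macro F1 are reported side by side because they can diverge when classes are imbalanced. The sketch below computes both from a confusion matrix laid out as in Figure 3 (rows are true labels, columns are predicted labels). The three-class matrix is an invented toy example, not data from the paper, and the function names are illustrative.

```python
def accuracy(confusion):
    """Fraction of all documents that sit on the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # true c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, true otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


# Toy matrix (hypothetical counts): the third class is rare and poorly recalled.
toy = [
    [90, 10, 0],
    [5, 45, 0],
    [0, 5, 5],
]
print(f"accuracy = {accuracy(toy):.4f}, macro F1 = {macro_f1(toy):.4f}")
```

In the toy matrix the rare class drags macro F1 well below accuracy. The small gap between the paper's 97.26 percent accuracy and 96.82 percent macro F1 would be consistent with no class performing much worse than the others, although the per-class scores are not given in this summary.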

What carries the argument

A multi-kernel one-dimensional convolutional neural network that processes lemmatised text through subword-aware FastText embeddings to predict citation-treatment categories.
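The forward pass described in the Figure 1 caption (parallel branches with kernel sizes 2, 3 and 5, global max-pooling, concatenation, softmax) can be sketched in plain Python. This is a minimal illustration under several assumptions: the ReLU activation, the filter count, the embedding width, and the random weights are placeholders rather than the authors' settings, and the random vectors stand in for real FastText embeddings.

```python
import math
import random


def conv_branch(embedded, filters, biases):
    """One branch: 1D convolution over tokens, ReLU, then global max-pool.

    embedded: T token vectors of width d
    filters:  F filters, each k rows of width d
    """
    pooled = []
    for filt, bias in zip(filters, biases):
        k = len(filt)
        best = 0.0  # ReLU floor; also the result if the text is shorter than k
        for start in range(len(embedded) - k + 1):
            s = bias
            for j in range(k):
                s += sum(w * x for w, x in zip(filt[j], embedded[start + j]))
            best = max(best, s)
        pooled.append(best)
    return pooled


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(embedded, branches, dense_w, dense_b):
    """Concatenate pooled features from all branches, then dense + softmax."""
    features = []
    for filters, biases in branches:
        features.extend(conv_branch(embedded, filters, biases))
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(dense_w, dense_b)]
    return softmax(logits)


rng = random.Random(0)
d, T, n_filters, n_classes = 8, 12, 4, 5  # hypothetical sizes; five classes per Figure 2
embedded = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(T)]
branches = []
for k in (2, 3, 5):  # kernel sizes from the Figure 1 caption
    filters = [[[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(k)]
               for _ in range(n_filters)]
    branches.append((filters, [0.0] * n_filters))
dense_w = [[rng.uniform(-0.5, 0.5) for _ in range(3 * n_filters)]
           for _ in range(n_classes)]
dense_b = [0.0] * n_classes
probs = forward(embedded, branches, dense_w, dense_b)
print([round(p, 3) for p in probs])
```

Each kernel size looks at a different n-gram width, so the concatenated vector mixes bigram, trigram and five-token evidence before the classifier sees it.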

If this is right

  • Courts and law firms can triage incoming case filings at scale without requiring GPU clusters or cloud-scale compute.
  • The low parameter count permits deployment on standard laptops or mobile devices for field use by legal staff.
  • Because errors occur mainly between semantically adjacent categories, the model can flag borderline cases for quick human review.
  • Ablation results indicate that removing lemmatisation or FastText embeddings measurably degrades performance, confirming the pipeline's design choices.
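The borderline-case idea in the third bullet could be implemented as a margin test on the softmax output. The sketch below is one possible approach, not something the paper describes: the function name and the 0.15 margin are hypothetical, and a real threshold would need to be tuned on validation data.

```python
def flag_borderline(probs, margin=0.15):
    """Return (top class, runner-up class, needs_review).

    A prediction is flagged when the two most probable classes are
    closer than `margin`, which is an illustrative default.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top, runner_up = ranked[0], ranked[1]
    needs_review = (probs[top] - probs[runner_up]) < margin
    return top, runner_up, needs_review


# Two made-up probability vectors over five classes.
confident = [0.90, 0.05, 0.03, 0.01, 0.01]
ambiguous = [0.48, 0.45, 0.03, 0.02, 0.02]
print(flag_borderline(confident))
print(flag_borderline(ambiguous))
```

Returning the runner-up class alongside the flag lets a reviewer see which pair of categories the model could not separate.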

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Specialised preprocessing and compact convolutional designs may offer practical advantages over general-purpose large language models when the task domain is narrow and labelled data is available.
  • The latency reduction could support live assistance tools during court proceedings where immediate document classification is needed.
  • Extending the same architecture to multi-label citation classification or to joint prediction with metadata fields remains an open direction suggested by the current error patterns.

Load-bearing premise

That the 25,000 annotated documents with a 75/25 split represent typical real-world legal texts, and that the baseline models, including fine-tuned BERT, received comparable hyperparameter tuning without data leakage.
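One inexpensive check on the leakage half of this premise is to look for test documents whose normalised text also appears in the training set. The sketch below assumes the split is available as two lists of strings. The helper names are hypothetical, and exact-match hashing only catches verbatim duplicates, not near-duplicates or overlapping excerpts from the same judgment.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def find_leaks(train_texts, test_texts):
    """Indices of test documents that also occur in the training set."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts)
            if fingerprint(t) in train_prints]


# Made-up snippets for illustration only.
train = ["The appeal is  dismissed.", "Smith v Jones applied."]
test = ["the appeal is dismissed.", "Leave to appeal granted."]
print(find_leaks(train, test))
```

A non-empty result would not by itself invalidate the reported scores, but it would indicate that some test accuracy reflects memorised documents.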

What would settle it

Re-running the identical training pipeline on an independent legal corpus from a different jurisdiction or court system and finding that accuracy falls below 95 percent or that the speed advantage disappears.

Figures

Figures reproduced from arXiv: 2604.17674 by Ahnaf Atef Choudhury, Jia Uddin, Moinul Hossain, Sadia Afrin Promi, Shakila Rahman, Sourav Rabi Das, Zikrul Shariar Ayon.

Figure 1. Proposed CNN + FastText + Lemmatisation architecture. Raw legal text is cleaned, lemmatised, and converted to FastText embeddings. Three parallel 1D convolutional branches (kernel sizes 2, 3, 5) extract multi-scale legal patterns, which are globally max-pooled, concatenated, and passed through a fully connected softmax layer to predict the citation-treatment class.
Figure 2. ROC curves of the proposed and baseline models. Each curve plots the true-positive rate against the false-positive rate. AUC values (in parentheses in the legend) increase monotonically from KNN (TF-IDF) at 90.10% to the proposed CNN + FastText + Lemmatisation model at 97.83%, confirming superior discrimination across all five citation-treatment classes.
Figure 3. Confusion matrix of the proposed CNN + FastText + Lemmatisation model evaluated on the 6,250-document test set. Rows represent true citation-treatment labels; columns represent predicted labels. Strong diagonal dominance verifies high per-class accuracy, and the mass off diagonal is concentrated at the Positive-Neutral and Distinguished-Overruled category boundaries, indicative of semantic proximity betwe…
Figure 4. Training and validation accuracy (top) and loss (bottom) of the proposed CNN + FastText + Lemmatisation model over 20 representative epochs. Both curves converge smoothly without oscillation; training accuracy increases from approximately 70% to 98.5%, and validation accuracy tracks closely at 97.2%, confirming effective generalisation and minimal overfitting.
Original abstract

Legal practitioners and judicial institutions face an ever-growing volume of case-law documents characterised by formalised language, lengthy sentence structures, and highly specialised terminology, making manual triage both time-consuming and error-prone. This work presents a lightweight yet high-accuracy framework for citation-treatment classification that pairs lemmatisation-based preprocessing with subword-aware FastText embeddings and a multi-kernel one-dimensional Convolutional Neural Network (CNN). Evaluated on a publicly available corpus of 25,000 annotated legal documents with a 75/25 training-test partition, the proposed system achieves 97.26% classification accuracy and a macro F1-score of 96.82%, surpassing established baselines including fine-tuned BERT, Long Short-Term Memory (LSTM) with FastText, CNN with random embeddings, and a Term Frequency-Inverse Document Frequency (TF-IDF) k-Nearest Neighbour (KNN) classifier. The model also attains the highest Area Under the Receiver Operating Characteristic (AUC-ROC) curve of 97.83% among all compared systems while operating with only 5.1 million parameters and an inference latency of 0.31 ms per document - more than 13 times faster than BERT. Ablation experiments confirm the individual contribution of each pipeline component, and the confusion matrix reveals that residual errors are confined to semantically adjacent citation categories. These findings indicate that carefully designed convolutional architectures represent a scalable, resource-efficient alternative to heavyweight transformers for intelligent legal document analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a lightweight CNN framework for citation-treatment classification of legal case-law documents. It combines lemmatisation preprocessing with subword-aware FastText embeddings and a multi-kernel 1D CNN. On a public corpus of 25,000 annotated documents using a 75/25 train-test split, the model reports 97.26% accuracy, 96.82% macro F1, and 97.83% AUC-ROC, outperforming fine-tuned BERT, LSTM+FastText, CNN+random embeddings, and TF-IDF KNN. The system uses 5.1 million parameters and achieves 0.31 ms inference latency per document (claimed >13x faster than BERT). Ablation studies are said to confirm component contributions, with errors limited to semantically adjacent categories.

Significance. If the results hold under properly documented baselines, the work shows that a resource-efficient CNN can match or exceed transformer performance on a specialized legal NLP task while offering substantial gains in parameter count and inference speed. This is potentially valuable for practical deployment in legal institutions with limited compute. The efficiency metrics and domain-specific preprocessing are clear strengths that could influence future lightweight legal-document models.

major comments (2)
  1. [Abstract] Abstract and experimental results: The central claim of consistent superiority over fine-tuned BERT (97.26% accuracy and 96.82% macro F1) is load-bearing, yet no details are supplied on BERT's fine-tuning protocol, including epochs, learning rate, batch size, optimizer, legal-domain adaptation, or whether identical lemmatisation and preprocessing were applied to the same 75/25 split. Without these, the performance gap cannot be attributed to the CNN architecture rather than differences in baseline optimization or data handling.
  2. [Abstract] Abstract: Ablation experiments are invoked to confirm the contribution of lemmatisation, FastText embeddings, and the multi-kernel CNN, but no quantitative ablation numbers (accuracy or F1 drops upon removal of each component) are reported. This leaves the individual impact of each pipeline element unverified and weakens the supporting evidence for the proposed design.
minor comments (2)
  1. [Abstract] Abstract: The categories in 'citation-treatment classification' are not defined or exemplified, which reduces accessibility for readers unfamiliar with legal document analysis.
  2. [Abstract] Abstract: The reported inference latency (0.31 ms) and parameter count (5.1 million) are useful, but no hardware platform or batch size is specified for the latency measurement, limiting reproducibility.
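To make the second minor comment concrete, a latency figure becomes reproducible when it is reported together with the hardware, the batch size, and a measurement procedure such as the one sketched below. This is an illustrative harness, not the authors' protocol: `predict` stands for any single-document inference function, and the warm-up and repeat counts are arbitrary defaults.

```python
import statistics
import time


def measure_latency_ms(predict, documents, warmup=10, repeats=5):
    """Per-document latency in milliseconds as (median, fastest, slowest).

    Runs a short warm-up first so one-off costs (caches, lazy
    initialisation) do not inflate the first timed pass.
    """
    for doc in documents[:warmup]:
        predict(doc)
    per_doc = []
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            predict(doc)
        elapsed = time.perf_counter() - start
        per_doc.append(1000.0 * elapsed / len(documents))
    return statistics.median(per_doc), min(per_doc), max(per_doc)


def dummy_predict(doc):
    """Placeholder model: counts tokens instead of classifying."""
    return len(doc.split())


docs = ["the appeal is dismissed with costs"] * 200
median_ms, fastest_ms, slowest_ms = measure_latency_ms(dummy_predict, docs)
print(f"median {median_ms:.4f} ms (range {fastest_ms:.4f} to {slowest_ms:.4f})")
```

Reporting the median with its range, rather than a single number, also shows how stable the measurement is across passes.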

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped strengthen the experimental documentation in our manuscript. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: The central claim of consistent superiority over fine-tuned BERT (97.26% accuracy and 96.82% macro F1) is load-bearing, yet no details are supplied on BERT's fine-tuning protocol, including epochs, learning rate, batch size, optimizer, legal-domain adaptation, or whether identical lemmatisation and preprocessing were applied to the same 75/25 split. Without these, the performance gap cannot be attributed to the CNN architecture rather than differences in baseline optimization or data handling.

    Authors: We agree that the BERT fine-tuning protocol requires explicit documentation to support the performance claims. In the revised manuscript, we have expanded Section 4.2 (Baselines and Implementation Details) to specify the full fine-tuning configuration for BERT, including the number of epochs, learning rate schedule, batch size, optimizer, and any domain adaptation steps. We also explicitly state that the identical lemmatisation preprocessing pipeline and the same 75/25 train-test split were applied to BERT and all other baselines. These additions allow the performance differences to be attributed to architectural choices rather than experimental inconsistencies.

    revision: yes

  2. Referee: [Abstract] Abstract: Ablation experiments are invoked to confirm the contribution of lemmatisation, FastText embeddings, and the multi-kernel CNN, but no quantitative ablation numbers (accuracy or F1 drops upon removal of each component) are reported. This leaves the individual impact of each pipeline element unverified and weakens the supporting evidence for the proposed design.

    Authors: We acknowledge that while the original text states that ablation studies confirm component contributions, the quantitative metrics were not presented. In the revised manuscript, we have added a dedicated 'Ablation Study' subsection (Section 5.4) containing a new table that reports accuracy and macro F1 for the complete model versus each ablated variant. This table quantifies the performance impact of removing lemmatisation, replacing FastText embeddings, and using a single kernel instead of the multi-kernel design, thereby providing the requested verification of each element's contribution.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results on held-out test set with no self-referential derivations

full rationale

The paper reports standard supervised classification performance (accuracy, F1, AUC-ROC) for a CNN+FastText pipeline on a fixed 75/25 split of 25k documents. No equations, uniqueness theorems, ansatzes, or fitted parameters are redefined as predictions. Ablation studies and baseline comparisons are conventional empirical checks without load-bearing self-citations or reductions by construction. The central claims rest on experimental outcomes rather than tautological constructions, making the derivation self-contained.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard supervised learning assumptions plus several tuned architectural choices whose values are not reported in the abstract.

free parameters (3)
  • CNN kernel sizes and filter counts
    Multiple kernel sizes are used; exact values and counts are chosen to maximize reported accuracy.
  • FastText embedding dimension and training settings
    Subword-aware embeddings are employed; dimension and training hyperparameters are selected for the task.
  • Training hyperparameters (learning rate, batch size, epochs)
    Standard optimization settings required to reach the stated 97.26% accuracy.
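How these free parameters translate into a model size can be seen by counting weights for the architecture in Figure 1. The sketch below is illustrative only: the vocabulary size, embedding width and filter count are hypothetical values, not the paper's, and it is not known from this summary how the reported 5.1 million parameters are distributed across layers or whether the embedding table is included.

```python
def count_parameters(vocab_size, embed_dim, kernel_sizes, n_filters,
                     n_classes, trainable_embeddings=True):
    """Weight count for embedding + parallel 1D conv branches + dense layer."""
    embedding = vocab_size * embed_dim if trainable_embeddings else 0
    # Each filter spans k tokens of width embed_dim, plus one bias per filter.
    conv = sum(k * embed_dim * n_filters + n_filters for k in kernel_sizes)
    # Dense layer maps the concatenated pooled features to class logits.
    dense = len(kernel_sizes) * n_filters * n_classes + n_classes
    return {"embedding": embedding, "conv": conv, "dense": dense,
            "total": embedding + conv + dense}


# Hypothetical configuration; only the kernel sizes (Figure 1) and the
# five classes (Figure 2) come from the source.
breakdown = count_parameters(vocab_size=10_000, embed_dim=100,
                             kernel_sizes=(2, 3, 5), n_filters=64,
                             n_classes=5)
print(breakdown)
```

With these placeholder values the embedding table holds the large majority of the weights, which is typical for compact text CNNs and is one reason the vocabulary size and embedding dimension matter for any parameter-count comparison.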
axioms (2)
  • domain assumption: The 75/25 train-test split on the 25,000-document corpus is random and free of leakage between citation categories.
    Standard practice but critical for the generalization claim.
  • domain assumption: Lemmatisation and subword tokenization preserve all semantically relevant information for citation-treatment classification.
    Preprocessing step assumed sufficient without loss of legal meaning.

pith-pipeline@v0.9.0 · 5591 in / 1414 out tokens · 41396 ms · 2026-05-10T05:51:20.398044+00:00 · methodology

