Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification

Aditya Joshi; Amrita Singh; H. Suhan Karaca; Hye-Young Paik; Jiaojiao Jiang

arxiv: 2508.07849 · v2 · pith:NCLVCF46new · submitted 2025-08-11 · 💻 cs.CL

Pith reviewed 2026-05-25 07:56 UTC · model grok-4.3

classification 💻 cs.CL

keywords: legal NLP, transformer models, contract classification, domain-specific models, generalist models, BERT variants, imbalanced datasets, state-of-the-art results

The pith

Legal-specific transformer models outperform generalist models on contract classification tasks, setting new state-of-the-art results with fewer parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates 13 legal-specific transformer models against 9 generalist ones across three English contract classification tasks. It finds that the domain-adapted models deliver better accuracy, particularly when legal nuance is required, and they misclassify rare classes less often in imbalanced data. Legal-BERT and Contracts-BERT reach new best results on two tasks while using 69 percent fewer parameters than the strongest generalist models. The work underscores that general models fall short for specialized legal work and points to the value of customization.

Core claim

Legal-specific models such as Legal-BERT and Contracts-BERT establish new state-of-the-art performance on two of the three contract classification tasks, despite having 69% fewer parameters than the best generalist models. These models also reduce misclassifications on rare classes in imbalanced datasets, demonstrating consistent superiority over generalist transformers especially on tasks needing nuanced legal understanding. CaseLaw-BERT and LexLM emerge as additional strong baselines.
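
To make the "69% fewer parameters" figure concrete, here is a minimal sketch of the arithmetic behind such a comparison. The parameter counts are illustrative assumptions (a roughly 110M-parameter base-size encoder against a roughly 355M-parameter large-size encoder), not numbers taken from the paper, and the function name is hypothetical.

```python
def parameter_reduction(small_params: int, large_params: int) -> float:
    """Return the fraction by which small_params is smaller than large_params."""
    if large_params <= 0:
        raise ValueError("large_params must be positive")
    return 1.0 - small_params / large_params


# Hypothetical sizes, NOT reported in the paper:
# a ~110M base-size encoder vs. a ~355M large-size encoder.
legal_model_params = 110_000_000
generalist_model_params = 355_000_000

reduction = parameter_reduction(legal_model_params, generalist_model_params)
print(f"{reduction:.1%} fewer parameters")  # prints "69.0% fewer parameters"
```

Under these assumed sizes the reduction comes out close to the reported figure, which is consistent with a base-size legal model being compared against a large-size generalist, though the summary above does not confirm which models underlie the number.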

What carries the argument

Comparative evaluation of 13 legal-specific versus 9 generalist transformer-based models on three contract classification tasks.

If this is right

Legal-specific models should be prioritized for contract classification work over generalist alternatives.
Smaller domain-adapted models can surpass larger general models in legal settings.
Generalist models show shortcomings in handling nuanced legal distinctions and imbalanced classes.
CaseLaw-BERT and LexLM provide reliable additional baselines for future legal contract tasks.
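
The point about rare classes can be illustrated with a small sketch of why accuracy alone hides rare-class errors on imbalanced data, and why a macro-averaged F1 exposes them. The labels and predictions below are invented toy data, and nothing here is claimed to be the paper's evaluation code or its chosen metric.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[t] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pred_total = tp[label] + fp[label]
        true_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / true_total if true_total else 0.0
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy imbalanced data (hypothetical): 18 common examples, 2 rare ones.
y_true = ["common_clause"] * 18 + ["rare_clause"] * 2
# A classifier that always predicts the majority class.
y_pred = ["common_clause"] * 20

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # 0.900
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")  # 0.474
```

A model that never predicts the rare class still reaches 90% accuracy here, while its macro-F1 drops below 0.5, which is the kind of gap a rare-class comparison between model families needs to surface.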

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gains from domain customization may appear in other specialized fields like medicine or finance.
Future work could test whether the advantage holds when generalist models are further fine-tuned on legal data.
Organizations handling legal documents might achieve better results and lower compute costs by adopting these smaller legal models.

Load-bearing premise

The three selected contract classification tasks and the 22 chosen models sufficiently represent the full range of legal contract classification needs.

What would settle it

A new evaluation on a broader set of contract tasks or with updated model versions where generalist models achieve equal or higher accuracy than legal-specific ones would challenge the claim.

Original abstract

Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as 'legal-specific' models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Legal-specific models beat generalists on three contract tasks with fewer parameters, but the broad superiority claim rests on unverified task representativeness and missing experimental details.

The paper's main contribution is the first explicit comparison of 13 legal-specific transformers against 9 generalist ones on three English contract classification tasks. Legal-BERT and Contracts-BERT set new SOTAs on two tasks while using 69% fewer parameters, and the legal models appear to cut misclassifications on rare classes in imbalanced data. The authors also flag CaseLaw-BERT and LexLM as useful additional baselines. This supplies concrete numbers where legal NLP previously relied mostly on the assumption that domain adaptation helps.

Referee Report

3 major / 0 minor

Summary. The paper evaluates 13 legal-specific Transformer-based models against 9 generalist models across three English contract classification tasks. It claims legal-specific models consistently outperform generalists (especially on nuanced legal understanding and rare-class handling in imbalanced data), with Legal-BERT and Contracts-BERT setting new SOTAs on two tasks despite 69% fewer parameters than top generalists; CaseLaw-BERT and LexLM are identified as strong baselines. The work concludes that these results demonstrate shortcomings of generalist models and the need for domain-specific customization in legal applications.

Significance. If the empirical results hold after supplying full experimental details, the paper would provide a useful first systematic comparison of legal-specific versus generalist Transformers on contract tasks, offering practical model-selection guidance and evidence for domain adaptation benefits in handling imbalance. The efficiency finding (SOTA with smaller models) would be a concrete contribution if statistically substantiated.

major comments (3)

[Methods/Experimental Setup] The description of the three tasks supplies no dataset sizes, class distributions, selection criteria for representativeness across contract types or jurisdictions, training protocols, hyperparameter search details, or statistical significance testing. These omissions make the central claims of consistent outperformance and new SOTAs unverifiable from the reported results.
[Abstract and Conclusion] The generalization that the results 'highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications' rests on only three tasks and 22 models total. No justification is given for why these tasks cover the space of legal contract classification sufficiently to support the broad claim.
[Results] Performance comparisons are presented without error bars, multiple-run statistics, or significance tests, so it is impossible to assess whether reported differences (including the SOTA claims) are reliable rather than artifacts of single runs or hyperparameter choices.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that several aspects of the experimental reporting require expansion to ensure verifiability. Below we respond point-by-point to the major comments. We will incorporate the requested details and statistical analyses in a revised manuscript.

Referee: [Methods/Experimental Setup] The description of the three tasks supplies no dataset sizes, class distributions, selection criteria for representativeness across contract types or jurisdictions, training protocols, hyperparameter search details, or statistical significance testing. These omissions make the central claims of consistent outperformance and new SOTAs unverifiable from the reported results.

Authors: We agree that the Methods section lacks sufficient detail for independent verification. In the revision we will add: (i) exact dataset sizes and class distributions for each task, (ii) explicit selection criteria and representativeness arguments for the chosen contract types and jurisdictions, (iii) full training protocols and hyperparameter search ranges, and (iv) statistical significance testing (e.g., McNemar or paired t-tests across multiple seeds). These additions will directly address the verifiability concern.
Revision: yes

Referee: [Abstract and Conclusion] The generalization that the results 'highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications' rests on only three tasks and 22 models total. No justification is given for why these tasks cover the space of legal contract classification sufficiently to support the broad claim.

Authors: The three tasks were selected to span distinct contract-classification challenges (nuanced legal reasoning, rare-class handling under imbalance, and multi-label settings). In the revision we will add an explicit subsection justifying task selection with reference to prior legal-NLP benchmarks and will qualify the generalization statement to reflect the evaluated scope while preserving the practical takeaway that domain-specific models showed advantages on these representative tasks.
Revision: partial

Referee: [Results] Performance comparisons are presented without error bars, multiple-run statistics, or significance tests, so it is impossible to assess whether reported differences (including the SOTA claims) are reliable rather than artifacts of single runs or hyperparameter choices.

Authors: We acknowledge that single-run reporting limits assessment of reliability. The revised manuscript will report mean and standard deviation over at least five random seeds for all models, include error bars in tables and figures, and apply appropriate significance tests (e.g., Wilcoxon signed-rank) between legal-specific and generalist models. This will allow readers to evaluate whether the reported SOTA margins are statistically supported.
Revision: yes
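
One of the significance tests named in the responses, McNemar's test, can be sketched as follows for two classifiers evaluated on the same test set. This is a minimal exact (binomial) variant written with the standard library only; the function name and toy predictions are hypothetical, and it is not a description of the analysis the authors will actually run.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers.

    Returns (b, c, p_value), where b counts examples only model A gets right
    and c counts examples only model B gets right.
    """
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): ten test examples, all with true label 1.
y_true = [1] * 10
pred_legal = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1]       # stands in for a legal-specific model
pred_generalist = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # stands in for a generalist model

b, c, p_value = mcnemar_exact(y_true, pred_legal, pred_generalist)
print(b, c, p_value)  # 6 1 0.125
```

Even with six discordant examples favoring one model and one favoring the other, the p-value stays at 0.125 on this tiny sample, which illustrates why raw accuracy gaps without such a test leave the reliability of a margin open.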

Circularity Check

0 steps flagged

No circularity in empirical model evaluation

The paper conducts a direct empirical comparison of 13 existing legal-specific and 9 generalist transformer models on three fixed English contract classification tasks, reporting observed performance metrics and SOTA results. No equations, derivations, fitted parameters, or predictions are present that could reduce to inputs by construction. No self-citation chains or ansatzes are invoked to justify core claims. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are present; the paper is an empirical model comparison study.

pith-pipeline@v0.9.0 · 5702 in / 1083 out tokens · 34795 ms · 2026-05-25T07:56:12.633216+00:00 · methodology
