Large Scale Legal Text Classification Using Transformer Models

Erwin Filtz; Gerhard Wohlgenannt; Zein Shaheen

arxiv: 2010.12871 · v1 · pith:7OS6E4A5new · submitted 2020-10-24 · 💻 cs.CL · cs.AI



keywords: classification, legal, text, created, datasets, eurlex57k, eurovoc, gradual


Large multi-label text classification is a challenging Natural Language Processing (NLP) problem concerned with classifying text in datasets with thousands of labels. We tackle this problem in the legal domain, where datasets such as JRC-Acquis and EURLEX57K, labeled with the EuroVoc vocabulary, were created within the legal information systems of the European Union. The EuroVoc taxonomy includes around 7000 concepts. In this work, we study the performance of various recent transformer-based models in combination with strategies such as generative pretraining, gradual unfreezing, and discriminative learning rates in order to reach competitive classification performance, and present new state-of-the-art results of 0.661 (F1) for JRC-Acquis and 0.754 (F1) for EURLEX57K. Furthermore, we quantify the impact of individual steps, such as language model fine-tuning or gradual unfreezing, in an ablation study, and provide reference dataset splits created with an iterative stratification algorithm.
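
Two of the strategies named in the abstract, gradual unfreezing and discriminative learning rates, can be illustrated with a small framework-free sketch. This is not the authors' implementation: the function names, the layer indexing (0 = lowest layer), the one-layer-per-epoch schedule, and the decay factor of 2.6 are all assumptions chosen for illustration.

```python
def discriminative_lrs(num_layers, base_lr, decay=2.6):
    """Assign one learning rate per layer.

    The top layer (index num_layers - 1) gets base_lr; each layer below
    it gets the rate of the layer above divided by `decay`.
    The default decay of 2.6 is an illustrative assumption.
    """
    return [base_lr / (decay ** (num_layers - 1 - i)) for i in range(num_layers)]


def trainable_layers(num_layers, epoch):
    """Return the indices of layers that are unfrozen at a given epoch.

    Assumed schedule: epoch 0 trains only the top layer, and each
    following epoch unfreezes one more layer below, until all are trainable.
    """
    n_unfrozen = min(num_layers, epoch + 1)
    return list(range(num_layers - n_unfrozen, num_layers))


if __name__ == "__main__":
    # Hypothetical 4-layer model with a base learning rate of 1e-3.
    rates = discriminative_lrs(num_layers=4, base_lr=1e-3)
    for epoch in range(5):
        active = trainable_layers(num_layers=4, epoch=epoch)
        print(epoch, [(layer, rates[layer]) for layer in active])
```

In a real training loop, the per-layer rates would typically be passed to the optimizer as separate parameter groups, and the frozen layers would have gradient computation disabled.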


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Transformer-Based Language Models Across Domain Verticals: Architectures, Applications and Critical Assessment
cs.CL · 2026-06 · unverdicted · novelty 2.0

A survey paper that taxonomizes transformer architectures, reviews domain applications, and critically assesses deployment trade-offs including parameter-energy costs and alignment issues.