pith. machine review for the scientific record.

arxiv: 2604.19593 · v2 · submitted 2026-04-21 · 💻 cs.CL · cs.AI · cs.LG

Recognition: unknown

RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:22 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Romanian · grammatical error correction · legal domain · parallel dataset · error detection · neural models · low-resource language

The pith

Romanian legal writing gets its first parallel dataset of 350,000 grammar errors

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RoLegalGEC, the first Romanian parallel dataset built specifically for detecting and correcting grammatical errors in legal texts. It supplies 350,000 annotated examples drawn from legal passages, addressing the shortage of realistic training data for this low-resource language and domain. The authors evaluate neural models, including knowledge-distillation Transformers, sequence-tagging architectures for detection, and pre-trained text-to-text Transformers for correction, showing that the dataset can be turned into usable tools. A sympathetic reader would care because accurate legal documents matter for professionals, and general-language tools often miss the specialized patterns found in law. The work therefore supplies both the resource and initial proof that domain-tuned models can be trained from it.

Core claim

RoLegalGEC aggregates 350,000 examples of errors in legal passages together with error annotations, forming the first Romanian-language parallel dataset for grammatical error detection and correction in the legal domain; several neural models, including knowledge-distillation Transformers, sequence-tagging detectors, and pre-trained text-to-text Transformers, can be trained on the data to perform both tasks.

What carries the argument

The RoLegalGEC dataset of parallel erroneous and corrected legal passages in Romanian with accompanying error annotations.
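A parallel GEC record of this kind pairs an erroneous passage with its correction and span-level error annotations. A minimal sketch of what one record and its application might look like — field names, the edit-tuple layout, and the error-type label are illustrative assumptions (loosely modeled on ERRANT/M2 conventions), not the dataset's actual schema:

```python
# Hypothetical shape of one RoLegalGEC-style record; schema is assumed.
record = {
    "source": "Contractul au fost semnat de ambele părți.",  # erroneous
    "target": "Contractul a fost semnat de ambele părți.",   # corrected
    "edits": [
        # (start token, end token, replacement, error type)
        (1, 2, "a", "VERB:AGREEMENT"),
    ],
}

def apply_edits(tokens, edits):
    """Apply span edits right-to-left so earlier indices stay valid."""
    out = list(tokens)
    for start, end, repl, _ in sorted(edits, reverse=True):
        out[start:end] = [repl] if repl else []
    return out

corrected = apply_edits(record["source"].split(), record["edits"])
assert " ".join(corrected) == record["target"]
```

Applying edits right-to-left keeps earlier span indices valid, a common convention in edit-based GEC tooling.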

Load-bearing premise

The 350,000 aggregated examples form a representative and high-quality collection of real grammatical errors that appear in Romanian legal writing.

What would settle it

Train any of the reported models on RoLegalGEC, then test it on a fresh collection of real Romanian legal documents whose errors have been marked by human experts; performance no better than a general-domain Romanian grammar model would undermine the dataset's claimed value.
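A test of that kind is conventionally scored at the edit level with precision, recall, and F0.5, which weights precision over recall as in the MaxMatch (M²) scorer. A minimal sketch, assuming edits are represented as hashable (start, end, replacement) tuples:

```python
def gec_f_beta(gold_edits, pred_edits, beta=0.5):
    """Precision, recall, and F-beta over sets of edits.

    GEC evaluations typically use beta=0.5 (precision-weighted),
    following the MaxMatch (M^2) scorer convention.
    """
    gold, pred = set(gold_edits), set(pred_edits)
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 1.0
    r = tp / len(gold) if gold else 1.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r) if (p + r) else 0.0
    return p, r, f

gold = {(1, 2, "a"), (5, 6, "părții")}
pred = {(1, 2, "a"), (3, 4, "fost")}
p, r, f = gec_f_beta(gold, pred)  # p=0.5, r=0.5, f=0.5
```

The comparison against a general-domain baseline then reduces to comparing F0.5 on the same expert-annotated test set.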

Figures

Figures reproduced from arXiv: 2604.19593 by Dumitru-Clementin Cercel, Mihaela-Claudia Cercel, Mircea Timpuriu.

Figure 1. Side-by-side example of a Romanian-language LLM prompt and an English-language LLM prompt.
Figure 2. Example of a common Romanian grammatical mistake (Nedelcu, 2012), marked in red.
Figure 3. Punctuation transition probability matrix.
Figure 4. The LLM few-shot prompting error generation pipeline, starting from a clean sentence extracted from the corpus.
Figure 5. RoLegalGEC dataset example, with erroneous sequences highlighted in red alongside their corrections.
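Figure 3's punctuation transition probability matrix is a simple corpus statistic. A sketch of how such a matrix could be estimated, assuming it records P(next mark | current mark) over consecutive punctuation marks in a sentence; the paper's actual punctuation inventory and estimation details are not reproduced here:

```python
from collections import Counter, defaultdict

def punctuation_transition_matrix(sentences, puncts=",.;:?!"):
    """Estimate P(next punctuation mark | current mark) from a corpus.

    Hypothetical reconstruction of the kind of statistic shown in
    Figure 3; the token set and any smoothing are assumptions.
    """
    counts = defaultdict(Counter)
    for sent in sentences:
        marks = [ch for ch in sent if ch in puncts]
        for cur, nxt in zip(marks, marks[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nc.values()) for nxt, c in nc.items()}
        for cur, nc in counts.items()
        for nc in [counts[cur]]
    }

m = punctuation_transition_matrix(["Art. 5, alin. 2, lit. a."])
# m["."] gives the distribution over marks following a period.
```

Such a matrix could plausibly guide synthetic punctuation-error generation by sampling unlikely transitions.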
read the original abstract

The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces RoLegalGEC, claimed to be the first Romanian-language parallel dataset for grammatical error detection and correction in the legal domain. It aggregates 350,000 annotated error examples from legal passages and evaluates baseline neural models including knowledge-distillation Transformers, sequence tagging architectures for detection, and pre-trained text-to-text Transformers for correction, with the goal of enriching resources for Romanian legal NLP.

Significance. If the data-construction pipeline, error taxonomy, and human-validation steps hold as described, this dataset fills a clear gap in low-resource, domain-specific GEC resources for Romanian. The provision of a large-scale parallel corpus with annotations plus reproducible baselines is a concrete strength that can directly support training and benchmarking of legal-domain tools, where textual accuracy carries high stakes.

minor comments (3)
  1. [Abstract] The abstract states that models were evaluated but gives no quantitative metrics, error analysis, or key performance figures; adding a one-sentence summary of results (e.g., F1 or correction accuracy ranges) would make the abstract self-contained.
  2. [Dataset Construction] The manuscript should include an explicit statement of the train/dev/test split sizes and the proportion of synthetic versus human-validated examples to support reproducibility claims.
  3. [Experiments] Figure captions and table headers use inconsistent terminology for error types (e.g., 'syntactic' vs. 'grammatical'); standardizing notation across visuals and text would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review of our work and for recommending minor revision. The referee accurately summarizes the contribution of RoLegalGEC as the first Romanian legal-domain GEC dataset. No major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is the introduction of the RoLegalGEC dataset, formed by aggregating 350,000 parallel examples of grammatical errors in Romanian legal passages together with annotations. This is a data-construction and resource-release claim whose validity rests on the described collection, error taxonomy, and validation pipeline rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. Standard baseline models (Transformers, sequence taggers, text-to-text models) are evaluated on the new resource using conventional training and metrics; no derivation chain reduces a claimed prediction or uniqueness result back to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Since only the abstract is available, the ledger is limited. The claim relies on the dataset being novel and useful, with standard assumptions in NLP about data quality and model applicability. No free parameters or invented entities are identifiable from the abstract.

pith-pipeline@v0.9.0 · 5544 in / 1048 out tokens · 52061 ms · 2026-05-10T03:22:18.620419+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1] Anil Rohan, Dai Andrew M., Firat Orhan, Johnson Melvin, Lepikhin Dmitry, Passos Alexandre, Shakeri Siamak, Taropa Emanuel, Bailey Paige, Chen Zhifeng, et al. PaLM 2 Technical Report // arXiv preprint arXiv:2305.10403.

  2. [2] Bryant Christopher, Felice Mariano, Andersen Øistein E., Briscoe Ted. The BEA-2019 Shared Task on Grammatical Error Correction // Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics.

  3. [3] Cotet Teodor-Mihai, Ruseti Stefan, Dascalu Mihai. Neural Grammatical Error Correction for Romanian // 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI).

  4. [4] Dahlmeier Daniel, Ng Hwee Tou. Better Evaluation for Grammatical Error Correction // Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada: Association for Computational Linguistics.

  5. [5] Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).

  6. [6] Faruqui Manaal, Pavlick Ellie, Tenney Ian, Das Dipanjan. WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics.

  7. [7] Felice Mariano, Briscoe Ted. Towards a Standard Evaluation Method for Grammatical Error Detection and Correction // Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics.

  8. [8] Hinton Geoffrey, Vinyals Oriol, Dean Jeff. Distilling the Knowledge in a Neural Network // arXiv preprint arXiv:1503.02531.

  9. [9] Kara Atakan, Marouf Sofian Farrin, Bond Andrew, Şahin Gözde. GECTurk: Grammatical Error Correction and Detection Dataset for Turkish // Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings). Nusa Dua, Bali: Association for Computational Linguistics.

  10. [10] Kiyono Shun, Suzuki Jun, Mita Masato, Mizumoto Tomoya, Inui Kentaro. An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China.

  11. [11] Pais Vasile, Mitrofan Maria, Gasan Carol Luca, Coneschi Vlad, Ianov Alexandru. Named Entity Recognition in the Romanian Legal Domain // Proceedings of the Natural Legal Language Processing Workshop.

  12. [12] Park Jeiyoon, Park Chanjun, Lim Heuiseok. Chatlang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction // arXiv preprint arXiv:2406.03202.

  13. [13] Sharma Ujjwal, Bhattacharyya Pushpak. IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics.

  14. [14] Stahlberg Felix, Kumar Shankar. Synthetic Data Generation for Low-Resource Grammatical Error Correction with Tagged Corruption Models // Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). Mexico City, Mexico: Association for Computational Linguistics.

  15. [15] Xue Linting, Constant Noah, Roberts Adam, Kale Mihir, Al-Rfou Rami, Siddhant Aditya, Barua Aditya, Raffel Colin. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics.