RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
Pith reviewed 2026-05-10 03:22 UTC · model grok-4.3
The pith
Romanian legal writing gets its first parallel dataset of 350,000 grammar errors
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RoLegalGEC aggregates 350,000 examples of errors in legal passages together with error annotations, forming the first Romanian-language parallel dataset for grammatical error detection and correction in the legal domain; several neural models, including knowledge-distillation Transformers, sequence-tagging detectors, and pre-trained text-to-text Transformers, can be trained on the data to perform both tasks.
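The knowledge-distillation objective such Transformer baselines typically use can be sketched in a few lines: a student model is trained to match a teacher's temperature-softened output distribution (after Hinton et al.). The logits below are invented for illustration; this is a minimal sketch of the loss, not the paper's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, optionally softened by a temperature."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's.

    Minimized (down to the teacher's own entropy) when the student
    reproduces the teacher's softened distribution exactly.
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Hypothetical per-class logits for one token position:
teacher = [2.0, 0.5, -1.0]
student = [1.8, 0.6, -0.9]
loss = distillation_loss(teacher, student)
```

A higher temperature flattens the teacher's distribution, exposing the relative probabilities of wrong classes ("dark knowledge") that a hard one-hot target would discard.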
What carries the argument
The RoLegalGEC dataset of parallel erroneous and corrected legal passages in Romanian with accompanying error annotations.
Load-bearing premise
The 350,000 aggregated examples form a representative and high-quality collection of real grammatical errors that appear in Romanian legal writing.
What would settle it
Train any of the reported models on RoLegalGEC, then test it on a fresh collection of real Romanian legal documents whose errors have been marked by human experts; performance no better than a general-domain Romanian grammar model would undermine the dataset's claimed value.
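The proposed check amounts to comparing span-level F0.5 (the precision-weighted score GEC evaluations conventionally report) for the legal-domain model and a general-domain baseline on the same expert-annotated test set. A minimal sketch, with counts invented purely for illustration (not results from the paper):

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Span-level F_beta; beta=0.5 weights precision over recall,
    penalizing spurious 'corrections' more than missed errors."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical true-positive / false-positive / false-negative counts
# on a fresh, human-annotated legal test set:
legal_model_f05 = f_beta(tp=420, fp=80, fn=150)
general_model_f05 = f_beta(tp=300, fp=140, fn=270)
```

Under this setup, the dataset's claimed value holds only if the legal-domain model clearly outscores the general-domain baseline on the same held-out documents.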
Figures
read the original abstract
The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RoLegalGEC, claimed to be the first Romanian-language parallel dataset for grammatical error detection and correction in the legal domain. It aggregates 350,000 annotated error examples from legal passages and evaluates baseline neural models including knowledge-distillation Transformers, sequence tagging architectures for detection, and pre-trained text-to-text Transformers for correction, with the goal of enriching resources for Romanian legal NLP.
Significance. If the data-construction pipeline, error taxonomy, and human-validation steps hold as described, this dataset fills a clear gap in low-resource, domain-specific GEC resources for Romanian. The provision of a large-scale parallel corpus with annotations plus reproducible baselines is a concrete strength that can directly support training and benchmarking of legal-domain tools, where textual accuracy carries high stakes.
minor comments (3)
- [Abstract] The claim of model evaluations is stated without any quantitative metrics, error analysis, or key performance figures; adding a one-sentence summary of results (e.g., F1 or correction accuracy ranges) would make the abstract self-contained.
- [Dataset Construction] The manuscript should include an explicit statement of the train/dev/test split sizes and the proportion of synthetic versus human-validated examples to support reproducibility claims.
- [Experiments] Figure captions and table headers use inconsistent terminology for error types (e.g., 'syntactic' vs. 'grammatical'); standardizing notation across visuals and text would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their positive review of our work and for recommending minor revision. The referee accurately summarizes the contribution of RoLegalGEC as the first Romanian legal-domain GEC dataset. No major comments were listed in the report.
Circularity Check
No significant circularity identified
full rationale
The paper's central contribution is the introduction of the RoLegalGEC dataset, formed by aggregating 350,000 parallel examples of grammatical errors in Romanian legal passages together with annotations. This is a data-construction and resource-release claim whose validity rests on the described collection, error taxonomy, and validation pipeline rather than any fitted parameters, self-referential definitions, or load-bearing self-citations. Standard baseline models (Transformers, sequence taggers, text-to-text models) are evaluated on the new resource using conventional training and metrics; no derivation chain reduces a claimed prediction or uniqueness result back to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Anil Rohan, Dai Andrew M., Firat Orhan, Johnson Melvin, Lepikhin Dmitry, Passos Alexandre, Shakeri Siamak, Taropa Emanuel, Bailey Paige, Chen Zhifeng, et al. PaLM 2 Technical Report // arXiv preprint arXiv:2305.10403. 2023.
- [2] Bryant Christopher, Felice Mariano, Andersen Øistein E., Briscoe Ted. The BEA-2019 Shared Task on Grammatical Error Correction // Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy: Association for Computational Linguistics, 2019. 103–113.
- [3] Cotet Teodor-Mihai, Ruseti Stefan, Dascalu Mihai. Neural Grammatical Error Correction for Romanian // 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI). 2020. 8440–8451.
- [4] Dahlmeier Daniel, Ng Hwee Tou. Better Evaluation for Grammatical Error Correction // Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Montréal, Canada: Association for Computational Linguistics, 2012. 625–631.
- [5] Devlin Jacob, Chang Ming-Wei, Lee Kenton, Toutanova Kristina. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding // Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. 568–572.
- [6] Faruqui Manaal, Pavlick Ellie, Tenney Ian, Das Dipanjan. WikiAtomicEdits: A Multilingual Corpus of Wikipedia Edits for Modeling Language and Discourse // Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, 2018.
- [7] Felice Mariano, Briscoe Ted. Towards a standard evaluation method for grammatical error detection and correction // Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Denver, Colorado: Association for Computational Linguistics, 2015. 305–315.
- [8] Hinton Geoffrey, Vinyals Oriol, Dean Jeff. Distilling the Knowledge in a Neural Network // arXiv preprint arXiv:1503.02531. 2015.
- [9] Kara Atakan, Marouf Sofian Farrin, Bond Andrew, Şahin Gözde. GECTurk: Grammatical Error Correction and Detection Dataset for Turkish // Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings). Nusa Dua, Bali: Association for Computational Linguistics, 2023.
- [10] Kiyono Shun, Suzuki Jun, Mita Masato, Mizumoto Tomoya, Inui Kentaro. An Empirical Study of Incorporating Pseudo Data into Grammatical Error Correction // Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China. 2019.
- [11] Pais Vasile, Mitrofan Maria, Gasan Carol Luca, Coneschi Vlad, Ianov Alexandru. Named Entity Recognition in the Romanian Legal Domain // Proceedings of the Natural Legal Language Processing Workshop. 163–170.
- [12] Park Jeiyoon, Park Chanjun, Lim Heuiseok. Chatlang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction // arXiv preprint arXiv:2406.03202. 2024.
- [13] Sharma Ujjwal, Bhattacharyya Pushpak. IndiGEC: Multilingual Grammar Error Correction for Low-Resource Indian Languages // Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, 2025.
- [14] Stahlberg Felix, Kumar Shankar. Synthetic Data Generation for Low-resource Grammatical Error Correction with Tagged Corruption Models // Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2024). Mexico City, Mexico: Association for Computational Linguistics, 2024. 305–321.
- [15] Xue Linting, Constant Noah, Roberts Adam, Kale Mihir, Al-Rfou Rami, Siddhant Aditya, Barua Aditya, Raffel Colin. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer // Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, 2021. 149–158.