arXiv preprint arXiv:1605.02592 (2016), https://arxiv.org/abs/1605.02592
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
fields
cs.CL 2
years
2026 2
verdicts
UNVERDICTED 2
citing papers explorer
- Multi-Dimensional Evaluation of LLMs for Grammatical Error Correction
  Fine-tuned GPT-4o reaches state-of-the-art on grammatical error correction while reference-based metrics underestimate performance by missing 73.76 percent of valid or superior outputs.
- Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
  Edit-level majority voting on multiple LLM-generated candidates reduces over-correction in grammatical error correction and outperforms greedy and MBR decoding on nine multilingual benchmarks while remaining stable to prompt variations.