How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

Artem Chernodub; Kateryna Karpo

arxiv: 2606.09334 · v1 · pith:CPX25CAYnew · submitted 2026-06-08 · 💻 cs.CL


Pith reviewed 2026-06-27 16:32 UTC · model grok-4.3


keywords: grammatical error correction · Ukrainian language · prompt engineering · large language models · minimal edits · few-shot learning · zero-shot learning · UNLP benchmark


The pith

Ukrainian minimal-edit prompting with commercial LLMs closes over 90 percent of the gap to fine-tuned grammatical error correction systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether prompting alone can handle minimal-edit Ukrainian grammatical error correction with commercial LLMs. On the UNLP 2023 GEC-only benchmark, Gemini 3.1-Pro with Ukrainian minimal-edits prompts, few-shot examples, and LLM-assisted prompt optimization reaches an F0.5 of 69.22, compared with 73.14 for the best fine-tuned system. The language-specific rules in the minimal-edits prompts need Ukrainian to be expressed precisely, and these prompts work best when combined with few-shot examples and LLM-assisted optimization. Prompting can therefore approach fine-tuned performance on this benchmark, but it still drops some low-frequency error categories and introduces recurring overcorrections.

Core claim

Our best configuration using Gemini 3.1-Pro with LLM-assisted prompt optimization on minimal-edits and few-shot prompts achieves F0.5=69.22 on the UNLP 2023 GEC-only benchmark. This closes over 90% of the gap to the fine-tuned SOTA of F0.5=73.14. Zero-shot Ukrainian instructions help only Claude models, while all models perform best with Ukrainian minimal-edits prompts. Detailed instructions improve results on punctuation and case errors but lead models to ignore several low-frequency categories. Five recurring overcorrection patterns related to Ukrainian linguistic phenomena are identified in the error analysis.
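The arithmetic behind the headline numbers can be sketched as follows. This is an illustrative calculation, not the paper's evaluation code: the function names are hypothetical, and the baseline from which the "gap" is measured is not stated in this review, so the sketch only derives what baseline the 90% claim would imply under the usual definition (score minus baseline, divided by target minus baseline).

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta = 0.5 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def gap_closed(score: float, target: float, baseline: float) -> float:
    """Fraction of the baseline-to-target gap covered by `score` (assumed definition)."""
    return (score - baseline) / (target - baseline)


def max_baseline_for(score: float, target: float, fraction: float) -> float:
    """Largest baseline for which `score` still closes at least `fraction` of the gap."""
    return (score - fraction * target) / (1 - fraction)


BEST, SOTA = 69.22, 73.14  # F0.5 values reported in the review

# Ratio of the prompted score to the fine-tuned score.
print(round(BEST / SOTA, 4))

# Under the assumed definition, "over 90% of the gap" holds only if the
# starting baseline is at or below this F0.5 value.
print(round(max_baseline_for(BEST, SOTA, 0.90), 2))
```

Read this way, the prompted score is roughly 95% of the fine-tuned score, while the gap-closure figure depends on which starting point the authors measure from.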

What carries the argument

Ukrainian minimal-edits prompts that specify language-specific correction rules, combined with few-shot examples and LLM-assisted optimization.
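A minimal sketch of how such a prompt might be assembled is shown below. It is an assumption-laden illustration, not the authors' released prompt: `Example`, `build_prompt`, the English section labels, and the placeholder sentences are all hypothetical, and the paper's best prompts are written in Ukrainian. The single rule shown paraphrases one of the rules visible in the extracted prompt text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    source: str     # sentence containing errors
    corrected: str  # minimally edited correction


def build_prompt(rules: List[str], examples: List[Example], sentence: str) -> str:
    """Combine a minimal-edit instruction, language rules, and few-shot pairs."""
    parts = [
        "Correct only clear errors. Change as few words as possible.",
        "If the sentence is already correct, return it unchanged.",
        "",
        "Rules:",
    ]
    parts += [f"- {rule}" for rule in rules]
    parts.append("")
    for ex in examples:  # few-shot demonstrations
        parts += [f"Input: {ex.source}", f"Output: {ex.corrected}", ""]
    # The sentence to correct goes last, leaving the output slot open.
    parts += [f"Input: {sentence}", "Output:"]
    return "\n".join(parts)


# Hypothetical placeholders; real prompts would use Ukrainian text.
rules = ['Preposition "ob" is used before vowels, "o" before consonants.']
examples = [Example("<erroneous sentence 1>", "<corrected sentence 1>")]
prompt = build_prompt(rules, examples, "<sentence to correct>")
print(prompt)
```

Keeping rules, demonstrations, and the target sentence as separate inputs makes it straightforward to ablate each component, which mirrors the zero-shot, few-shot, and minimal-edits comparison the paper reports.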

If this is right

Zero-shot prompts in Ukrainian improve performance only for Claude models among the tested LLMs.
Minimal-edits prompts in Ukrainian outperform those in English for every model tested.
LLM-assisted prompt optimization yields the single highest score when added to minimal-edits plus few-shot.
Detailed minimal-edits instructions produce the largest gains on punctuation and case errors.
Five recurring overcorrection patterns appear that are linked to Ukrainian-specific features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar prompting strategies might allow rapid deployment of GEC for other low-resource languages without large training sets.
Hybrid systems could combine these prompts with targeted rules to address the observed overcorrection patterns.
The abandonment of low-frequency error categories suggests that prompting may require supplementary mechanisms for complete coverage.
Evaluating the same prompts on out-of-domain Ukrainian text would test whether the benchmark scores generalize.

Load-bearing premise

The UNLP 2023 GEC-only benchmark and its minimal-edit evaluation protocol match the distribution and desired behavior of real-world Ukrainian grammatical errors.

What would settle it

Running the same prompted models on a freshly collected set of Ukrainian errors from authentic sources like forums or documents and measuring the F0.5 score under minimal-edit rules would confirm or refute the performance claims.

Figures

Figures reproduced from arXiv: 2606.09334 by Artem Chernodub, Kateryna Karpo.

**Figure 2.** RQ4: Per-error-type F0.5 on the UNLP 2023 test set. Baseline: GPT-4.1-mini (zero-shot (A.1.1), EN); Optimized: Gemini 3.1-Pro (minimal-edits + few-shot + optimized-v2 (A.5.2), UA). Error types sorted as in…

**Figure 3.** Per-sentence inference flow. Document mark…

Original abstract

Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Prompting gets close to fine-tuned results on Ukrainian GEC but the minimal-edit benchmark and protocol limit how far the numbers generalize.


The paper shows that commercial LLMs prompted with Ukrainian minimal-edit instructions can reach F0.5 of 69.22 on the UNLP 2023 benchmark, closing most of the gap to the fine-tuned SOTA at 73.14. Gemini 3.1-Pro leads. Ukrainian minimal-edit prompts give the best results for every model, whereas in the zero-shot setting Ukrainian instructions help only the Claude models.

What stands out is the scale: 11 commercial models from four providers plus one open-source Ukrainian model, tested across zero-shot, few-shot, minimal-edit, and optimization setups. They document five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena, and they release the prompts, code, and outputs. That release is the most immediately useful part.

The soft spot is the evaluation setup itself. The abstract notes that detailed minimal-edit rules make models drop low-frequency error categories, and the stress-test concern holds: if real users prefer fluent rewrites over strict minimal changes, these scores do not directly show practical value. No statistical significance or variance numbers are mentioned in the abstract, and commercial model versions change, so the exact ranking may not stick. The work stays empirical and avoids circular claims.

This is for researchers doing GEC on low-resource languages or testing prompting for structured output tasks. It adds concrete numbers where few existed and deserves peer review to check the methods and discuss how well the benchmark matches user needs.

Referee Report

2 major / 3 minor

Summary. The paper evaluates 11 commercial LLMs and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark using zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. It reports that Gemini 3.1-Pro with Ukrainian minimal-edits + few-shot prompting and LLM-assisted prompt optimization reaches F0.5=69.22 (closing >90% of the gap to fine-tuned SOTA at 73.14), notes that in zero-shot Ukrainian instructions help only the Claude models, identifies five recurring overcorrection patterns, and observes that detailed minimal-edit rules improve punctuation and case errors but cause abandonment of low-frequency error categories. Code, prompts, and outputs are released.

Significance. If the UNLP 2023 benchmark and minimal-edit protocol are accepted as representative, the result shows that carefully engineered prompting can nearly match fine-tuned performance for Ukrainian GEC without task-specific training, which is relevant for other low-resource languages. The public release of prompts and outputs supports reproducibility. The work also surfaces Ukrainian-specific linguistic phenomena in over-corrections.

major comments (2)

[Abstract] Abstract: the headline claim that Gemini 3.1-Pro 'closes over 90% of the gap' to fine-tuned SOTA is presented as a primary result, yet the error analysis shows that the minimal-edit protocol causes models to abandon several low-frequency error categories. This makes the numerical gap-closure claim load-bearing only under the specific benchmark protocol and weakens the implied practical conclusion unless qualified.
[Results] Results section (performance table): single-point F0.5 scores are reported without standard deviations, multiple runs, or statistical significance tests against the SOTA baseline. Given that the central claim rests on the 69.22 vs. 73.14 comparison, the absence of uncertainty estimates makes it impossible to judge whether the gap closure is reliable.
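One way to obtain the uncertainty estimates requested in the second major comment, without extra API calls, is a sentence-level bootstrap over the existing outputs. The sketch below assumes per-sentence true-positive, false-positive, and false-negative edit counts are available from the scorer; the helper names and the toy counts are hypothetical, and nothing here reproduces the paper's numbers.

```python
import random
from typing import List, Tuple

Counts = Tuple[int, int, int]  # (TP, FP, FN) edit counts for one sentence


def f_beta_from_counts(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta computed directly from edit counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def corpus_f_beta(counts: List[Counts], beta: float = 0.5) -> float:
    """Corpus-level score: sum the counts first, then compute F-beta."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_beta_from_counts(tp, fp, fn, beta)


def bootstrap_ci(counts: List[Counts], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval, resampling sentences with replacement."""
    rng = random.Random(seed)
    scores = sorted(
        corpus_f_beta(rng.choices(counts, k=len(counts)))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return scores[lo_idx], scores[hi_idx]


# Toy counts for illustration only.
toy_counts = [(2, 0, 1), (1, 1, 0), (0, 1, 2), (3, 0, 0)] * 25
low, high = bootstrap_ci(toy_counts)
print(corpus_f_beta(toy_counts), low, high)
```

This captures sampling variability over test sentences only. Run-to-run variation of the API models would still need repeated calls.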

minor comments (3)

[Abstract] The abstract states 'Gemini 3.1-Pro'; confirm the exact model name and version against the experimental setup section for consistency.
[Error Analysis] Error analysis identifies five over-correction patterns but does not quantify their frequency or contribution to the overall F0.5 drop; adding counts or a breakdown table would strengthen the section.
The paper would benefit from an explicit limitations paragraph discussing how the minimal-edit preference may diverge from user expectations for fluency in real-world Ukrainian GEC.
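The counts requested in the error-analysis comment could be approximated by comparing the edits a model makes against the edits in the gold correction. The sketch below is a rough stand-in that uses whitespace tokens and Python's `difflib`. It is not the benchmark's scorer, the function names are hypothetical, and the toy strings are placeholders rather than Ukrainian data.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

Edit = Tuple[int, int, Tuple[str, ...]]  # (source start, source end, replacement tokens)


def token_edits(source: str, target: str) -> Set[Edit]:
    """Token-level edits that turn `source` into `target`."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return {
        (i1, i2, tuple(tgt[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def overcorrections(source: str, hypothesis: str, reference: str) -> Set[Edit]:
    """Edits the model made that the gold reference did not ask for."""
    return token_edits(source, hypothesis) - token_edits(source, reference)


# Toy example: the reference changes one token, the model changes two.
extra = overcorrections("a b c d", hypothesis="a B c D", reference="a B c d")
print(extra)
```

Summing `len(overcorrections(...))` per sentence, grouped by a manually assigned pattern label, would give the frequency breakdown the comment asks for.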

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation of minor revision. We address each major comment below, agreeing that the abstract claim benefits from additional qualification and that the results reporting can be clarified.


Referee: [Abstract] Abstract: the headline claim that Gemini 3.1-Pro 'closes over 90% of the gap' to fine-tuned SOTA is presented as a primary result, yet the error analysis shows that the minimal-edit protocol causes models to abandon several low-frequency error categories. This makes the numerical gap-closure claim load-bearing only under the specific benchmark protocol and weakens the implied practical conclusion unless qualified.

Authors: We agree that the gap-closure figure is protocol-specific and that the observed abandonment of low-frequency error categories (already detailed in the error analysis) represents an important trade-off. We will revise the abstract to qualify the primary claim by explicitly noting that the reported performance is achieved under the minimal-edit prompting protocol, which involves such category-specific trade-offs. This will better contextualize the headline result without altering the numerical findings.

Revision: yes

Referee: [Results] Results section (performance table): single-point F0.5 scores are reported without standard deviations, multiple runs, or statistical significance tests against the SOTA baseline. Given that the central claim rests on the 69.22 vs. 73.14 comparison, the absence of uncertainty estimates makes it impossible to judge whether the gap closure is reliable.

Authors: We acknowledge that single-point estimates without uncertainty quantification limit assessment of the comparison's reliability. Our experiments used single runs primarily due to the prohibitive cost of repeated API calls across 12 models and multiple prompting configurations. We will add a clarifying sentence in the results section noting this practical constraint and the largely deterministic nature of the evaluated prompts. We also observe that the fine-tuned SOTA baseline is itself reported as a single point in the UNLP 2023 literature.

Revision: partial

Circularity Check

0 steps flagged

No circularity: direct empirical measurements on external benchmark


The paper performs an empirical evaluation of commercial and open-source LLMs on the fixed external UNLP 2023 GEC-only benchmark using various prompting strategies. Reported metrics (F0.5 scores) are direct measurements of model outputs against the benchmark's gold corrections under the authors' minimal-edit protocol. No equations, parameter fitting, derivations, or self-citation chains are used to support any claimed result; the central numbers (e.g., Gemini 3.1-Pro F0.5=69.22 vs. fine-tuned SOTA 73.14) are obtained by running the models and scoring them. The work is therefore self-contained against the external benchmark with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study. No mathematical derivations, fitted constants, or postulated entities are introduced.

pith-pipeline@v0.9.1-grok · 5741 in / 1083 out tokens · 21992 ms · 2026-06-27T16:32:26.462057+00:00 · methodology

