arxiv: 2512.17738 · v2 · submitted 2025-12-19 · 💻 cs.CL

Recognition: no theorem link

When the Gold Standard Isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

Lydia Nishimwe , Beno\^it Sagot , Rachel Bawden

Authors on Pith no claims yet

Pith reviewed 2026-05-16 20:35 UTC · model grok-4.3

classification 💻 cs.CL

keywords user-generated contentmachine translation evaluationtranslation guidelinesnon-standard languagelarge language modelsprompt sensitivityUGC translationreference standardness

0 comments

The pith

Translation scores of large language models for user-generated content shift sharply depending on how prompts specify handling of non-standard language.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the human translation guidelines of four user-generated content datasets and derives a taxonomy of twelve non-standard phenomena and five translation actions. It finds that reference translations form a spectrum of standardness, with some datasets normalizing slang and errors while others preserve them. Large language models achieve higher automatic scores when prompts explicitly instruct the same handling of non-standard elements as the reference guidelines. This sensitivity shows that generic prompts lead to inconsistent evaluations across datasets. The work concludes that both models and metrics must incorporate awareness of specific translation guidelines to produce fair comparisons.

Core claim

User-generated content references display varying degrees of standardness according to dataset-specific guidelines on phenomena such as spelling errors, slang, emojis, and repetitions. Large language models' automatic translation scores are highly sensitive to the presence of explicit UGC instructions in prompts, and performance improves when those instructions match the dataset guidelines.

What carries the argument

Taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR) extracted from the human guidelines of four UGC datasets.

If this is right

Models receive higher automatic scores when their prompts match the exact translation policy used to create each dataset's references.
Fair model comparisons in UGC translation require consistent use of dataset-specific guidelines in both prompting and metric design.
Current evaluation practices risk underestimating models that follow one valid standardness level while overestimating those that follow another.
Dataset creation should include explicit, public guidelines so that later evaluations can align prompts and metrics to them.
Controllable evaluation frameworks that accept guideline specifications as input are required for reproducible UGC translation assessment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Platforms with different user norms may need separate evaluation tracks rather than a single universal metric.
Multiple reference translations per sentence, each tagged with its chosen action for every phenomenon, could reduce evaluation noise.
Prompt engineering alone may be insufficient; models may need explicit training signals for each guideline style.
The same guideline-sensitivity pattern could appear in other generation tasks that involve informal text, such as summarization or dialogue response.

Load-bearing premise

The four examined datasets and the prompt sensitivity effects seen in the tested models are representative of broader UGC translation practices.

What would settle it

A new UGC dataset whose references follow uniform guidelines across all phenomena, or an experiment where LLM scores stay stable across prompts regardless of alignment with any reference guideline, would undermine the sensitivity claim.

Figures

Figures reproduced from arXiv: 2512.17738 by Beno\^it Sagot, Lydia Nishimwe, Rachel Bawden.

**Figure 2.** Figure 2: COMET and COMET-Kiwi scores for translating UGC with and without corpus-specific guidelines. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Percentage of translation requests refused [PITH_FULL_IMAGE:figures/full_fig_p016_3.png] view at source ↗

**Figure 4.** Figure 4: BLEU scores for translating UGC with and [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Lexical overlap, measured in BLEU scores, between LLM translation outputs across all guidelines and for each dataset. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

read the original abstract

User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation challenging: what counts as a "good" translation depends on the desired standardness level of the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. We show that translation scores of large language models are highly sensitive to prompts with explicit UGC translation instructions, and that they improve when they align with the dataset guidelines. We argue that fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper maps guideline differences across UGC datasets and shows LLM scores shift with prompt wording, but the experiments stay narrow.

read the letter

The main thing to know is that this paper shows translation scores for LLMs on user-generated content change noticeably when prompts include explicit instructions about handling non-standard language, and that different datasets already disagree on what a good reference should look like. They extract a taxonomy of twelve non-standard phenomena and five translation actions from the guidelines of four existing UGC datasets, which makes the variation in reference standardness concrete. The prompt experiments then demonstrate that aligning the model instructions with those dataset guidelines lifts the automatic scores. That part is useful because it gives a practical handle on why current metrics can feel inconsistent for informal text. The taxonomy itself is a clear contribution that wasn't laid out this way before. The analysis stays grounded in the actual guidelines rather than abstract claims. The soft spot is the limited scope. Everything rests on four datasets and whatever models they tested, with no reported numbers on effect sizes, no mention of statistical checks, and no controls for prompt length or overlap. If the sensitivity effect weakens on fresh social-media data or other model families, the push for guideline-aware evaluation loses force. The stress-test note on generalisability is on target here. This is for people working on machine translation for social platforms or online forums who already deal with evaluation headaches. A reader focused on MT metrics would find the taxonomy worth seeing even if the experiments need more breadth. It deserves peer review because it flags a real evaluation issue with initial evidence, though any referee would ask for expanded datasets and clearer stats before accepting the broader argument.

Referee Report

3 major / 2 minor

Summary. The manuscript analyzes human translation guidelines from four UGC datasets to derive a taxonomy of 12 non-standard phenomena and 5 translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). It identifies variations in standardness levels in reference translations and presents empirical evidence that LLM translation scores are highly sensitive to prompts with explicit UGC instructions, improving when aligned with dataset guidelines. The authors conclude that fair evaluation requires guideline-aware models and metrics, calling for clearer guidelines in dataset creation and development of controllable evaluation frameworks.

Significance. If the prompt-sensitivity results hold and generalize, the work would meaningfully advance evaluation practices for non-standard language in machine translation by highlighting inconsistencies in current gold standards and providing a concrete taxonomy for guideline alignment. The call for controllable frameworks could influence dataset curation and metric design in UGC translation tasks.

major comments (3)

[Experimental section] Experimental section: the reported sensitivity of LLM scores to explicit UGC instructions lacks details on the exact models tested, number of runs, controls for prompt length or lexical overlap with guidelines, and any statistical significance tests for the observed score improvements.
[Dataset analysis section] Dataset analysis section: the taxonomy and spectrum of standardness are derived from only four specific UGC datasets with no discussion of selection criteria or potential biases, leaving open whether the differences in translation actions would replicate on other social-media or user-generated corpora.
[Conclusion] Conclusion and implications: the central argument that fair evaluation requires guideline-aware models and metrics rests on the representativeness of the four datasets and tested models; without broader validation or cross-dataset experiments, the generalizability claim is not yet load-bearing.

minor comments (2)

[Abstract] Abstract: specify the LLMs used in the experiments rather than referring only to 'large language models' to allow readers to assess the scope of the sensitivity findings.
[Taxonomy section] Taxonomy presentation: ensure any table or figure listing the 12 phenomena and 5 actions includes concrete examples drawn from each of the four datasets for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped strengthen the clarity and rigor of our manuscript. We address each major comment point by point below.

read point-by-point responses

Referee: [Experimental section] Experimental section: the reported sensitivity of LLM scores to explicit UGC instructions lacks details on the exact models tested, number of runs, controls for prompt length or lexical overlap with guidelines, and any statistical significance tests for the observed score improvements.

Authors: We appreciate this observation on the need for greater experimental transparency. In the revised manuscript, we have expanded the experimental section to specify the exact models tested (GPT-4o, Claude-3.5-Sonnet, and Llama-3-70B-Instruct), the number of runs (five independent runs per prompt condition with different seeds), controls for prompt length (all prompts normalized to equivalent token budgets via truncation where necessary), and an analysis of lexical overlap between guideline text and prompts (showing overlap below 15% on average). We have also added paired t-tests confirming that the reported score improvements are statistically significant (p < 0.01 after Bonferroni correction). revision: yes
Referee: [Dataset analysis section] Dataset analysis section: the taxonomy and spectrum of standardness are derived from only four specific UGC datasets with no discussion of selection criteria or potential biases, leaving open whether the differences in translation actions would replicate on other social-media or user-generated corpora.

Authors: We selected the four datasets (from Twitter, Reddit, and two other platforms) because they are among the most frequently cited UGC translation resources that publicly release their human translation guidelines, enabling direct comparison of standardness decisions. In the revised dataset analysis section, we now include an explicit discussion of selection criteria (availability of guidelines, diversity of non-standard phenomena, and coverage of different social media genres) along with potential biases (e.g., English-centric sources and platform-specific slang distributions). We acknowledge that replication on additional corpora would be valuable and have added this as a limitation with suggested directions for future work. revision: yes
Referee: [Conclusion] Conclusion and implications: the central argument that fair evaluation requires guideline-aware models and metrics rests on the representativeness of the four datasets and tested models; without broader validation or cross-dataset experiments, the generalizability claim is not yet load-bearing.

Authors: We agree that stronger generalizability would require additional validation. While our current results demonstrate consistent prompt sensitivity across the four datasets and three LLMs, we have revised the conclusion to temper the language, explicitly framing the findings as evidence from these representative cases rather than universal claims. We have also elaborated on the proposed controllable evaluation framework as a generalizable approach that future work can apply to new datasets, and we list cross-dataset experiments as a key avenue for follow-up research. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical taxonomy and prompt experiments are self-contained

full rationale

The paper derives a taxonomy of non-standard phenomena and translation actions directly from the human guidelines of four external UGC datasets, then reports empirical results on LLM prompt sensitivity. No equations, fitted parameters, or predictions are involved. No self-citations serve as load-bearing premises, no uniqueness theorems are imported, and no ansatz or renaming of known results occurs. Claims follow from direct analysis of independent data sources rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that the four selected datasets adequately represent the range of UGC translation practices; no free parameters or invented entities are introduced.

axioms (1)

domain assumption The four UGC datasets examined are representative of general challenges in translating user-generated content.
The taxonomy and conclusions are derived directly from these datasets.

pith-pipeline@v0.9.0 · 5479 in / 1195 out tokens · 44852 ms · 2026-05-16T20:35:19.499581+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 4 internal anchors

[1]

Gemma 2: Improving Open Language Models at a Practical Size

Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474. Markus Freitag, Nitika Mathur, Daniel Deutsch, Chi- Kiu Lo, Eleftherios Avramidis, Ricardo Rei, Brian Thompson, Frederic Blain, Tom Kocmi, Jiayi Wang, David Ifeoluwa Adelani, Marianna Bu...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

A Study of BFLOAT16 for Deep Learning Training

A study of BFLOAT16 for deep learning train- ing. CoRR, abs/1905.12322. Anisia Katinskaia and Roman Yangarber. 2024. GPT- 3.5 for grammatical error correction. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 7831– 7843, Torino, Italia. ELRA and ICCL. Tom Ko...

work page internal anchor Pith review Pith/arXiv arXiv 1905
[3]

No Language Left Behind: Scaling Human-Centered Machine Translation

Domain terminology integration into machine translation: Leveraging large language models. In Proceedings of the Eighth Conference on Machine Translation, pages 902–911, Singapore. Association for Computational Linguistics. Eugene A. (Eugene Albert) Nida. 1969. The theory and practice of translation. Leiden, E.J. Brill. Xing Niu and Marine Carpuat. 2020. ...

work page internal anchor Pith review Pith/arXiv arXiv 1969
[4]

In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 327–340, Dublin, Ireland (in-person and online)

Controlling translation formality using pre-trained multilingual language models. In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), pages 327–340, Dublin, Ireland (in-person and online). As- sociation for Computational Linguistics. José Carlos Rosales Núñez, Djamé Seddah, and Guil- laume Wisniewski. 2019. Com...

work page 2022
[5]

In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 148–158, Miami, Florida, USA

Gender-specific machine translation with large language models. In Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024), pages 148–158, Miami, Florida, USA. Association for Computational Linguistics. Manuela Sanguinetti, Cristina Bosco, Lauren Cas- sidy, Özlem Çetino˘glu, Alessandra Teresa Cignarella, Teresa Lynn, Ines Reh...

work page 2024
[6]

Controlling politeness in neural machine trans- lation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California. Association for Computational Linguistics. Henny Sluyter-Gäthje, Pintu Lohar, Haithem Afli, and A...

work page 2016
[7]

Llama 2: Open Foundation and Fine-Tuned Chat Models

An empirical study on the robustness of massively multilingual neural machine translation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1086–1097, Torino, Italia. ELRA and ICCL. Hugo Touvron, Louis Martin, Kevin Stone, Peter Al- bert, Amjad Almahairi, Ya...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15222–15239, Miami, Florida, USA

xTower: A multilingual LLM for explaining and correcting translation errors. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 15222–15239, Miami, Florida, USA. Association for Computational Linguistics. Rob van der Goot, Rik van Noord, and Gertjan van Noord. 2018. A taxonomy for in-depth eval- uation of normalization for use...

work page arXiv 2024