pith. machine review for the scientific record.

arxiv: 2601.02933 · v3 · submitted 2026-01-06 · 💻 cs.CL · cs.HC

Recognition: 2 Lean theorem links

Pearmut: Human Evaluation of Translation Made Trivial

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:51 UTC · model grok-4.3

classification 💻 cs.CL cs.HC
keywords human evaluation · machine translation · evaluation platform · multilingual NLP · direct assessment · error span annotation · translation quality

The pith

Pearmut makes end-to-end human evaluation of machine translation as easy to run as automatic metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Human evaluation is the most trustworthy way to judge translation quality and other multilingual tasks, yet it is routinely replaced by automatic metrics because existing tools impose heavy setup and operational costs. The paper introduces Pearmut as a lightweight platform that supplies the full range of needed features in one package, including standard protocols and built-in quality controls. If the platform delivers on its promise, researchers could treat careful human judgments as a normal, repeatable step in building and checking models instead of an occasional, high-effort exercise. The work therefore targets the practical gap between what is known to be best and what is actually done in daily practice.

Core claim

Pearmut is a lightweight platform that implements standard human evaluation protocols including DA, ESA, and MQM for machine translation, while supplying document-level context, absolute and contrastive scoring, attention checks, ESA^AI pre-annotations, and both static and dynamic assignment strategies, so that reliable human evaluation becomes a routine component of model development rather than an occasional effort.
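
To make the claimed feature surface concrete, the sketch below shows what a campaign definition covering these options might look like. The field names and structure are assumptions for illustration only, not Pearmut's actual configuration schema.

    # Illustrative sketch only: a plausible shape for a "lightweight" campaign
    # definition covering the features the claim lists. Field names are assumed,
    # not taken from Pearmut's documentation or code.
    campaign = {
        "protocol": "ESA",                    # paper lists DA, ESA, and MQM
        "scoring": "contrastive",             # absolute scoring is the alternative
        "document_context": True,             # show surrounding segments from the same document
        "attention_check_rate": 0.05,         # hypothetical fraction of items used as quality controls
        "pre_annotations": "ESA^AI",          # machine-suggested error spans the annotator corrects
        "assignment": "dynamic",              # versus a static, pre-computed assignment
        "systems": ["system_a", "system_b"],
        "language_pairs": [("en", "cs"), ("en", "de")],
    }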

What carries the argument

The Pearmut platform, which integrates standard evaluation protocols with features that remove common entry barriers for multilingual and translation tasks.

Load-bearing premise

The described features can be delivered in a truly lightweight package that removes entry barriers without creating new operational overhead for users.

What would settle it

The claim of equivalent ease would be falsified by a timed user study in which setting up and completing a full human evaluation with Pearmut requires substantially more effort or expertise than running a standard automatic metric.
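
The baseline such a test sets is concrete: automatic evaluation is a few lines of code. A minimal sketch using sacreBLEU as a stand-in metric (the paper does not single one out) shows the level of effort a Pearmut study would have to approach.

    # The effort baseline implied by "as easy to run as automatic evaluation":
    # a corpus-level automatic metric takes a handful of lines with no annotators,
    # recruitment, or quality control. sacreBLEU is used here only as a
    # representative metric, not one named by the paper.
    import sacrebleu

    hypotheses = ["The cat sat on the mat.", "He went to the store yesterday."]
    references = [["The cat is sitting on the mat.", "He went to the shop yesterday."]]

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    print(f"BLEU = {bleu.score:.1f}")  # a single score, seconds of setup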

Figures

Figures reproduced from arXiv: 2601.02933 by Tom Kocmi, Vilém Zouhar.

Figure 1: Screenshot of the Pearmut annotation interface with contrastive ESA protocol together with guidelines.
Figure 2: Screenshot of the Pearmut dashboard interface with mock model results. Horizontal lines show statistically …
Figure 3: Screenshot of the Pearmut annotation interface with multimodal inputs or outputs. From top: speech …
Figure 4: Deuteranopia-colorblind simulated screenshot (…)
Figure 5: Illustration of various methods of assigning evaluation items to annotators. The single …
Figure 6: Time diagram of actions based on user annotations. The …
read the original abstract

Human evaluation is the gold standard for multilingual NLP, but it is often skipped in practice and substituted with automatic metrics because it is notoriously complex and slow to set up with existing tools, which carry substantial engineering and operational overhead. We introduce Pearmut, a lightweight yet feature-rich platform that makes end-to-end human evaluation as easy to run as automatic evaluation. Pearmut removes common entry barriers and provides support for evaluating multilingual tasks, with a particular focus on machine translation. The platform implements standard evaluation protocols, including DA, ESA, and MQM, and is extensible to support new protocols. It features document-level context, absolute and contrastive evaluation, attention checks, ESA^AI pre-annotations, and both static and dynamic assignment strategies. Pearmut enables reliable human evaluation to become a practical, routine component of model development and diagnosis rather than an occasional effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Pearmut, a platform for human evaluation of machine translation that supports standard protocols (DA, ESA, MQM), document-level context, attention checks, ESA^AI pre-annotations, and both static and dynamic assignment. It claims to remove common engineering and operational barriers so that end-to-end human evaluation becomes as easy to run as automatic metrics, with extensibility for new protocols and a focus on multilingual tasks.

Significance. If the platform can be shown to deliver the listed features with genuinely low overhead, it would address a persistent practical barrier in multilingual NLP by enabling routine, reliable human evaluation rather than occasional, high-effort studies. The combination of multiple protocols and document context is a positive design choice that aligns with current best practices in MT evaluation.

major comments (2)
  1. [Abstract] The central claim that Pearmut is 'lightweight' and makes human evaluation 'as easy to run as automatic evaluation' is unsupported by any setup-time measurements, configuration-effort comparisons against MTurk/Appen baselines, user-study results, or released code/deployment logs. This absence directly undermines the practicality assertion that constitutes the paper's main contribution.
  2. [Abstract] The feature enumeration (DA/ESA/MQM protocols, document context, attention checks, ESA^AI pre-annotations, static/dynamic assignment) is presented without implementation details or any quantitative validation of overhead. Because the manuscript supplies only protocol descriptions, the load-bearing claim that these features can be delivered without introducing new operational costs remains untested.
minor comments (1)
  1. [Abstract] Add explicit citations to existing human-evaluation platforms (e.g., Appen, MTurk, or prior MT-specific tools) to clarify the incremental contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for acknowledging the platform's alignment with current best practices in MT evaluation. We address each major comment below and describe the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The central claim that Pearmut is 'lightweight' and makes human evaluation 'as easy to run as automatic evaluation' is unsupported by any setup-time measurements, configuration-effort comparisons against MTurk/Appen baselines, user-study results, or released code/deployment logs. This absence directly undermines the practicality assertion that constitutes the paper's main contribution.

    Authors: We agree that the abstract's central claim requires supporting evidence to be fully substantiated. The current manuscript is a system-description paper that focuses on the platform's architecture and feature set rather than empirical benchmarks. In the revised version we will add a new 'Deployment and Overhead' section that reports concrete setup-time measurements for a standard DA task, qualitative configuration-effort comparisons with MTurk and Appen, and internal deployment logs. We will also release the code and deployment scripts upon acceptance so that readers can verify the overhead claims directly. revision: yes

  2. Referee: [Abstract] The feature enumeration (DA/ESA/MQM protocols, document context, attention checks, ESA^AI pre-annotations, static/dynamic assignment) is presented without implementation details or any quantitative validation of overhead. Because the manuscript supplies only protocol descriptions, the load-bearing claim that these features can be delivered without introducing new operational costs remains untested.

    Authors: We acknowledge that the manuscript currently provides high-level protocol descriptions without accompanying implementation specifics or overhead metrics. In revision we will expand the 'Implementation' and 'Features' sections to detail how each capability is realized (for example, the data model used for document-level context and the integration of attention checks). Any available quantitative overhead figures from our development and pilot deployments will be reported; where such data are not yet available we will explicitly note the limitation and outline planned validation experiments. revision: yes
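
A minimal sketch of what the promised data model could look like, assuming one record per segment with a document grouping key and a quality-control flag; all field names are hypothetical rather than taken from Pearmut.

    # A minimal sketch of a document-level evaluation item with built-in quality
    # control, assuming a plausible shape for the data model the rebuttal promises
    # to document. All field names are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class EvaluationItem:
        doc_id: str                           # groups segments so document-level context can be shown
        segment_index: int                    # position of the segment within its document
        source: str
        translation: str
        system: str
        is_attention_check: bool = False      # deliberately corrupted item used to screen annotators
        pre_annotated_spans: list = field(default_factory=list)  # e.g. ESA^AI error spans to correct

    # Document-level context is then the ordered segments sharing a doc_id, and
    # per-annotator attention-check accuracy falls out of the boolean flag.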

Circularity Check

0 steps flagged

No circularity: descriptive tool-introduction paper with no derivations or fitted quantities

full rationale

The manuscript is a pure software description introducing Pearmut features (DA/ESA/MQM protocols, document context, attention checks, ESA^AI pre-annotations, assignment strategies). It contains no equations, no parameter fitting, no predictions of quantities, and no derivation chain. Central claims are implemented-feature enumerations rather than results derived from inputs. No self-citations are load-bearing for any mathematical or predictive step. The absence of any reduction-to-inputs structure yields score 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the existence and usability of the Pearmut platform itself. No free parameters, mathematical axioms, or invented entities are mentioned in the abstract.

pith-pipeline@v0.9.0 · 5434 in / 1031 out tokens · 54893 ms · 2026-05-16T17:51:57.366455+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


    व दो क्य ? 1 के के छ पत्र क्य नेहा! सो.पते ? क्य 0 हाम शा जगहा रहा हा ? Bedtime story time! What’s up with T9 key ­ boards? Why do 7 and 9 have 4 let ­ ters, but others 3. Why those two? Why not assign 1 some letters? Is 0 always been space? C: Čas na príbeh spánku! Čo je hore s T9 klávesmi? Prečo robia 7 a 9 mať 4 listy, ale iné 3. Prečo tamtie dva? Preč...