arxiv: 2601.07338 · v2 · submitted 2026-01-12 · 💻 cs.CL


Beyond Literal Mapping: Benchmarking and Improving Non-Literal Translation Evaluation

Yanzhi Tian , Cunxiang Wang , Zeming Liu , Heyan Huang , Wenbo Yu , Dawei Song , Jie Tang , Yuhang Guo


Pith reviewed 2026-05-16 15:27 UTC · model grok-4.3

classification 💻 cs.CL

keywords machine translation evaluation · non-literal translation · LLM agents · meta-evaluation · human correlation · MENT dataset · agentic framework


The pith

RATE agentic framework raises correlation with human judgments on non-literal MT by at least 3.2 points.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first builds MENT, a meta-evaluation set of 7,530 human-annotated non-literal translations drawn from four domains including social networks and literature. It shows that both classic MT metrics and direct LLM judges lose reliability on these cases because of literal bias, knowledge cutoffs, and scoring inconsistency. To fix the gap, the authors introduce RATE, an agentic system whose reflective core agent calls specialized sub-agents to inspect different facets of translation quality. Experiments confirm that RATE lifts combined system- and segment-level correlation with the human labels by at least 3.2 points over prior methods. The same framework also preserves strong performance when tested on ordinary general-domain translations.

Core claim

RATE, centered on a reflective Core Agent that dynamically invokes specialized sub-agents, improves combined system- and segment-level correlation with human judgments on non-literal translations by at least 3.2 points over current methods.
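To make the phrase "combined system- and segment-level correlation" concrete, here is a minimal sketch of how such a meta-evaluation score could be computed. It is an illustration under stated assumptions, not the paper's procedure: the use of Pearson correlation at both levels, the unweighted average in `combined`, and the toy scores are all assumptions, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def segment_level(metric, human):
    """Correlation over every (system, segment) score, flattened."""
    flat_metric = [s for system in metric for s in system]
    flat_human = [s for system in human for s in system]
    return pearson(flat_metric, flat_human)


def system_level(metric, human):
    """Correlation over per-system mean scores."""
    return pearson([mean(s) for s in metric], [mean(s) for s in human])


def combined(metric, human):
    # Assumption: unweighted mean of the two levels; the paper's exact
    # combination rule may differ.
    return (system_level(metric, human) + segment_level(metric, human)) / 2


# Invented toy scores: 3 MT systems x 4 segments each.
human = [[1, 2, 3, 4], [2, 3, 4, 4], [0, 1, 2, 3]]
# A metric that ranks systems correctly but scrambles segments of system 1.
metric_b = [[4, 3, 2, 1], [2, 3, 4, 4], [0, 1, 2, 3]]

score_perfect = combined(human, human)
score_b = combined(metric_b, human)
```

In this toy case `metric_b` keeps the per-system means intact, so its system-level correlation is unaffected while its segment-level correlation drops. That is the kind of gap a combined score is meant to expose.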

What carries the argument

The reflective Core Agent in RATE that orchestrates sub-agents to evaluate non-literal translation quality.
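The orchestration pattern can be sketched as a small reflective loop. This is a hedged illustration only: `Verdict`, `core_agent`, the `max_rounds` cap, and the stub agents are hypothetical names and behaviors chosen for the example, and the Comparison Agent named in the paper's figures is omitted for brevity. Real sub-agents would be LLM calls rather than the toy functions used here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Verdict:
    score: float
    confident: bool
    knowledge_gaps: List[str] = field(default_factory=list)


def core_agent(
    source: str,
    translation: str,
    evaluate: Callable[[str, str, List[str]], Verdict],
    search: Callable[[str], str],
    max_rounds: int = 3,
) -> Tuple[float, List[str]]:
    """Evaluate, fill reported knowledge gaps via search, then re-evaluate."""
    notes: List[str] = []
    verdict = evaluate(source, translation, notes)
    for _ in range(max_rounds):
        if verdict.confident and not verdict.knowledge_gaps:
            break  # nothing left to resolve, accept the current score
        notes.extend(search(term) for term in verdict.knowledge_gaps)
        verdict = evaluate(source, translation, notes)
    return verdict.score, notes


# Stub sub-agents standing in for LLM calls (invented behavior).
def toy_search(term: str) -> str:
    return f"{term}: background note retrieved by the search stub"


def toy_evaluate(source: str, translation: str, notes: List[str]) -> Verdict:
    if not notes:  # no background knowledge yet: low confidence, report a gap
        return Verdict(score=2.0, confident=False, knowledge_gaps=["slang-term"])
    return Verdict(score=4.0, confident=True)


final_score, context_notes = core_agent(
    "source text", "candidate translation", toy_evaluate, toy_search
)
```

The design point is that the loop ends on a confidence signal from the evaluator rather than after a fixed number of calls, so search is only invoked when the evaluator reports a gap.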

If this is right

Traditional automatic metrics systematically undervalue translations that use non-literal expressions.
Direct LLM judges remain limited by knowledge cutoffs and inconsistent scoring on complex language.
RATE delivers measurable gains while retaining effectiveness on standard general-domain MT evaluation.
Better evaluation signals can steer future MT systems toward higher fidelity on social and literary content.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same reflective-agent pattern could be tested on evaluation of figurative language in summarization or dialogue.
MENT supplies a reusable benchmark for any new metric that aims to measure semantic fidelity rather than surface overlap.
If the orchestration proves stable across languages, it could lower the cost of creating reliable MT quality signals without repeated large-scale human labeling.

Load-bearing premise

The human scores collected for MENT are reliable ground-truth labels for non-literal translation quality, and the agentic orchestration generalizes beyond the four domains and MT systems tested.

What would settle it

A fresh human annotation study on non-literal translations from an unseen domain or MT system in which RATE fails to deliver at least a 3.2-point correlation gain would falsify the central claim.

Figures

Figures reproduced from arXiv: 2601.07338 by Cunxiang Wang, Dawei Song, Heyan Huang, Jie Tang, Wenbo Yu, Yanzhi Tian, Yuhang Guo, Zeming Liu.

**Figure 2.** Overview of the data construction pipeline and final dataset visualization.

**Figure 3.** Heatmap of Pearson correlations for system ...

**Figure 4.** Overview of the RATE framework. The Core ...

**Figure 5.** Illustration of metrics performance with spe...

**Figure 6.** Temporal distribution of sub-agent invoking, ...

**Figure 7.** The prompt of LLM preliminary filtering before manual inspection.

**Figure 8.** Human annotation criteria of translation quality. All recruited annotators hold degrees in translation and ...

**Figure 9.** Human annotation steps of translation reference.

**Figure 10.** Sample (SNS domain, Zh-En) from MENT, the annotated data comprises a reference and scores of ...

**Figure 11.** Sample (Cross-Culture domain, En-Zh) from MENT, the annotated data comprises a reference and scores ...

**Figure 12.** Sample (Poetry domain, En-Zh) from MENT, the annotated data comprises a reference and scores of ...

**Figure 13.** Sample (Literature domain, Zh-En) from MENT, the annotated data comprises a reference and scores of ...

**Figure 14.** Prompt of Core Agent (part 1), outlining the evaluation objectives.

**Figure 15.** Prompt of Core Agent (part 2), outlining protocols of sub-agents calling.

**Figure 16.** Prompt of Core Agent (part 3), outlining the evaluation procedure.

**Figure 17.** Prompt of Evaluation Agent.

**Figure 18.** Prompt of Search Agent, including the calling protocol of search engine, and the summarization of ...

**Figure 19.** Prompt of Comparison Agent.

**Figure 20.** Trajectory of RATE, illustrating the invoking of Search Agent to retrieve background knowledge, and ...

**Figure 21.** Trajectory of RATE, illustrating the Evaluation Agent fails to reach high confidence despite specific ...

**Figure 22.** Prompt of GEMBA-MQM, we use GPT-4o as backbone model, and we align it with the original ...

**Figure 23.** Prompt of GEMBA-DA, we use GPT-4o as backbone model, and we align it with the original imple...

**Figure 24.** Prompt of EAPrompt, we use GPT-4o as backbone model, and we align it with the original two stages ...

Original abstract

Large Language Models (LLMs) have significantly advanced Machine Translation (MT), extending it to linguistically complex domains such as Social Network Services and literature. In these scenarios, translations often require handling non-literal expressions, leading to the inaccuracy of MT metrics. To systematically investigate the reliability of MT metrics, we first curate a meta-evaluation dataset focused on non-literal translations, namely MENT. MENT encompasses four non-literal translation domains and features source sentences paired with translations from diverse MT systems, with 7,530 human-annotated scores on translation quality. Experimental results reveal the inaccuracies of traditional MT metrics and the limitations of LLM-as-a-Judge, particularly the knowledge cutoff and score inconsistency problems. To mitigate these limitations, we propose RATE, a novel agentic translation evaluation framework, centered on a reflective Core Agent that dynamically invokes specialized sub-agents. Experimental results indicate the efficacy of RATE, achieving an improvement of at least 3.2 points in combined system- and segment-level correlation with human judgments compared with current methods. Further experiments demonstrate the robustness of RATE to general-domain MT evaluation. Code and dataset are available at: https://github.com/BITHLP/RATE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Desk Editor's Note (private letter to a colleague)

MENT dataset and RATE agentic framework tackle non-literal MT evaluation with some gains, but missing IAA on the human scores undercuts the main claim.


The paper's real contribution is the MENT dataset of 7,530 human-annotated non-literal translations across literature, social media, and two other domains, plus the RATE framework that uses a reflective core agent to call specialized sub-agents for evaluation. It shows standard metrics and plain LLM judges lose correlation on idioms and metaphors, which matches what people in the field already suspect. Releasing the code and data is helpful for anyone who wants to test this themselves. The 3.2-point combined correlation lift is the headline number, and the robustness checks on general-domain MT are a reasonable extra step. The agentic design makes sense as a way to reduce knowledge cutoff and score inconsistency without retraining.

That said, the human annotations lack any reported inter-annotator agreement, which is a problem when the judgments are subjective by nature. Non-literal quality is not like BLEU on news text; different annotators can reasonably disagree on how well a metaphor landed. Without kappa or alpha numbers, the ground truth itself could be noisy enough to inflate or deflate the reported gains. The abstract also skips error bars, full ablation tables, and exact prompting details, so the 3.2-point figure is hard to assess from the summary alone.

This work is mainly for the MT metrics subgroup that already cares about evaluation on creative or informal text. A reader building or benchmarking new metrics would get value from the dataset and the baseline comparisons. It is not reshaping the broader field, but the artifacts are new and the problem is real. I would send it to peer review so the methods section and annotation protocol can be checked properly rather than desk-rejecting it.

Referee Report

2 major / 2 minor

Summary. The paper curates MENT, a meta-evaluation dataset of non-literal translations across four domains (including literature and SNS) with 7,530 human-annotated quality scores. It demonstrates limitations of traditional MT metrics and LLM-as-a-Judge methods (knowledge cutoff and inconsistency). The authors introduce RATE, an agentic evaluation framework built around a reflective core agent that dynamically invokes specialized sub-agents, and report that RATE improves combined system- and segment-level correlation with human judgments by at least 3.2 points over current methods. The work also includes robustness checks on general-domain MT and releases code and data.

Significance. If the reported correlation gains hold under reliable ground truth, the paper would make a useful contribution to MT evaluation by providing a targeted benchmark for non-literal cases and an agentic method that mitigates LLM limitations. The release of MENT and RATE code supports reproducibility and further research on evaluation in complex linguistic domains.

major comments (2)

[MENT curation and annotation (abstract and §3)] No inter-annotator agreement (kappa, alpha, or similar) is reported for the 7,530 human scores. Non-literal quality judgments are inherently subjective; without IAA or consistency checks the ground-truth reliability is unclear, directly undermining the central claim of a 3.2-point correlation improvement for RATE.
[RATE framework (abstract and §4)] The description of the reflective core agent and sub-agent orchestration lacks concrete details on prompts, decision logic, invocation criteria, or ablation studies. Without these, it is impossible to verify whether the reported gains stem from the agentic design or from other factors such as prompt engineering or model choice.
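As a concrete picture of the agreement statistic the first comment asks for, the sketch below computes unweighted Cohen's kappa for two annotators. It is a simpler stand-in chosen for illustration, not Krippendorff's alpha or Fleiss' kappa, and the scores are invented toy data, not MENT annotations.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two annotators' labels on the same items."""
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented quality scores for eight segments.
annotator_1 = [0, 1, 2, 3, 4, 4, 2, 1]
annotator_2 = [0, 1, 2, 3, 4, 3, 2, 2]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 items (0.75 observed) against 13/64 expected by chance, giving kappa = 35/51, roughly 0.69. For ordinal quality scores a weighted variant would usually be more appropriate, since disagreeing by one point is less severe than disagreeing by four.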

minor comments (2)

[Abstract] The phrase 'at least 3.2 points' should specify the exact baselines, the precise combined correlation metric, and whether the gain is statistically significant.
[Abstract] The paper states robustness to general-domain MT but provides no table or quantitative comparison in the abstract; a brief summary of those results would improve clarity.
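One common way to check whether a correlation gain is statistically meaningful is paired bootstrap resampling over segments. The sketch below is an assumption-laden illustration of that idea, not a test the paper reports: the function names, the resample count, the fixed seed, and the toy scores are all hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either list has zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom if denom else 0.0


def paired_bootstrap(human, metric_a, metric_b, n_resamples=1000, seed=0):
    """Share of resamples where metric_a correlates better with human than metric_b."""
    rng = random.Random(seed)
    n = len(human)
    wins = 0
    for _ in range(n_resamples):
        # Resample segment indices with replacement, same indices for both metrics.
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [metric_a[i] for i in idx]
        b = [metric_b[i] for i in idx]
        if pearson(a, h) > pearson(b, h):
            wins += 1
    return wins / n_resamples


# Invented toy scores for 40 segments.
human = [i % 5 for i in range(40)]
metric_a = list(human)                     # tracks the human scores exactly
metric_b = [(i * 3) % 5 for i in range(40)]  # only loosely related to them

win_rate = paired_bootstrap(human, metric_a, metric_b)
```

A win rate close to 1.0 suggests the advantage of `metric_a` is stable under resampling; a value near 0.5 would mean the two metrics are not distinguishable on this sample.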

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of transparency and reliability that we will address in the revision. We respond to each major comment below.


Referee: [MENT curation and annotation (abstract and §3)] No inter-annotator agreement (kappa, alpha, or similar) is reported for the 7,530 human scores. Non-literal quality judgments are inherently subjective; without IAA or consistency checks the ground-truth reliability is unclear, directly undermining the central claim of a 3.2-point correlation improvement for RATE.

Authors: We agree that inter-annotator agreement is essential to establish the reliability of subjective quality judgments in non-literal translation evaluation. The original submission did not include these statistics. In the revised manuscript we will add IAA metrics (Krippendorff’s alpha and Fleiss’ kappa) computed over the multiple annotations collected for MENT, together with a brief description of the annotation guidelines and consistency checks. These additions will directly support the validity of the reported correlation improvements.
Revision: yes

Referee: [RATE framework (abstract and §4)] The description of the reflective core agent and sub-agent orchestration lacks concrete details on prompts, decision logic, invocation criteria, or ablation studies. Without these, it is impossible to verify whether the reported gains stem from the agentic design or from other factors such as prompt engineering or model choice.

Authors: We acknowledge that greater detail is required to allow readers to attribute the observed gains specifically to the agentic architecture. In the revised Section 4 we will provide the full prompts for the core agent and each sub-agent, the exact decision logic and invocation criteria used by the reflective core, and a set of ablation studies that isolate the contribution of dynamic sub-agent orchestration from static prompting and model choice. These additions will make the source of the performance improvements verifiable.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured against independent human annotations


The paper curates an external dataset MENT with 7,530 human-annotated quality scores and reports RATE's correlation improvements (at least 3.2 points) against those scores. No equations, fitted parameters, or self-citations are shown to reduce the reported gains to a definition or prior result by construction. The derivation chain consists of dataset creation followed by empirical benchmarking; the central claim remains falsifiable against the released annotations and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that human quality judgments are reliable ground truth and on the empirical performance of the agentic system; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Human-annotated scores collected for MENT accurately reflect translation quality for non-literal expressions.
All reported correlations are computed against these 7,530 annotations.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SiNFluD: Creating and Evaluating Figurative Language Dataset for Sindhi
cs.CL 2026-05 unverdicted novelty 6.0

SiNFluD is a new benchmark dataset for Sindhi figurative language classification, annotated by native speakers with 0.81 IAA and evaluated using transformer models where XLM-RoBERTa-XL performs best.
SiNFluD: Creating and Evaluating Figurative Language Dataset for Sindhi
cs.CL 2026-05 unverdicted novelty 6.0

SiNFluD is a novel benchmark dataset for Sindhi figurative language classification with inter-annotator agreement of 0.81 and baseline results where XLM-RoBERTa-XL performs best among tested models.
