arxiv: 2508.17458 · v2 · submitted 2025-08-24 · 💻 cs.CL

Evaluating the Impact of Verbal Multiword Expressions on Machine Translation

Pith reviewed 2026-05-18 20:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords verbal multiword expressions · machine translation · VMWEs · translation quality · idioms · light verb constructions · verb-particle constructions

The pith

Verbal multiword expressions degrade machine translation quality, and the degradation stems primarily from the expressions themselves.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how verbal multiword expressions affect machine translation from English into other languages. It shows that sentences containing verbal idioms, verb-particle constructions, and light verb constructions produce lower-quality output than sentences without them. The drop occurs even after accounting for general sentence features, pointing to the expressions as the direct source of the problem. This finding helps explain a persistent weakness in current translation systems when they encounter everyday idiomatic language.

Core claim

Our experimental results consistently show that VMWEs negatively affect translation quality, with deeper analysis indicating that this degradation is primarily attributable to the VMWE itself rather than general sentence-level difficulty.

What carries the argument

Side-by-side evaluation of translation quality on VMWE-containing sentences versus matched controls drawn from established multiword expression and machine translation datasets.

If this is right

  • Translation systems produce measurably worse output on sentences that contain verbal idioms, verb-particle constructions, or light verb constructions.
  • The quality loss traces to the multiword expression rather than broader sentence properties.
  • The released evaluation framework lets researchers test whether new models handle VMWEs more effectively.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Better modeling of VMWEs could improve translation of natural spoken and written language.
  • The same isolation method could be applied to non-verbal multiword expressions or to other language pairs.
  • Fine-tuning translation models on VMWE-rich data offers a direct test of whether the observed gap can be closed.

Load-bearing premise

The chosen datasets and evaluation metrics successfully isolate the effect of VMWEs from other sources of translation difficulty such as sentence length, vocabulary rarity, or syntactic complexity.

What would settle it

A matched-pair analysis showing no quality difference between VMWE sentences and controls of equal length and complexity would contradict the claim that the expressions themselves drive the degradation.
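
A matched-pair check of this kind could look roughly like the sketch below. It is an illustration, not the paper's procedure: the greedy length matching, the `max_len_gap` tolerance, the function names, and the toy scores are all assumptions, and the scores are treated as higher-is-better.

```python
from statistics import mean

def match_pairs(vmwe, controls, max_len_gap=2):
    """Greedily pair each VMWE sentence with the unused control closest in length.

    Each item is a (token_count, quality_score) tuple. A VMWE sentence whose
    closest remaining control differs by more than max_len_gap tokens is
    left unpaired rather than forced into a poor match.
    """
    unused = list(controls)
    pairs = []
    for length, score in sorted(vmwe):
        if not unused:
            break
        best = min(unused, key=lambda c: abs(c[0] - length))
        if abs(best[0] - length) <= max_len_gap:
            unused.remove(best)
            pairs.append(((length, score), best))
    return pairs

def mean_paired_gap(pairs):
    """Mean of (control score - VMWE score); positive means VMWE sentences score lower."""
    return mean(ctrl[1] - v[1] for v, ctrl in pairs)

# Hypothetical toy data: (token_count, higher-is-better quality score).
vmwe_items = [(10, 0.70), (15, 0.65), (22, 0.60)]
control_items = [(11, 0.80), (14, 0.78), (23, 0.75), (40, 0.50)]

pairs = match_pairs(vmwe_items, control_items)
gap = mean_paired_gap(pairs)
```

A gap near zero on well-matched pairs would count against the claim; a persistent positive gap would support it. A fuller version would also match on lexical rarity and syntactic complexity and attach a significance test.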

Figures

Figures reproduced from arXiv: 2508.17458 by Linfeng Liu, Saptarshi Ghosh, Tianyu Jiang.

Figure 1. Comparison of the translation quality between sentences with and without VMWE,
Figure 2. Comparison of the translation quality between sentences with and without VMWE,
Figure 3. Paraphrasing structure. Ori: QE score between the original sentence and its direct translation. Para: QE score between paraphrased sentence and its direct translation. Mix: QE score between the original sentence and the translation of the paraphrased sentence.
  Table fragment (truncated): MT System; en-zh, en-de, en-ru, en-cs, each with columns Ori, δmix, δpara. VID, Madlad: en-zh 7.25, +1.77, +2.21; en-de 3.00, +0.24, +0.92; en-ru 6.49, +0.8…
Figure 4. Ranking of MT systems based on the MetricX-24 QE scores.
Figure 5. Screenshot from Google Translate webpage, an English sentence with an idiom
Figure 6. Screenshot from Google Translate webpage, an English sentence with an idiom “
Figure 7. Screenshot from Google Translate webpage, an English sentence with an idiom “
Figure 8. Screenshot from Google Translate webpage, an English sentence with an idiom “
Figure 9. Screenshot from Google Translate webpage, an English sentence with an idiom
Figure 10. Screenshot from Google Translate webpage, an English sentence with an idiom
Figure 11. Screenshot from Google Translate webpage, an English sentence with an idiom
Figure 12. Ranking MT systems on xCOMET QE scores for VMWE sentences.
Figure 13. Ranking languages based on translation quality for sentences with VMWE.
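
The three scores named in the Figure 3 caption (Ori, Para, Mix) and the δmix and δpara columns that accompany them can be related with a small sketch. The caption does not state how the deltas are computed, so the plain differences against Ori used here are an assumption, as are the class name, the function name, and the example values.

```python
from dataclasses import dataclass

@dataclass
class ParaphraseScores:
    ori: float   # QE(original sentence, its direct translation)
    para: float  # QE(paraphrased sentence, its direct translation)
    mix: float   # QE(original sentence, translation of the paraphrased sentence)

def deltas(s: ParaphraseScores) -> dict:
    """Differences against Ori.

    Assumption: delta = score - Ori. The paper's exact definition and sign
    convention are not given in the caption.
    """
    return {"delta_mix": s.mix - s.ori, "delta_para": s.para - s.ori}

# Hypothetical values chosen for illustration only.
example = ParaphraseScores(ori=5.0, para=6.5, mix=6.0)
result = deltas(example)
```

Whether a positive delta means better or worse translation depends on the metric's direction, which should be checked against the paper before interpreting real numbers.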
Original abstract

Verbal multiword expressions (VMWEs) remain difficult for machine translation because their meanings are often not recoverable from their component words. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and standard machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality, with deeper analysis indicating that this degradation is primarily attributable to the VMWE itself rather than general sentence-level difficulty. We release our code and evaluation framework to test new MT systems for the community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates the impact of verbal multiword expressions (VMWEs) — specifically verbal idioms, verb-particle constructions, and light verb constructions — on machine translation quality from English to multiple languages. Using established MWE and MT datasets along with standard MT systems, the authors report that VMWEs consistently degrade translation quality. Through deeper analysis, they conclude that this effect is primarily attributable to the VMWEs themselves rather than general sentence-level difficulty. The code and evaluation framework are released for community use.

Significance. This study addresses a practical challenge in machine translation regarding the handling of non-compositional expressions. If the attribution of quality degradation specifically to VMWEs is supported by adequate controls, the findings could inform the development of MT systems better equipped to handle idiomatic language. The release of code and framework strengthens the work by enabling reproducibility and further testing on new systems.

major comments (1)
  1. [Methods] The central claim that degradation is 'primarily attributable to the VMWE itself rather than general sentence-level difficulty' is load-bearing and requires explicit controls. The manuscript mentions an 'attempt to control for sentence-level difficulty' but provides no details on matching, stratification, or multivariate regression for confounders such as sentence length, lexical rarity, or syntactic complexity (Methods section). Without these, systematic differences between VMWE and non-VMWE items could explain the quality drop.
minor comments (1)
  1. [Abstract] The abstract would benefit from naming the specific target languages and MT systems evaluated to allow readers to assess the scope of the reported consistency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the presentation of our methods.

Point-by-point responses
  1. Referee: [Methods] The central claim that degradation is 'primarily attributable to the VMWE itself rather than general sentence-level difficulty' is load-bearing and requires explicit controls. The manuscript mentions an 'attempt to control for sentence-level difficulty' but provides no details on matching, stratification, or multivariate regression for confounders such as sentence length, lexical rarity, or syntactic complexity (Methods section). Without these, systematic differences between VMWE and non-VMWE items could explain the quality drop.

    Authors: We agree that the central claim requires robust controls and that the Methods section would benefit from greater transparency. In the submitted manuscript we referenced an attempt to control for sentence-level difficulty through selection of comparable non-VMWE sentences from the same datasets, but we acknowledge that explicit details on matching procedures, stratification by length or complexity, or regression-based adjustment for lexical rarity were not provided. In the revised version we will expand the Methods section to describe these controls in full, including the specific criteria used for sentence matching and any additional statistical checks performed to isolate VMWE effects.

    revision: yes
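
One of the controls named in this exchange, stratification by sentence length, can be sketched as follows. This is a minimal illustration under stated assumptions: the bucket width, the function name, and the toy records are hypothetical, scores are treated as higher-is-better, and nothing here reflects the authors' actual procedure.

```python
from collections import defaultdict
from statistics import mean

def stratified_gap(items, bucket_width=10):
    """Size-weighted mean of within-stratum gaps (control mean - VMWE mean).

    items: iterable of (token_count, has_vmwe, score) records.
    Strata that lack either group are skipped, since no within-stratum
    comparison is possible there.
    """
    strata = defaultdict(lambda: {True: [], False: []})
    for length, has_vmwe, score in items:
        strata[length // bucket_width][has_vmwe].append(score)

    total, weight = 0.0, 0
    for groups in strata.values():
        if groups[True] and groups[False]:
            n = len(groups[True]) + len(groups[False])
            total += n * (mean(groups[False]) - mean(groups[True]))
            weight += n
    return total / weight if weight else None

# Hypothetical toy records: (token_count, has_vmwe, quality score).
records = [
    (5, True, 0.6), (7, False, 0.8), (8, False, 0.8),
    (15, True, 0.5), (17, False, 0.6),
    (35, True, 0.4),  # no control in this length bucket, so it is skipped
]
adjusted_gap = stratified_gap(records)
```

Comparing this adjusted gap with the raw, unstratified gap gives a rough sense of how much of the difference sentence length alone accounts for. Lexical rarity and syntactic complexity would need their own strata or a regression model.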

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation study

Full rationale

This paper performs an empirical evaluation of VMWE effects on MT quality using established external datasets and off-the-shelf translation systems. The central claim rests on experimental comparisons and deeper analysis of translation metrics, without any derivations, equations, fitted parameters, or predictions that reduce to author-defined inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing justifications for the results. The study is self-contained against independent benchmarks and standard evaluation practices.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that standard MT evaluation metrics and existing VMWE datasets can isolate expression-specific effects. No free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Established multiword expression datasets and standard machine translation benchmarks provide representative samples for measuring VMWE impact.
    The study relies on these resources to draw conclusions about translation quality.
  • domain assumption Translation quality metrics reflect the specific contribution of VMWEs when sentence-level difficulty is controlled.
    This underpins the claim that degradation is attributable to the VMWE itself.

pith-pipeline@v0.9.0 · 5644 in / 1281 out tokens · 34449 ms · 2026-05-18T20:44:58.481526+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 3 internal anchors

  1. [1]

    Mathieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu

    URL https://arxiv.org/abs/2312.05187. Mathieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. Multiword expression processing: A survey. Computational Linguistics, 43(4):837–892, December 2017. doi: 10.1162/COLI_a_00302. URL https://aclanthology.org/J17-4005/. Silvio Ricardo ...

  2. [2]

    URL https://aclanthology.org/W19-6110/

    Linköping University Electronic Press. URL https://aclanthology.org/W19-6110/. Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang. Multilingual machine translation with open large language models at practical scale: An empirical study, 2025. URL https://arxiv.org/abs/2502.02481. DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms...

  3. [3]

    Marian: Fast Neural Machine Translation in C++

    doi: 10.1162/tacl_a_00683. URL https://aclanthology.org/2024.tacl-1.54/. Hessel Haagsma, Johan Bos, and Malvina Nissim. MAGPIE: A large corpus of potentially idiomatic expressions. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hé...

  4. [4]

    doi: 10.18653/v1/2023.wmt-1.1

    Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.1. URL https://aclanthology.org/2023.wmt-1.1/. Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, et al. Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet. In Barry Haddow, Tom Kocmi, Philipp Koeh...

  5. [5]

    LLM tropes: Revealing fine-grained values and opinions in large language models

    Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp

  6. [6]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    URL https://aclanthology.org/2024.findings-emnlp.631. Shushen Manakhimova, Eleftherios Avramidis, Vivien Macketanz, Ekaterina Lapshinova-Koltunski, Sergei Bagdasarov, and Sebastian Möller. Linguistically motivated evaluation of the 2023 state-of-the-art machine translation: Can ChatGPT outperform NMT? In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christ...

  7. [7]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    URL https://aclanthology.org/2020.coling-main.296/. Anita Rácz, István Nagy T., and Veronika Vincze. 4FX: Light verb constructions in a multilingual parallel corpus. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, et al. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evalua...

  8. [8]

    Huacheng Song and Hongzhi Xu

    URL https://arxiv.org/abs/2006.09479. Huacheng Song and Hongzhi Xu. A deep analysis of the impact of multiword expressions and named entities on Chinese-English machine translations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 6154–6165, Miami, Florida, USA, Novemb...

  9. [9]

    arXiv preprint arXiv:2010.11934

    Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.42/. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer, 2021. URL https://arxiv.org/abs/2010.11934. Andrea Zaninello and Alexandra Birch...

  10. [10]

    Madlad400 (Kudugunta et al., 2023): A Google multilingual machine translation model, based on the T5 architecture (Raffel et al., 2023) that was trained on 250 billion tokens covering over 450 languages

  11. [11]

    SeamlessM4T (Communication et al., 2023): Meta AI’s massively multilingual and multimodal machine translation model, supporting an impressive range of translation capabilities with 96 languages for text input/output

  12. [12]

    M2M100 (Fan et al., 2020): A Meta-powered multilingual encoder-decoder model, primarily designed for translation tasks, supporting direct translation between 100 languages without requiring English as an intermediate language

  13. [13]

    Opus-MT (Tiedemann & Thottingal, 2020): Provides open translation services built on the Marian neural machine translation framework (Junczys-Dowmunt et al., 2018), trained on Opus data, and later converted to PyTorch models for the Hugging Face ecosystem

  14. [14]

    LLaMAX3 Alpaca (Lu et al., 2024): An LLM-based machine translation model, LLaMAX combines powerful multilingual translation capabilities with instruction-following abilities. This model extends Meta’s LLaMA 3 architecture (Grattafiori et al., 2024) to support translation between over 100 languages without sacrificing its ability to follow complex instructions

  15. [15]

    Phi-4-multimodal (Microsoft et al., 2025): Built upon the pretrained Phi-4-mini model, Phi-4-multi can process text, image, and audio inputs to generate text outputs. While primarily known for its multimodal capabilities, it can handle translation tasks as part of its broader language understanding abilities, making it another LLM-based MT system for our task

  16. [16]

    GemmaX2 (Cui et al., 2025): A very recent multilingual LLM-based translation model that achieved state-of-the-art performance across 28 languages. Based on Google’s Gemma2 architecture (Team et al., 2024), it consistently outperforms other LLM-based MT models like TowerInstruct and XALMA, achieving competitive results with Google Translate and GPT-4-turbo

  17. [17]

    Once you’re retired you have three months to get out

    Google Translate API: A widely used multilingual neural machine translation service developed by Google, offering translations for 249 languages and language varieties as of March 2025.

    B Invalid Translations by LLM-Based MT Models

    During evaluation, we observed several instances of invalid outputs from LLM-based MT models. The most common cases where tr...

  18. [18]

    Is the second element a particle (e.g., ’up’, ’off’)? - No → Not VPC (A) - Yes → Continue

  19. [19]

    Does the remaining verb convey the same meaning as the full verb-particle phrase? - Yes → Not VPC (B) - No → Continue

    Remove the particle from the combination. Does the remaining verb convey the same meaning as the full verb-particle phrase? - Yes → Not VPC (B) - No → Continue

  20. [20]

    D: Valid VPC (Particle significantly alters meaning) Answer with reasoning and ’Final Answer: [answer]

    Does the inclusion of the particle create a non-compositional meaning that is significantly different from the verb’s original meaning? - No → Not VPC (C) - Yes → VPC (D) A: Not a particle B: Meaning remains similar without the particle C: Particle does not significantly alter the meaning "D: Valid VPC (Particle significantly alters meaning) Answer with r...

  21. [21]

    [CRAN] Contains cranberry word? Yes → VID No → Next test

  22. [22]

    [LEX] Regular replacement changes meaning? Yes → VID No → Next test

  23. [23]

    [MORPH] Morphological changes affect meaning? Yes → VID No → Next test

  24. [24]

    [MORPHSYNT] Morphosyntactic changes affect meaning? Yes → VID No → Next test

  25. [25]

    [SYNT] Syntactic changes affect meaning? Yes → VID No → Not VID Examples: - VID: ’kick the bucket’, ’let the cat out of the bag’ - Non-VID: ’take a walk’, ’make a decision’ Instructions:

  26. [26]

    Analyze each test sequentially

  27. [27]

    Provide brief reasoning for each test

  28. [28]

    gave a" is a light-verb construction (LVC), where

    Conclude with ’Final Answer: [Yes/No]’ Is this candidate a Verbal Idiom (VID)? Apply the decision tree.

    Table 9: Prompts for VMWE candidate paraphrasing. VMWE Prompt LVC You are an expert in linguistics. Given a sentence containing a multi-word expression (VMWE), a Light Verb Construct (LVC). Your task is to rephrase the sentence to remove the VMWE w...