Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
Pith reviewed 2026-05-18 20:44 UTC · model grok-4.3
The pith
Verbal multiword expressions degrade machine translation quality primarily due to the expressions themselves.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our experimental results consistently show that VMWEs negatively affect translation quality, with deeper analysis indicating that this degradation is primarily attributable to the VMWE itself rather than general sentence-level difficulty.
What carries the argument
Side-by-side evaluation of translation quality on VMWE-containing sentences versus matched controls drawn from established multiword expression and machine translation datasets.
If this is right
- Translation systems produce measurably worse output on sentences that contain verbal idioms, verb-particle constructions, or light verb constructions.
- The quality loss traces to the multiword expression rather than broader sentence properties.
- The released evaluation framework lets researchers test whether new models handle VMWEs more effectively.
Where Pith is reading between the lines
- Better modeling of VMWEs could improve translation of natural spoken and written language.
- The same isolation method could be applied to non-verbal multiword expressions or to other language pairs.
- Fine-tuning translation models on VMWE-rich data offers a direct test of whether the observed gap can be closed.
Load-bearing premise
The chosen datasets and evaluation metrics successfully isolate the effect of VMWEs from other sources of translation difficulty such as sentence length, vocabulary rarity, or syntactic complexity.
What would settle it
A matched-pair analysis showing no quality difference between VMWE sentences and controls of equal length and complexity would contradict the claim that the expressions themselves drive the degradation.
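A matched-pair check of this kind can be sketched in a few lines. The example below is illustrative only and is not the paper's released framework: the `(sentence_length, quality_score)` tuples, the function names, and the toy numbers are all assumptions. It pairs each VMWE sentence with an unused control of similar length, then runs a sign-flip permutation test on the paired score differences.

```python
import random
from statistics import mean


def match_by_length(vmwe_items, control_items, max_len_diff=1):
    """Greedily pair each VMWE sentence with an unused control of similar length.

    Items are (sentence_length, quality_score) tuples; both fields are
    hypothetical stand-ins for whatever a real evaluation records.
    """
    unused = sorted(control_items)
    pairs = []
    for length, score in sorted(vmwe_items):
        # Pick the closest-length control that has not been used yet.
        best = min(unused, key=lambda c: abs(c[0] - length), default=None)
        if best is not None and abs(best[0] - length) <= max_len_diff:
            unused.remove(best)
            pairs.append((score, best[1]))
    return pairs


def sign_flip_p_value(diffs, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on the mean difference."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Toy data (invented for illustration, not results from the paper).
vmwe = [(8, 0.61), (12, 0.58), (15, 0.55), (20, 0.52)]
controls = [(8, 0.74), (12, 0.70), (16, 0.69), (20, 0.66), (30, 0.50)]
pairs = match_by_length(vmwe, controls)
diffs = [v - c for v, c in pairs]
print(len(pairs), mean(diffs), sign_flip_p_value(diffs))
```

Length is only one of the confounders named above. A fuller version would also match on vocabulary rarity and syntactic complexity, and a mean difference near zero with a large p-value is the outcome that would count against the paper's claim.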
Original abstract
Verbal multiword expressions (VMWEs) remain difficult for machine translation because their meanings are often not recoverable from their component words. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and standard machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality, with deeper analysis indicating that this degradation is primarily attributable to the VMWE itself rather than general sentence-level difficulty. We release our code and evaluation framework to test new MT systems for the community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the impact of verbal multiword expressions (VMWEs) — specifically verbal idioms, verb-particle constructions, and light verb constructions — on machine translation quality from English to multiple languages. Using established MWE and MT datasets along with standard MT systems, the authors report that VMWEs consistently degrade translation quality. Through deeper analysis, they conclude that this effect is primarily attributable to the VMWEs themselves rather than general sentence-level difficulty. The code and evaluation framework are released for community use.
Significance. This study addresses a practical challenge in machine translation regarding the handling of non-compositional expressions. If the attribution of quality degradation specifically to VMWEs is supported by adequate controls, the findings could inform the development of MT systems better equipped to handle idiomatic language. The release of code and framework strengthens the work by enabling reproducibility and further testing on new systems.
major comments (1)
- [Methods] The central claim that degradation is 'primarily attributable to the VMWE itself rather than general sentence-level difficulty' is load-bearing and requires explicit controls. The manuscript mentions an 'attempt to control for sentence-level difficulty' but provides no details on matching, stratification, or multivariate regression for confounders such as sentence length, lexical rarity, or syntactic complexity (Methods section). Without these, systematic differences between VMWE and non-VMWE items could explain the quality drop.
minor comments (1)
- [Abstract] The abstract would benefit from naming the specific target languages and MT systems evaluated to allow readers to assess the scope of the reported consistency.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive feedback on our manuscript. We address the major comment below and will incorporate revisions to strengthen the presentation of our methods.
-
Referee: [Methods] The central claim that degradation is 'primarily attributable to the VMWE itself rather than general sentence-level difficulty' is load-bearing and requires explicit controls. The manuscript mentions an 'attempt to control for sentence-level difficulty' but provides no details on matching, stratification, or multivariate regression for confounders such as sentence length, lexical rarity, or syntactic complexity (Methods section). Without these, systematic differences between VMWE and non-VMWE items could explain the quality drop.
Authors: We agree that the central claim requires robust controls and that the Methods section would benefit from greater transparency. In the submitted manuscript we referenced an attempt to control for sentence-level difficulty through selection of comparable non-VMWE sentences from the same datasets, but we acknowledge that explicit details on matching procedures, stratification by length or complexity, or regression-based adjustment for lexical rarity were not provided. In the revised version we will expand the Methods section to describe these controls in full, including the specific criteria used for sentence matching and any additional statistical checks performed to isolate VMWE effects.
revision: yes
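The regression-based adjustment discussed in this exchange can be illustrated with a small least-squares fit. This is a sketch under stated assumptions, not the authors' procedure: the predictors (`has_vmwe`, `length`, `rarity`), the `ols` helper, and the toy rows are hypothetical. The coefficient on `has_vmwe` estimates the quality change associated with a VMWE once length and rarity are held fixed.

```python
def ols(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Each row is a sequence of predictors; an intercept column is prepended.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical rows: (has_vmwe, sentence_length, lexical_rarity).
rows = [(1, 8, 0.2), (0, 8, 0.3), (1, 15, 0.5),
        (0, 14, 0.1), (1, 22, 0.4), (0, 25, 0.6)]
# Invented quality scores, for illustration only.
scores = [0.70, 0.80, 0.62, 0.75, 0.55, 0.60]
intercept, b_vmwe, b_length, b_rarity = ols(rows, scores)
print(b_vmwe)  # adjusted association between VMWE presence and quality
```

A real analysis would more likely use an established statistics package and report confidence intervals. The hand-rolled solver here only keeps the example dependency-free.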
Circularity Check
No significant circularity in empirical evaluation study
This paper performs an empirical evaluation of VMWE effects on MT quality using established external datasets and off-the-shelf translation systems. The central claim rests on experimental comparisons and deeper analysis of translation metrics, without any derivations, equations, fitted parameters, or predictions that reduce to author-defined inputs by construction. No self-citation chains or ansatzes are invoked as load-bearing justifications for the results. The study is self-contained against independent benchmarks and standard evaluation practices.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Established multiword expression datasets and standard machine translation benchmarks provide representative samples for measuring VMWE impact.
- domain assumption: Translation quality metrics reflect the specific contribution of VMWEs when sentence-level difficulty is controlled.
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2312.05187. Mathieu Constant, Gülşen Eryiğit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. Survey: Multiword expression processing: A Survey. Computational Linguistics, 43(4):837–892, December 2017. doi: 10.1162/COLI_a_00302. URL https://aclanthology.org/J17-4005/. Silvio Ricardo ...
-
[2]
URL https://aclanthology.org/W19-6110/
Linköping University Electronic Press. URL https://aclanthology.org/W19-6110/. Menglong Cui, Pengzhi Gao, Wei Liu, Jian Luan, and Bin Wang. Multilingual machine translation with open large language models at practical scale: An empirical study, 2025. URL https://arxiv.org/abs/2502.02481. DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms...
-
[3]
Marian: Fast Neural Machine Translation in C++
doi: 10.1162/tacl_a_00683. URL https://aclanthology.org/2024.tacl-1.54/. Hessel Haagsma, Johan Bos, and Malvina Nissim. MAGPIE: A large corpus of potentially idiomatic expressions. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hé...
-
[4]
Association for Computational Linguistics. doi: 10.18653/v1/2023.wmt-1.1. URL https://aclanthology.org/2023.wmt-1.1/. Tom Kocmi, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, et al. Findings of the WMT24 general machine translation shared task: The LLM era is here but MT is not solved yet. In Barry Haddow, Tom Kocmi, Philipp Koeh...
-
[5]
LLM tropes: Revealing fine-grained values and opinions in large language models
Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-emnlp
-
[6]
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
URL https://aclanthology.org/2024.findings-emnlp.631. Shushen Manakhimova, Eleftherios Avramidis, Vivien Macketanz, Ekaterina Lapshinova-Koltunski, Sergei Bagdasarov, and Sebastian Möller. Linguistically motivated evaluation of the 2023 state-of-the-art machine translation: Can ChatGPT outperform NMT? In Philipp Koehn, Barry Haddow, Tom Kocmi, and Christ...
-
[7]
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
URL https://aclanthology.org/2020.coling-main.296/. Anita Rácz, István Nagy T., and Veronika Vincze. 4FX: Light verb constructions in a multilingual parallel corpus. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, et al. (eds.), Proceedings of the Ninth International Conference on Language Resources and Evalua...
-
[8]
URL https://arxiv.org/abs/2006.09479. Huacheng Song and Hongzhi Xu. A deep analysis of the impact of multiword expressions and named entities on Chinese-English machine translations. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 6154–6165, Miami, Florida, USA, Novemb...
-
[9]
arXiv preprint arXiv:2010.11934
Association for Computational Linguistics. URL https://aclanthology.org/2022.wmt-1.42/. Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. mt5: A massively multilingual pre-trained text-to-text transformer, 2021. URL https://arxiv.org/abs/2010.11934. Andrea Zaninello and Alexandra Birch...
-
[10]
Madlad400 (Kudugunta et al., 2023): A multilingual machine translation model from Google, based on the T5 architecture (Raffel et al., 2023) and trained on 250 billion tokens covering over 450 languages
-
[11]
SeamlessM4T (Communication et al., 2023): Meta AI’s massively multilingual and multimodal machine translation model, supporting text input and output in 96 languages
-
[12]
M2M100 (Fan et al., 2020): A multilingual encoder-decoder model from Meta, designed primarily for translation, that translates directly between 100 languages without using English as an intermediate language
-
[13]
Opus-MT (Tiedemann & Thottingal, 2020): Provides open translation services built on the Marian neural machine translation framework (Junczys-Dowmunt et al., 2018), trained on Opus data, and later converted to PyTorch models for the Hugging Face ecosystem
-
[14]
LLaMAX3 Alpaca (Lu et al., 2024): An LLM-based machine translation model, LLaMAX combines powerful multilingual translation capabilities with instruction-following abilities. This model extends Meta’s LLaMA 3 architecture (Grattafiori et al., 2024) to support translation between over 100 languages without sacrificing its ability to follow complex instructions
-
[15]
Phi-4-multimodal (Microsoft et al., 2025): Built upon the pretrained Phi-4-mini model, Phi-4-multimodal can process text, image, and audio inputs to generate text outputs. While primarily known for its multimodal capabilities, it can handle translation tasks as part of its broader language understanding abilities, making it another LLM-based MT system for our task
-
[16]
GemmaX2 (Cui et al., 2025): A recent multilingual LLM-based translation model that achieved state-of-the-art performance across 28 languages. Based on Google’s Gemma2 architecture (Team et al., 2024), it consistently outperforms other LLM-based MT models such as TowerInstruct and XALMA, achieving results competitive with Google Translate and GPT-4-turbo
-
[17]
Once you’re retired you have three months to get out
Google Translate API: A widely used multilingual neural machine translation service developed by Google, offering translations for 249 languages and language varieties as of March 2025.
B Invalid Translations by LLM-Based MT Models
During evaluation, we observed several instances of invalid outputs from LLM-based MT models. The most common cases where tr...
-
[18]
Is the second element a particle (e.g., ’up’, ’off’)? - No → Not VPC (A) - Yes → Continue
-
[19]
Remove the particle from the combination. Does the remaining verb convey the same meaning as the full verb-particle phrase? - Yes → Not VPC (B) - No → Continue
-
[20]
Does the inclusion of the particle create a non-compositional meaning that is significantly different from the verb’s original meaning? - No → Not VPC (C) - Yes → VPC (D)
A: Not a particle
B: Meaning remains similar without the particle
C: Particle does not significantly alter the meaning
D: Valid VPC (Particle significantly alters meaning)
Answer with r...
-
[21]
[CRAN] Contains cranberry word? Yes → VID No → Next test
-
[22]
[LEX] Regular replacement changes meaning? Yes → VID No → Next test
-
[23]
[MORPH] Morphological changes affect meaning? Yes → VID No → Next test
-
[24]
[MORPHSYNT] Morphosyntactic changes affect meaning? Yes → VID No → Next test
-
[25]
[SYNT] Syntactic changes affect meaning? Yes → VID No → Not VID
Examples:
- VID: ’kick the bucket’, ’let the cat out of the bag’
- Non-VID: ’take a walk’, ’make a decision’
Instructions:
-
[26]
Analyze each test sequentially
-
[27]
Provide brief reasoning for each test
-
[28]
gave a" is a light-verb construction (LVC), where
Conclude with ’Final Answer: [Yes/No]’
Is this candidate a Verbal Idiom (VID)? Apply the decision tree.
Table 9: Prompts for VMWE candidate paraphrasing.
VMWE: LVC
Prompt: You are an expert in linguistics. Given a sentence containing a multi-word expression (VMWE), a Light Verb Construct (LVC). Your task is to rephrase the sentence to remove the VMWE w...
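The two decision trees quoted in the prompt excerpts above (the three-question verb-particle construction tree and the five-test verbal idiom cascade) can be written as plain functions. This is a minimal sketch and not the paper's code: the function names and the boolean inputs are assumptions, and in practice each answer would come from an annotator or a model rather than being supplied by hand.

```python
# Test order as listed in the quoted VID prompt.
VID_TESTS = ("CRAN", "LEX", "MORPH", "MORPHSYNT", "SYNT")


def classify_vid(answers):
    """Return (is_vid, deciding_test) for a dict of test name -> bool.

    The first test answered 'Yes' decides VID; if every test is 'No',
    the candidate is not a VID.
    """
    for test in VID_TESTS:
        if answers.get(test, False):
            return True, test
    return False, None


def classify_vpc(is_particle, same_meaning_without_particle, non_compositional):
    """Return the outcome letter (A-D) of the three-question VPC tree.

    A: not a particle; B: meaning unchanged without the particle;
    C: particle does not significantly alter the meaning; D: valid VPC.
    """
    if not is_particle:
        return "A"
    if same_meaning_without_particle:
        return "B"
    return "D" if non_compositional else "C"


# Hypothetical answers for illustration only.
print(classify_vid({"CRAN": False, "LEX": True}))
print(classify_vpc(True, False, True))
```

Returning the deciding test or outcome letter, rather than a bare yes/no, makes it easier to audit why a candidate was accepted or rejected.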