pith. sign in

arxiv: 2606.18989 · v1 · pith:ZX5BN4ZLnew · submitted 2026-06-17 · 💻 cs.CL · cs.AI

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment

Pith reviewed 2026-06-26 20:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords cross-lingual idiom alignmentgloss-pivoted benchmarkmultiple-choice equivalencegloss-contrastive generationliteral translation biasLLM evaluationlow-resource languagessemantic pivot
0
0 comments X

The pith

English Wiktionary glosses improve cross-lingual idiom generation and matching for language models, though gains remain modest and literal biases persist.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a benchmark that anchors idioms from many languages to English glosses drawn from Wiktionary so that alignment can be measured in a controlled way. It runs two tests: one multiple-choice task that isolates error types and one generation task that contrasts model outputs when the gloss is supplied or withheld. Across tested models the gloss raises semantic similarity scores and reduces literal translations, yet absolute performance stays low, especially for low-resource target languages. The work therefore isolates the contribution of an explicit meaning pivot while documenting the remaining difficulty of non-compositional expressions.

Core claim

The paper claims that a gloss-pivoted benchmark enables reproducible cross-lingual idiom alignment evaluation, that supplying the English gloss during generation produces measurably better outputs under an embedding-based semantic proxy than the no-gloss condition, and that the resulting performance difference concentrates more in attention heads than across layers, with stronger gloss anchoring linked to higher-quality generations.

What carries the argument

The Gloss-Contrastive Generation protocol that compares model outputs on identical inputs with and without the English gloss to measure the effect of the explicit semantic pivot.

If this is right

  • Literal translation remains the dominant error mode and is only partially offset by gloss input.
  • Attention-head differences rather than layer-wide changes drive the observed gains in at least one tested model.
  • Generations that more strongly reference the gloss tend to receive higher semantic scores.
  • The high-confidence reference alignment set supports consistent future comparisons.
  • Large headroom remains in open-ended generation even after glosses are supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If gloss pivots help here, comparable meaning anchors could be tested on other figurative constructions such as metaphors or proverbs.
  • The concentration of effect in attention heads suggests targeted interventions at those positions might amplify the gloss benefit.
  • The modest absolute numbers imply that pairing glosses with additional signals such as example sentences could produce larger lifts.

Load-bearing premise

English Wiktionary glosses supply a reliable language-neutral description of idiom meaning that holds across different languages and cultures.

What would settle it

No measurable rise in embedding similarity or drop in literal-translation errors when the gloss is added, across a wider range of models and language pairs than tested here.

Figures

Figures reproduced from arXiv: 2606.18989 by Derek F. Wong, Fengying Ye, Lidia S. Chao, Runzhe Zhan, Yanming Sun, Zheqi Zhang.

Figure 1
Figure 1. Figure 1: Overview of the G-IdiomAlign construction pipeline. Using English glosses as a shared semantic pivot, [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Retention curves for OpenAI text-embedding [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Task 1 outcomes by target-language regime. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an English gloss from Wiktionary. We further construct a high-confidence reference alignment set for reproducible evaluation. G-IdiomAlign supports two protocols: (1) a controlled Multiple-Choice Idiom Equivalence with typed distractors for error attribution; and (2) a Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs to isolate the effect of an explicit semantic pivot. Across diverse LLMs, a bias to literal translation is a dominant failure mode, especially when the target is a low-resource language. Glosses consistently improve Gloss-Contrastive Generation under an embedding-based semantic proxy, but performance remains modest, indicating substantial headroom in the open output space. Subsequent analysis on Qwen3-8B further suggests that cross-condition differences are concentrated more in attention heads than in layers, while better With-gloss generations coincide with stronger gloss anchoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces G-IdiomAlign, a gloss-pivoted benchmark for cross-lingual idiom alignment anchored by English Wiktionary glosses. It constructs a high-confidence reference alignment set and supports two protocols: Multiple-Choice Idiom Equivalence with typed distractors for error analysis, and Gloss-Contrastive Generation contrasting No-gloss and With-gloss inputs. Experiments across LLMs identify literal translation bias as a dominant failure mode (especially for low-resource targets), show consistent but modest gains from glosses under an embedding-based semantic proxy, and report that cross-condition differences in Qwen3-8B are concentrated more in attention heads than layers, with better With-gloss outputs linked to stronger gloss anchoring.

Significance. If the central findings hold, the reproducible high-confidence reference alignment set and the controlled protocols provide a useful standardized resource for evaluating cross-lingual idiom handling in LLMs. The explicit isolation of gloss effects and the attention-head analysis offer concrete directions for mitigating literal bias in non-compositional transfer. The benchmark construction itself, with its focus on error attribution via distractors, strengthens the empirical contribution to multilingual NLP.

major comments (2)
  1. [Benchmark Construction / Gloss-Contrastive Generation protocol] The claim that glosses improve Gloss-Contrastive Generation under the embedding proxy rests on English Wiktionary glosses serving as reliable language-neutral semantic pivots. No validation of gloss fidelity (e.g., human ratings of meaning preservation across target languages or comparison to native-speaker paraphrases) is described in the benchmark construction or evaluation sections; this is load-bearing because any English-centric bias in the glosses would directly confound the With-gloss vs. No-gloss contrast and the reported modest gains.
  2. [Benchmark Construction] Details on idiom selection criteria, the construction of the high-confidence reference alignment set, and the exact typing of distractors in the Multiple-Choice task are absent from the high-level description. These omissions prevent assessment of whether the observed literal bias is an artifact of data curation rather than a general LLM property.
minor comments (1)
  1. [Evaluation / Gloss-Contrastive Generation] The specific embedding model and similarity metric constituting the 'embedding-based semantic proxy' should be stated explicitly, along with any preprocessing steps, to allow exact reproduction of the generation evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below.

read point-by-point responses
  1. Referee: [Benchmark Construction / Gloss-Contrastive Generation protocol] The claim that glosses improve Gloss-Contrastive Generation under the embedding proxy rests on English Wiktionary glosses serving as reliable language-neutral semantic pivots. No validation of gloss fidelity (e.g., human ratings of meaning preservation across target languages or comparison to native-speaker paraphrases) is described in the benchmark construction or evaluation sections; this is load-bearing because any English-centric bias in the glosses would directly confound the With-gloss vs. No-gloss contrast and the reported modest gains.

    Authors: We agree this is a valid concern. The manuscript uses Wiktionary glosses as provided anchors without reporting cross-lingual fidelity validation. We will revise the benchmark construction section to explicitly discuss this assumption, note its potential to introduce English-centric bias, and add a limitations paragraph on the need for future human validation studies. The reported gains remain tied to the embedding proxy as stated. revision: partial

  2. Referee: [Benchmark Construction] Details on idiom selection criteria, the construction of the high-confidence reference alignment set, and the exact typing of distractors in the Multiple-Choice task are absent from the high-level description. These omissions prevent assessment of whether the observed literal bias is an artifact of data curation rather than a general LLM property.

    Authors: The full manuscript contains a benchmark construction section with these elements, but we acknowledge the descriptions are insufficiently detailed. We will expand it to specify idiom selection criteria (e.g., Wiktionary coverage and inclusion filters), the exact procedure for building the high-confidence reference alignment set, and the distractor typing scheme (literal, semantic, unrelated). This will enable readers to evaluate whether literal bias is curation-dependent. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation against external references

full rationale

The paper constructs a new benchmark (G-IdiomAlign) anchored to external Wiktionary glosses and evaluates LLMs on multiple-choice and generation tasks. No mathematical derivations, fitted parameters, predictions, or self-citation chains are present in the provided text. The central claims rest on empirical contrasts (with-gloss vs no-gloss) measured against independent embedding proxies and human references, with no reduction of outputs to inputs by construction. This is the expected non-finding for dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark paper with no free parameters, axioms, or invented entities; relies on standard assumptions about idiom non-compositionality and Wiktionary gloss quality.

pith-pipeline@v0.9.1-grok · 5743 in / 1004 out tokens · 14292 ms · 2026-06-26T20:37:52.982851+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

34 extracted references · 16 canonical work pages

  1. [1]

    MAGPIE : A Large Corpus of Potentially Idiomatic Expressions

    Haagsma, Hessel and Bos, Johan and Nissim, Malvina. MAGPIE : A Large Corpus of Potentially Idiomatic Expressions. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020

  2. [2]

    Rolling the DICE on Idiomaticity: How LLM s Fail to Grasp Context

    Mi, Maggie and Villavicencio, Aline and Moosavi, Nafise Sadat. Rolling the DICE on Idiomaticity: How LLM s Fail to Grasp Context. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.362

  3. [3]

    M ulti C o PIE : A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation

    Sentsova, Uliana and Ciminari, Debora and Genabith, Josef Van and Espa \ n a-Bonet, Cristina. M ulti C o PIE : A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation. Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025). 2025. doi:10.18653/v1/2025.mwe-1.8

  4. [4]

    2006.09479 , archivePrefix=

    Prateek Saxena and Soma Paul , year=. 2006.09479 , archivePrefix=

  5. [5]

    A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models

    De Luca Fornaciari, Francesca and Altuna, Bego \ n a and Gonzalez-Dios, Itziar and Melero, Maite. A Hard Nut to Crack: Idiom Detection with Conversational Large Language Models. Proceedings of the 4th Workshop on Figurative Language Processing (FigLang 2024). 2024. doi:10.18653/v1/2024.figlang-1.5

  6. [6]

    ID 10 M : Idiom Identification in 10 Languages

    Tedeschi, Simone and Martelli, Federico and Navigli, Roberto. ID 10 M : Idiom Identification in 10 Languages. Findings of the Association for Computational Linguistics: NAACL 2022. 2022. doi:10.18653/v1/2022.findings-naacl.208

  7. [7]

    Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions

    Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Waszczuk, Jakub and Candito, Marie and Vaidya, Ashwini and Barbu Mititelu, Verginica and Bhatia, Archna and I. Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions. Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons. 2020

  8. [8]

    CLIX : Cross-Lingual Explanations of Idiomatic Expressions

    Gluck, Aaron and Wense, Katharina Von Der and Pacheco, Maria Leonor. CLIX : Cross-Lingual Explanations of Idiomatic Expressions. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.233

  9. [9]

    LI dioms: A Multilingual Linked Idioms Data Set

    Moussallem, Diego and Sherif, Mohamed Ahmed and Esteves, Diego and Zampieri, Marcos and Ngonga Ngomo, Axel-Cyrille. LI dioms: A Multilingual Linked Idioms Data Set. Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018). 2018

  10. [10]

    CHENGYU - BENCH : Benchmarking Large Language Models for C hinese Idiom Understanding and Use

    Fu, Yicheng and Huang, Zhemin and Yang, Liuxin and Lu, Yumeng and Dai, Zhongdongming. CHENGYU - BENCH : Benchmarking Large Language Models for C hinese Idiom Understanding and Use. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.119

  11. [11]

    and Doh, Joon Young and Rodan, Eid and Zhu, Kevin and O ' Brien, Sean

    Donthi, Sundesh and Spencer, Maximilian and Patel, Om B. and Doh, Joon Young and Rodan, Eid and Zhu, Kevin and O ' Brien, Sean. Improving LLM Abilities in Idiomatic Translation. Proceedings of the First Workshop on Language Models for Low-Resource Languages. 2025

  12. [12]

    From Neural Machine Translation to Large Language Models: Analysing Translation Quality of Chinese Idioms , author=

  13. [13]

    Evaluating

    Cai Yang and Yao Dou and David Heineman and Xiaofeng Wu and Wei Xu , journal=. Evaluating. 2025 , volume=

  14. [14]

    Automating Idiom Translation with Cross-Lingual Natural Language Generation Grounded In Semantic Analyses Using Large Language Models

    Qian, Ming. Automating Idiom Translation with Cross-Lingual Natural Language Generation Grounded In Semantic Analyses Using Large Language Models. Proceedings of the 16th Conference of the Association for Machine Translation in the Americas (Volume 2: Presentations). 2024

  15. [15]

    Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting

    Liu, Emmy and Chaudhary, Aditi and Neubig, Graham. Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.933

  16. [16]

    Proceedings of the AAAI Conference on Artificial Intelligence , author=

    Translate Meanings, Not Just Words:. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2024 , month=. doi:10.1609/aaai.v38i17.29817 , abstractNote=

  17. [17]

    Comparative Study of Multilingual Idioms and Similes in Large Language Models

    Khoshtab, Paria and Namazifard, Danial and Masoudi, Mostafa and Akhgary, Ali and Mahdizadeh Sani, Samin and Yaghoobzadeh, Yadollah. Comparative Study of Multilingual Idioms and Similes in Large Language Models. Proceedings of the 31st International Conference on Computational Linguistics. 2025

  18. [18]

    On Bilingual Lexicon Induction with Large Language Models

    Li, Yaoyiran and Korhonen, Anna and Vuli \'c , Ivan. On Bilingual Lexicon Induction with Large Language Models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.595

  19. [19]

    Proceedings of the AAAI Conference on Artificial Intelligence , author=

    Enhancing Bilingual Lexicon Induction via Bi-directional Translation Pair Retrieving , volume=. Proceedings of the AAAI Conference on Artificial Intelligence , author=. 2024 , month=. doi:10.1609/aaai.v38i16.29744 , abstractNote=

  20. [20]

    Investigating Idiomaticity in Word Representations

    He, Wei and Vieira, Tiago Kramer and Garcia, Marcos and Scarton, Carolina and Idiart, Marco and Villavicencio, Aline. Investigating Idiomaticity in Word Representations. Computational Linguistics. 2025. doi:10.1162/coli_a_00546

  21. [21]

    Evaluating Machine Translation Performance on C hinese Idioms with a Blacklist Method

    Shao, Yutong and Sennrich, Rico and Webber, Bonnie and Fancellu, Federico. Evaluating Machine Translation Performance on C hinese Idioms with a Blacklist Method. Proceedings of the Eleventh International Conference on Language Resources and Evaluation ( LREC 2018). 2018

  22. [22]

    2002 , organization=

    Sag, Ivan A and Baldwin, Timothy and Bond, Francis and Copestake, Ann and Flickinger, Dan , booktitle=. 2002 , organization=

  23. [23]

    A ttention is not E xplanation

    Jain, Sarthak and Wallace, Byron C. A ttention is not E xplanation. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1357

  24. [24]

    Contextualized Embeddings Encode Monolingual and Cross-lingual Knowledge of Idiomaticity

    Fakharian, Samin and Cook, Paul. Contextualized Embeddings Encode Monolingual and Cross-lingual Knowledge of Idiomaticity. Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021). 2021. doi:10.18653/v1/2021.mwe-1.4

  25. [25]

    CLCL : Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning

    Zhou, Jianing and Zeng, Ziheng and Bhat, Suma. CLCL : Non-compositional Expression Detection with Contrastive Learning and Curriculum Learning. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.43

  26. [26]

    PIE : A Parallel Idiomatic Expression Corpus for Idiomatic Sentence Generation and Paraphrasing

    Zhou, Jianing and Gong, Hongyu and Bhat, Suma. PIE : A Parallel Idiomatic Expression Corpus for Idiomatic Sentence Generation and Paraphrasing. Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021). 2021. doi:10.18653/v1/2021.mwe-1.5

  27. [27]

    IEKG : A Commonsense Knowledge Graph for Idiomatic Expressions

    Zeng, Ziheng and Cheng, Kellen Tan and Nanniyur, Srihari Venkat and Zhou, Jianing and Bhat, Suma. IEKG : A Commonsense Knowledge Graph for Idiomatic Expressions. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.881

  28. [28]

    DeepSeek-AI and Aixin Liu and Aoxue Mei and Bangcai Lin and Bing Xue and Bingxuan Wang and Bingzheng Xu and Bochao Wu and Bowei Zhang and Chaofan Lin and Chen Dong and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenhao Xu and Chong Ruan and Damai Dai and Daya Guo and Dejian Yang and Deli Chen and Erhang Li and Fangqi Zhou and Fangyun Lin and Fucon...

  29. [29]

    2025 , journal =

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author =. 2025 , journal =

  30. [30]

    2024 , url=

    Un Ministral, des Ministraux , author=. 2024 , url=

  31. [31]

    2025 , eprint =

    Qwen3 Technical Report , author =. 2025 , eprint =

  32. [32]

    An Yang and Junyang Lin and Jingren Zhou and et al. , year=. Qwen3-Embedding-

  33. [33]

    Proceedings of the AAAI Conference on Artificial Intelligence , volume=

    Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=

  34. [34]

    Chao and Derek F

    Fengying Ye and Shanshan Wang and Lidia S. Chao and Derek F. Wong. Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2026