arxiv: 2605.22247 · v1 · pith:65YICGREnew · submitted 2026-05-21 · 💻 cs.CL

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

Pith reviewed 2026-05-22 05:57 UTC · model grok-4.3

classification 💻 cs.CL
keywords idiom retrieval · semantic abstraction · figurative language · embedding models · retrieval benchmark · core meaning annotation · language models · idiomatic expressions

The pith

Embedding models struggle to retrieve the same core meaning when it is expressed as an idiom in one text and as a literal phrase in another.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents IdioLink, a retrieval benchmark containing 10,700 documents and 2,140 queries built around 107 idioms used in both literal and figurative senses. Each document and query carries annotations marking the spans that convey the core meaning. Tests on strong embedding models including BGE, E5, Contriever, and Qwen show these systems perform poorly at matching divergent surface forms that share an underlying meaning. The models instead depend on topical overlap or shallow lexical signals. A sympathetic reader would care because many real-world tasks, from search to question answering, require models to abstract past wording to reach equivalent ideas.

Core claim

We introduce IdioLink to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. The benchmark spans 107 idioms with both literal and figurative uses, each document and query annotated with spans that convey the core meaning. Evaluation of current embedding baselines reveals that models struggle to retrieve equivalent meanings across divergent surface realizations and instead rely on topical and shallow semantic cues.

What carries the argument

IdioLink, a retrieval benchmark of idiomatic and literal expressions paired with core-meaning-span annotations that forces models to abstract beyond lexical overlap.
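
As an illustration of what a core-meaning-span annotation might look like, here is a minimal sketch in Python. The record layout, the field names, the use of character offsets, and both example sentences are assumptions made for this illustration; none of them are taken from the paper or its released data. The sketch only shows how a figurative query and a literal document can share a meaning while their annotated spans share no words.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedText:
    """Hypothetical record layout; the field names are not taken from the paper."""
    text: str
    core_spans: List[Tuple[int, int]]  # half-open character offsets [start, end)

    def core_phrases(self) -> List[str]:
        """Return the substrings marked as carrying the core meaning."""
        return [self.text[start:end] for start, end in self.core_spans]


def locate(text: str, phrase: str) -> Tuple[int, int]:
    """Find the first occurrence of phrase and return its [start, end) offsets."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# Invented example pair: a figurative query and a literal document.
query_text = "Grandpa's old tractor finally kicked the bucket."
doc_text = "The machine stopped working for good."

query = AnnotatedText(query_text, [locate(query_text, "kicked the bucket")])
document = AnnotatedText(doc_text, [locate(doc_text, "stopped working for good")])

# Word overlap between the two annotated spans (empty for this pair).
query_words = set(" ".join(query.core_phrases()).lower().split())
doc_words = set(" ".join(document.core_phrases()).lower().split())
shared_words = query_words & doc_words

print(query.core_phrases(), document.core_phrases(), shared_words)
```

A retriever that scores this pair highly cannot be doing so through lexical overlap between the spans, which is the property the benchmark is described as testing.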

If this is right

  • Retrieval systems using current embeddings will miss relevant documents when queries or documents contain idioms.
  • Benchmark results indicate that semantic abstraction mechanisms beyond surface similarity must be developed.
  • IdioLink provides a concrete testbed for training or evaluating future models on figurative language.
  • Performance gaps suggest existing evaluation sets may overestimate model capability on non-literal input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that succeed on IdioLink would likely also be more robust on related phenomena such as metaphor or sarcasm, which likewise require abstraction.
  • Adding core-meaning supervision during pretraining could transfer to other retrieval tasks that involve paraphrasing.
  • Extending the benchmark to additional languages would reveal whether the observed gaps are language-specific or general.

Load-bearing premise

Human annotations correctly and without bias identify the core meaning spans that distinguish genuine semantic equivalence from mere topical similarity.

What would settle it

Run the models on IdioLink queries against a matched control set of topical distractors that lack core-meaning overlap. If the models rank the distractors at or above the true core-meaning matches, the claim is supported; if they reliably rank the core-meaning matches above the distractors, the reported struggle is falsified.
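
One way to quantify such a comparison is sketched below in Python. It assumes that each query has exactly one core-meaning match and one topical distractor and that a similarity score is available for each; the function name and the score values are invented for illustration and are not results from the paper.

```python
from typing import Sequence, Tuple


def core_preference_rate(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of queries whose core-meaning match outscores the topical distractor.

    Each pair is (similarity to core-meaning match, similarity to topical distractor).
    """
    if not score_pairs:
        raise ValueError("need at least one query")
    wins = sum(1 for core, topical in score_pairs if core > topical)
    return wins / len(score_pairs)


# Toy scores, made up for this example only.
toy_scores = [(0.71, 0.64), (0.52, 0.60), (0.80, 0.79), (0.41, 0.55)]
rate = core_preference_rate(toy_scores)
print(f"core match preferred in {rate:.0%} of queries")
```

A rate near or below one half would be consistent with reliance on topical cues, while a rate close to one would suggest that the model separates meaning from topic.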

Figures

Figures reproduced from arXiv: 2605.22247 by Daniel Fadlon, Jiahuan Pei, Kai Golan Hashiloni, Kfir Bar, Lior Livyatan, Ofri Hefetz.

Figure 1: The four different document types for a given …

Figure 2: Sentence and span embedding. The latter employs mean-pooling over the embeddings of the span tokens only. Index and similarity calculation: at retrieval time, queries are encoded with or without instructions and represented using sentence or span embeddings, while indexed documents are encoded without instructions as full text, reflecting a realistic retrieval setting without access to annotated spans.

Figure 3: Performance across our four inference-time configurations, where fine-tuning is done with no instructions …
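
The span embedding described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the token vectors are toy values, token indices are half-open, and the sentence embedding is assumed here to be a mean over all tokens, which the caption does not confirm (the evaluated models may pool differently).

```python
from typing import List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a non-empty sequence of equal-length vectors, dimension by dimension."""
    if not vectors:
        raise ValueError("cannot pool an empty sequence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def span_embedding(token_vectors: Sequence[Sequence[float]], start: int, end: int) -> Vector:
    """Mean-pool only the tokens in the half-open index range [start, end)."""
    if not 0 <= start < end <= len(token_vectors):
        raise ValueError("span must be a non-empty range inside the sequence")
    return mean_pool(token_vectors[start:end])


# Toy contextual token vectors for a four-token query (values are made up).
tokens = [[1.0, 0.0], [0.0, 2.0], [0.0, 4.0], [3.0, 0.0]]

sentence_vec = mean_pool(tokens)          # assumed sentence pooling: all tokens
span_vec = span_embedding(tokens, 1, 3)   # tokens 1 and 2 only

print(sentence_vec, span_vec)
```

Per the caption, only queries would use the span variant; documents are indexed as full text, so no annotated spans are needed on the document side.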

Original abstract

Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces IdioLink, a retrieval benchmark with 10,700 documents and 2,140 queries spanning 107 idioms that have both literal and figurative uses. Each document and query is annotated with spans conveying the core meaning. Strong embedding baselines (BGE, E5, Contriever, Qwen) are evaluated, with the central claim that current models struggle to retrieve equivalent meanings across divergent surface realizations and instead rely on topical and shallow semantic cues.

Significance. If the central claim holds after addressing the annotation and reporting gaps, the work would be significant for the field by providing a targeted testbed that isolates semantic abstraction failures in idiom handling. The explicit core-meaning span annotations are a clear strength, enabling more precise diagnosis of whether retrieval failures stem from surface cues rather than meaning equivalence.

major comments (3)
  1. [Dataset construction] Dataset construction section: the paper reports 10,700 documents and 107 idioms but supplies no details on collection, filtering, or controls for topical/lexical bias. This is load-bearing for the claim that models rely on shallow cues, because without such details it is impossible to rule out that the benchmark itself introduces regularities that make shallow matching artificially easy.
  2. [Annotation process] Annotation process (core-meaning spans): no inter-annotator agreement, adjudication protocol, or control experiments (e.g., span-swapping while preserving meaning) are reported. This directly undermines the central claim, as annotator bias toward surface or topical overlap could produce the observed performance gaps even if models were capable of true semantic abstraction.
  3. [Evaluation and results] Evaluation and results section: the abstract states the main finding that models struggle and rely on shallow cues, yet the provided text contains no quantitative retrieval metrics, error analysis, or breakdown by idiom type. Without these, the claim that performance reflects inability to abstract beyond surface form cannot be verified.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including one or two key quantitative results (e.g., top-1 or MRR scores for the strongest baseline) to support the qualitative claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. The comments highlight important areas for improving clarity and rigor in our presentation of the IdioLink benchmark. We address each major comment below and have revised the manuscript to incorporate additional details and analyses where feasible.

Point-by-point responses
  1. Referee: [Dataset construction] Dataset construction section: the paper reports 10,700 documents and 107 idioms but supplies no details on collection, filtering, or controls for topical/lexical bias. This is load-bearing for the claim that models rely on shallow cues, because without such details it is impossible to rule out that the benchmark itself introduces regularities that make shallow matching artificially easy.

    Authors: We agree that explicit details on dataset construction are necessary to support the claim regarding shallow cue reliance. In the revised manuscript we have expanded the Dataset Construction section to describe idiom selection from standard linguistic resources, document sourcing via balanced web queries targeting both literal and figurative contexts for each idiom, filtering steps for quality and relevance, and controls such as topic diversification across documents and lexical overlap minimization outside the idiom expressions themselves. We also added supporting analysis showing that topical regularities alone do not explain the performance patterns observed.

    revision: yes

  2. Referee: [Annotation process] Annotation process (core-meaning spans): no inter-annotator agreement, adjudication protocol, or control experiments (e.g., span-swapping while preserving meaning) are reported. This directly undermines the central claim, as annotator bias toward surface or topical overlap could produce the observed performance gaps even if models were capable of true semantic abstraction.

    Authors: We acknowledge that these methodological details were insufficiently reported. The revised manuscript now includes a new subsection detailing the annotation guidelines, inter-annotator agreement statistics computed over a sampled subset, the adjudication process for resolving disagreements, and results from control experiments that test annotation robustness by altering surface forms while holding core meaning constant. These additions help demonstrate that the observed model failures are not artifacts of annotation bias.

    revision: yes

  3. Referee: [Evaluation and results] Evaluation and results section: the abstract states the main finding that models struggle and rely on shallow cues, yet the provided text contains no quantitative retrieval metrics, error analysis, or breakdown by idiom type. Without these, the claim that performance reflects inability to abstract beyond surface form cannot be verified.

    Authors: We apologize if the quantitative results were not prominent enough in the reviewed version. The manuscript contains retrieval metrics (including Recall@k and nDCG) for the evaluated models; we have now expanded the Evaluation section with a dedicated error analysis subsection and breakdowns by idiom properties such as frequency and semantic decomposability. These additions provide direct evidence linking performance gaps to difficulties with meaning abstraction rather than surface cues.

    revision: yes
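
For readers unfamiliar with the metrics named in the third response, the following Python sketch shows the standard definitions of Recall@k and nDCG@k with binary relevance. It is a generic illustration, not the paper's evaluation code; the document IDs and the ranking are invented, and the paper may use graded relevance or different cutoffs.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("a query needs at least one relevant document")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(k, len(relevant_ids))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Invented ranking for one query; d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

recall_2 = recall_at_k(ranked, relevant, 2)
recall_4 = recall_at_k(ranked, relevant, 4)
ndcg_2 = ndcg_at_k(ranked, relevant, 2)

print(recall_2, recall_4, round(ndcg_2, 4))
```

Recall@k ignores the order of results inside the cutoff, whereas nDCG rewards placing relevant documents higher, which is why the two are commonly reported together.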

Circularity Check

0 steps flagged

No circularity: benchmark introduction and baseline evaluation are self-contained.

The paper constructs IdioLink as an external retrieval benchmark consisting of 10,700 documents, 2,140 queries, and core-meaning span annotations over 107 idioms, then reports empirical performance of independent embedding models (BGE, E5, Contriever, Qwen) on it. No derivations, equations, parameter fitting, or self-citations are invoked to generate the central claims; the observed gaps in cross-surface retrieval are presented as direct measurements against the newly introduced dataset rather than reductions to prior fitted values or author-defined uniqueness results. The work therefore remains externally falsifiable through the released annotations and queries without any load-bearing step that collapses back to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the standard NLP domain assumption that idioms are non-compositional and require abstraction beyond surface form; it introduces no free parameters, no new mathematical axioms, and no invented entities.

axioms (1)
  • Domain assumption: idioms pose a fundamental challenge because their meaning cannot be inferred from surface form alone.
    Directly stated in the opening sentence of the abstract as the motivation for the benchmark.

pith-pipeline@v0.9.0 · 5697 in / 1205 out tokens · 37511 ms · 2026-05-22T05:57:05.341457+00:00 · methodology

