WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

Jose Camacho-Collados; Mohammad Taher Pilehvar

arxiv: 1808.09121 · v3 · pith:IF4PNJ2Onew · submitted 2018-08-28 · 💻 cs.CL

WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

Mohammad Taher Pilehvar , Jose Camacho-Collados This is my paper

classification 💻 cs.CL

keywords datasetevaluationwordwordsaddresscontext-sensitivedynamicembeddings

0 comments

read the original abstract

By design, word embeddings are unable to model the dynamic nature of words' semantics, i.e., the property of words to correspond to potentially different meanings. To address this limitation, dozens of specialized meaning representation techniques such as sense or contextualized embeddings have been proposed. However, despite the popularity of research on this topic, very few evaluation benchmarks exist that specifically focus on the dynamic semantics of words. In this paper we show that existing models have surpassed the performance ceiling of the standard evaluation dataset for the purpose, i.e., Stanford Contextual Word Similarity, and highlight its shortcomings. To address the lack of a suitable benchmark, we put forward a large-scale Word in Context dataset, called WiC, based on annotations curated by experts, for generic evaluation of context-sensitive representations. WiC is released in https://pilehvar.github.io/wic/.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
OLMo: Accelerating the Science of Language Models
cs.CL 2024-02 accept novelty 7.0

OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.
The Power of Scale for Parameter-Efficient Prompt Tuning
cs.CL 2021-04 unverdicted novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
Lessons from the Trenches on Reproducible Evaluation of Language Models
cs.CL 2024-05 accept novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
cs.CL 2019-05 accept novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution
cs.AI 2025-08 unverdicted novelty 5.0

SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.