A Mutual Information Maximization Perspective of Language Representation Learning

Cyprien de Masson d'Autume; Dani Yogatama; Lei Yu; Lingpeng Kong; Wang Ling; Zihang Dai

arxiv: 1910.08350 · v2 · pith:L2HKZUJSnew · submitted 2019-10-18 · 💻 cs.CL · cs.LG

Lingpeng Kong, Cyprien de Masson d'Autume, Wang Ling, Lei Yu, Zihang Dai, Dani Yogatama

classification 💻 cs.CL cs.LG

keywords information · methods · mutual · representation · learning · sentence · word · computer

We show that state-of-the-art word representation learning methods maximize an objective function that is a lower bound on the mutual information between different parts of a word sequence (i.e., a sentence). Our formulation provides an alternative perspective that unifies classical word embedding models (e.g., Skip-gram) and modern contextual embeddings (e.g., BERT, XLNet). In addition to enhancing our theoretical understanding of these methods, our derivation leads to a principled framework that can be used to construct new self-supervised tasks. We provide an example by drawing inspiration from related methods based on mutual information maximization that have been successful in computer vision, and introduce a simple self-supervised objective that maximizes the mutual information between a global sentence representation and n-grams in the sentence. Our analysis offers a holistic view of representation learning methods to transfer knowledge and translate progress across multiple domains (e.g., natural language processing, computer vision, audio processing).
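
The abstract does not name the bound it refers to. One widely used lower bound on mutual information in contrastive representation learning is InfoNCE, and the sketch below assumes that form purely for illustration. The function name `info_nce_lower_bound` and the `scores` matrix are hypothetical and not taken from the paper: `scores[i][j]` stands for a critic score between the i-th context and the j-th candidate, with the true pairs on the diagonal.

```python
import math


def info_nce_lower_bound(scores):
    """Batch estimate of an InfoNCE-style lower bound on mutual information.

    scores[i][j] is a (hypothetical) critic score f(a_i, b_j). The diagonal
    holds the true pairs; the other entries in each row act as negatives.
    Returns the mean of  f(a_i, b_i) - log sum_j exp f(a_i, b_j),  plus log N.
    """
    n = len(scores)
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the row max for a stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in row))
        total += row[i] - log_sum_exp
    return total / n + math.log(n)


# An uninformative critic (all scores equal) gives an estimate of zero.
flat = info_nce_lower_bound([[0.0, 0.0], [0.0, 0.0]])

# A critic that scores true pairs much higher approaches the cap of log N.
sharp = info_nce_lower_bound([[10.0, 0.0], [0.0, 10.0]])
```

Because the true pair is included in the normalizing sum, this estimate can never exceed log N for N candidates, which is one reason such objectives tend to benefit from many negatives.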

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Understanding the Robustness of Distributed Self-Supervised Learning Frameworks Against Non-IID Data
cs.LG · 2026-07 · unverdicted · novelty 4.0

Abstract-only report: theoretical comparison finds MIM more robust than CL to non-IID data in D-SSL and robustness scales with connectivity; MAR loss proposed as practical application.