Supervised Contextual Embeddings for Transfer Learning in Natural Language Processing Tasks

Aditya Siddhant; Anthony Tomasic; Matthias Grabmair; Mihir Kale; Radhika Parik; Sreyashi Nag

arxiv: 1906.12039 · v1 · pith:KE57YIUVnew · submitted 2019-06-28 · 💻 cs.CL · cs.LG

Supervised Contextual Embeddings for Transfer Learning in Natural Language Processing Tasks

Mihir Kale , Aditya Siddhant , Sreyashi Nag , Radhika Parik , Matthias Grabmair , Anthony Tomasic This is my paper

Pith reviewed 2026-05-25 14:13 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords supervised embeddingstransfer learningcontextual embeddingslow-resource NLPcross-task transfercross-domain transfercross-lingual transfer

0 comments

The pith

Representations extracted from supervised pre-trained models enrich word embeddings with task and domain knowledge that aids transfer learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests the idea of pulling embeddings from already-trained supervised models rather than building them only through unsupervised language modeling. Experiments across different tasks, domains, and languages show these embeddings improve results most clearly when labeled data is scarce. The size of the improvement still varies with the particular task and domain. A reader would care because many practical NLP problems face data shortages, so any method that reuses existing supervised models could lower the cost of starting new applications.

Core claim

The central claim is that representations taken from multiple pre-trained supervised models supply task-specific and domain-specific knowledge missing from standard unsupervised embeddings, and that these supervised embeddings produce measurable gains in cross-task, cross-domain, and cross-lingual transfer, with the largest effects appearing in low-resource conditions.

What carries the argument

Extracting contextual representations from multiple pre-trained supervised models and using them to enrich word embeddings.

If this is right

The supervised embeddings deliver their largest benefit when labeled data for the target task is limited.
The magnitude of improvement depends on the specific task and domain pair.
The same supervised representations also support cross-lingual transfer.
Multiple supervised models can be combined to produce the enriched embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach suggests that task-specific supervision encodes information that general unsupervised pre-training misses.
It could lower the labeled-data requirement for deploying NLP systems in new domains.
Future work might test whether the same supervised-extraction step improves performance when added to much larger contemporary language models.

Load-bearing premise

The knowledge captured inside supervised pre-trained models is both transferable to new settings and not already present in unsupervised embeddings.

What would settle it

Re-running the cross-task and cross-domain experiments in low-resource conditions and observing no accuracy gains or outright losses when swapping in the supervised embeddings would falsify the central claim.

read the original abstract

Pre-trained word embeddings are the primary method for transfer learning in several Natural Language Processing (NLP) tasks. Recent works have focused on using unsupervised techniques such as language modeling to obtain these embeddings. In contrast, this work focuses on extracting representations from multiple pre-trained supervised models, which enriches word embeddings with task and domain specific knowledge. Experiments performed in cross-task, cross-domain and cross-lingual settings indicate that such supervised embeddings are helpful, especially in the low-resource setting, but the extent of gains is dependent on the nature of the task and domain. We make our code publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Supervised embeddings from pre-trained models add some value in low-resource transfer settings but the gains are task-dependent and the abstract leaves the size of improvements unclear.

read the letter

The paper's main claim is that representations pulled from multiple supervised pre-trained models carry task and domain knowledge that standard unsupervised embeddings miss, and this helps most in low-resource cross-task, cross-domain, and cross-lingual transfer. They test the idea directly against unsupervised baselines and note that benefits depend on the specific task and domain. The public code release is a clear plus for anyone wanting to check or extend the work. The experiments cover a useful range of transfer settings, which is the part that actually moves the needle beyond the basic idea of using supervised sources. The central assumption—that supervised models hold transferable knowledge not already in unsupervised embeddings—gets tested in the reported setups without obvious circularity or mismatched controls. The main soft spot is that the abstract gives no numbers, baselines, or statistical details, so the strength of the evidence is hard to judge from the summary alone. If the full paper shows consistent gains with proper controls, the result is straightforward and usable. This is the kind of paper that matters for people building embeddings for specialized or low-resource NLP applications. It is worth sending to peer review because the question is practical, the comparison is empirical, and the code is available.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes extracting contextual embeddings from multiple pre-trained supervised models to enrich standard word embeddings with task- and domain-specific knowledge. It evaluates this approach via experiments in cross-task, cross-domain, and cross-lingual transfer settings, claiming that the supervised embeddings are helpful (especially in low-resource regimes) but that the magnitude of gains depends on the nature of the task and domain. The authors release their code publicly.

Significance. If the experimental results hold after proper controls and statistical reporting, the work would usefully demonstrate that supervised pre-training can inject transferable task/domain knowledge beyond what unsupervised methods (e.g., language modeling) capture, with particular value in low-resource transfer. The public code release is a clear strength for reproducibility.

major comments (1)

[Abstract] Abstract: the claim that 'experiments performed in cross-task, cross-domain and cross-lingual settings indicate that such supervised embeddings are helpful, especially in the low-resource setting' is asserted without any quantitative results, baselines, effect sizes, statistical tests, or data-exclusion criteria visible in the provided text, preventing verification that the data support the stated claim.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review. The sole major comment concerns the level of detail in the abstract; we address it directly below. The manuscript body contains the requested quantitative details, baselines, and evaluation criteria.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'experiments performed in cross-task, cross-domain and cross-lingual settings indicate that such supervised embeddings are helpful, especially in the low-resource setting' is asserted without any quantitative results, baselines, effect sizes, statistical tests, or data-exclusion criteria visible in the provided text, preventing verification that the data support the stated claim.

Authors: The abstract is intended as a concise summary of conclusions drawn from the full set of experiments. Quantitative results (including performance deltas versus unsupervised baselines such as word2vec and ELMo, effect sizes broken down by resource level, and dataset details) appear in Sections 4–6, with explicit descriptions of training/test splits, low-resource subsampling procedures, and evaluation metrics. No statistical significance tests were reported in the original submission; we can add them if required. We are willing to expand the abstract with one or two representative numbers (e.g., average F1 gain in low-resource cross-task settings) while respecting length constraints. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical comparison only

full rationale

The paper reports experiments comparing embeddings extracted from supervised pre-trained models against standard unsupervised embeddings across cross-task, cross-domain, and cross-lingual transfer settings. No derivation chain, first-principles predictions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described claims. The work is self-contained as an empirical evaluation whose results are falsifiable via replication on public code; no step reduces by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Ledger populated from abstract alone; full paper may introduce additional modeling choices or data assumptions.

axioms (1)

domain assumption Supervised pre-trained models capture task and domain specific knowledge in their internal representations that transfers to new settings.
This premise underpins why supervised embeddings are expected to outperform or complement unsupervised ones.

pith-pipeline@v0.9.0 · 5640 in / 1287 out tokens · 45975 ms · 2026-05-25T14:13:56.046432+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 3 internal anchors

[1]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. In Association for Computational Linguistics (ACL)

work page 2017
[2]

Joshua Coates and Danushka Bollegala. 2018. Frustratingly easy meta-embedding -- computing meta-embeddings by averaging source word embeddings. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

work page 2018
[3]

Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo \"i c Barrault, and Antoine Bordes. 2017 a . Supervised learning of universal sentence representations from natural language inference data. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017
[4]

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herv \'e J \'e gou. 2017 b . Word translation without parallel data. In International Conference on Machine Learning (ICLR)

work page 2017
[5]

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. In International Conference on Machine Learning (ICLR)

work page 2016
[6]

Kazuma Hashimoto, Yoshimasa Tsuruoka, Richard Socher, et al. 2017. A joint many-task model: Growing a neural network for multiple nlp tasks. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017
[7]

Zettlemoyer

Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Association for Computational Linguistics (ACL)

work page 2017
[8]

Young-Bum Kim, Karl Stratos, and Dongchan Kim. 2017. Domain attention with an ensemble of experts. In Association for Computational Linguistics (ACL)

work page 2017
[9]

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017
[10]

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Neural Information Processing Systems (NIPS)

work page 2017
[11]

Todor Mihaylov, Zornitsa Kozareva, and Anette Frank. 2017. Neural skill transfer from supervised language tasks to reading comprehension. arXiv preprint arXiv:1711.03754

work page internal anchor Pith review Pith/arXiv arXiv 2017
[12]

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in nlp applications? In Empirical Methods in Natural Language Processing (EMNLP)

work page 2016
[13]

Phoebe Mulcaire, Swabha Swayamdipta, and Noah Smith. 2018. Polyglot semantic role labeling. arXiv preprint arXiv:1805.11598

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Empirical methods in natural language processing (EMNLP)

work page 2014
[15]

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Association for Computational Linguistics (ACL)

work page 2017
[16]

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

work page 2018
[17]

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint Conference on EMNLP and CoNLL-Shared Task

work page 2012
[18]

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098

work page internal anchor Pith review Pith/arXiv arXiv 2017
[19]

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Association for Computational Linguistics (ACL)

work page 2017
[20]

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2018
[21]

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Conference on Natural language learning (CoNLL)

work page 2003
[22]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page
[23]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[1] [1]

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. In Association for Computational Linguistics (ACL)

work page 2017

[2] [2]

Joshua Coates and Danushka Bollegala. 2018. Frustratingly easy meta-embedding -- computing meta-embeddings by averaging source word embeddings. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

work page 2018

[3] [3]

Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo \"i c Barrault, and Antoine Bordes. 2017 a . Supervised learning of universal sentence representations from natural language inference data. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017

[4] [4]

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Herv \'e J \'e gou. 2017 b . Word translation without parallel data. In International Conference on Machine Learning (ICLR)

work page 2017

[5] [5]

Timothy Dozat and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing. In International Conference on Machine Learning (ICLR)

work page 2016

[6] [6]

Kazuma Hashimoto, Yoshimasa Tsuruoka, Richard Socher, et al. 2017. A joint many-task model: Growing a neural network for multiple nlp tasks. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017

[7] [7]

Zettlemoyer

Luheng He, Kenton Lee, Mike Lewis, and Luke S. Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Association for Computational Linguistics (ACL)

work page 2017

[8] [8]

Young-Bum Kim, Karl Stratos, and Dongchan Kim. 2017. Domain attention with an ensemble of experts. In Association for Computational Linguistics (ACL)

work page 2017

[9] [9]

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2017

[10] [10]

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Neural Information Processing Systems (NIPS)

work page 2017

[11] [11]

Todor Mihaylov, Zornitsa Kozareva, and Anette Frank. 2017. Neural skill transfer from supervised language tasks to reading comprehension. arXiv preprint arXiv:1711.03754

work page internal anchor Pith review Pith/arXiv arXiv 2017

[12] [12]

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. 2016. How transferable are neural networks in nlp applications? In Empirical Methods in Natural Language Processing (EMNLP)

work page 2016

[13] [13]

Phoebe Mulcaire, Swabha Swayamdipta, and Noah Smith. 2018. Polyglot semantic role labeling. arXiv preprint arXiv:1805.11598

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Empirical methods in natural language processing (EMNLP)

work page 2014

[15] [15]

Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Association for Computational Linguistics (ACL)

work page 2017

[16] [16]

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)

work page 2018

[17] [17]

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. Conll-2012 shared task: Modeling multilingual unrestricted coreference in ontonotes. In Joint Conference on EMNLP and CoNLL-Shared Task

work page 2012

[18] [18]

Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Mitchell Stern, Jacob Andreas, and Dan Klein. 2017. A minimal span-based neural constituency parser. In Association for Computational Linguistics (ACL)

work page 2017

[20] [20]

Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Empirical Methods in Natural Language Processing (EMNLP)

work page 2018

[21] [21]

Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. In Conference on Natural language learning (CoNLL)

work page 2003

[22] [22]

URL: " 'urlintro :=

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type volume year eprint doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRINGS urlintro eprinturl eprintpr...

work page

[23] [23]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page