
arxiv: 1906.09535 · v1 · pith:CFRSTWHRnew · submitted 2019-06-23 · 💻 cs.CL

Variational Sequential Labelers for Semi-Supervised Learning

Pith reviewed 2026-05-25 18:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords: semi-supervised learning · sequence labeling · variational methods · latent variables · multitask learning · word prediction · discriminative models

The pith

A family of multitask variational methods combines latent-variable generative models with discriminative labelers for semi-supervised sequence labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces variational methods that pair a generative model using latent variables to predict words from context with a discriminative labeler. This setup allows the model to leverage unlabeled data by transferring information through the shared latent space. The approach explores different latent variable structures, including hierarchical ones that separate label and word information. If successful, this would mean better performance on sequence labeling tasks like part-of-speech tagging and named entity recognition when labeled data is limited. A sympathetic reader would care because many NLP applications suffer from scarce annotations.

Core claim

Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.

What carries the argument

A multitask variational setup in which a latent-variable generative word-prediction model and a discriminative labeler share one latent space, so that information learned from predicting words can transfer to labeling.
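To make the shared-latent idea concrete, here is a minimal per-token sketch of how a word-prediction term, a KL term, and an optional labeling term could be combined. It is an illustration under stated assumptions, not the authors' implementation: the standard normal prior, the linear softmax heads, the weights `kl_weight` and `cls_weight`, and every function name are hypothetical. In the paper the latent distribution would come from a context encoder, whereas here `mu` and `log_var` are simply passed in.

```python
import math
import random


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def linear(weights, z):
    """Apply a weight matrix (list of rows) to the latent vector z."""
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ); the N(0, I) prior is an assumption."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def token_loss(mu, log_var, word_id, word_weights,
               label_id=None, label_weights=None,
               kl_weight=1.0, cls_weight=1.0, rng=random):
    """Loss for one token; the labeling term is added only when a label exists."""
    # Reparameterised sample from the approximate posterior over the latent.
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    # Generative term: negative log-probability of the observed word given z.
    recon = -log_softmax(linear(word_weights, z))[word_id]
    loss = recon + kl_weight * kl_to_standard_normal(mu, log_var)
    if label_id is not None:
        # Discriminative term: the labeler reads the same latent vector.
        loss += cls_weight * -log_softmax(linear(label_weights, z))[label_id]
    return loss
```

Unlabeled tokens contribute only the first two terms, which is how a setup of this kind can consume unlabeled text while the labeled tokens shape the same latent space.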

If this is right

  • Models outperform standard sequential baselines on eight sequence labeling datasets.
  • Performance improves when additional unlabeled data is used.
  • Hierarchical latent variable configurations account for both label-specific and word-specific information.
  • The variational methods enable semi-supervised learning by combining generative and discriminative objectives.
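The hierarchical configurations mentioned above are not specified in this summary, so the following is only one plausible wiring, shown to illustrate what a two-level latent costs in the objective. It assumes a label-oriented latent `y` with a standard normal prior and a word-oriented latent `z` whose prior depends on `y`; the function names, the direction of the dependency, and the simplification of evaluating the conditional prior at the mean of `y` rather than at a sample are all assumptions.

```python
import math


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL( N(mu_q, diag(exp(log_var_q))) || N(mu_p, diag(exp(log_var_p))) )."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def hierarchical_kl(q_y, q_z, prior_z_given_y):
    """Sum of the two KL terms for a hypothetical y -> z hierarchy.

    q_y and q_z are (mu, log_var) pairs; prior_z_given_y maps a y vector
    to the (mu, log_var) of the conditional prior over z.
    """
    mu_y, log_var_y = q_y
    mu_z, log_var_z = q_z
    zeros = [0.0] * len(mu_y)
    # Top level: y is regularised toward N(0, I).
    kl_y = kl_diag_gaussians(mu_y, log_var_y, zeros, zeros)
    # Lower level: z is regularised toward a prior computed from y
    # (evaluated at the mean of y here, as a simplification).
    mu_p, log_var_p = prior_z_given_y(mu_y)
    kl_z = kl_diag_gaussians(mu_z, log_var_z, mu_p, log_var_p)
    return kl_y + kl_z
```

In a design like this the labeling loss would attach to `y` and the word-prediction loss to `z`, which is one way label-specific and word-specific information could be kept apart.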

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such models could potentially be applied to other sequence tasks beyond labeling, like parsing or translation.
  • Integrating these variational labelers with modern pre-trained language models might yield additional gains.
  • The hierarchical structure might help in tasks with fine-grained label distinctions.

Load-bearing premise

The variational approximation and latent variable configurations are sufficient to transfer useful information from the generative word-prediction objective to the discriminative labeling task.

What would settle it

If experiments on the eight sequence labeling datasets show no consistent outperformance over standard baselines or no further improvement with unlabeled data, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 1906.09535 by Karen Livescu, Kevin Gimpel, Mingda Chen, Qingming Tang.

Figure 1: Variational sequential labelers. The first row ...
Figure 2: Comparison of attaching classification loss to ...
Figure 3: t-SNE visualization of Gaussian latent vari...
Figure 4: Twitter dev accuracies (%) when varying the ...

Original abstract

We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a family of multitask variational methods for semi-supervised sequence labeling consisting of a latent-variable generative model (inspired by word-prediction objectives) paired with a discriminative labeler. The generative models use latent variables, including hierarchical configurations, to define p(word|context) while the labeler injects discriminative signal into the latent space. The central empirical claim is that the resulting models consistently outperform standard sequential baselines on 8 sequence labeling datasets and improve further when unlabeled data is incorporated.

Significance. If the variational components and latent configurations are shown to be responsible for the gains (rather than the multitask auxiliary objective alone), the approach could offer a practical way to combine generative word-level modeling with discriminative labeling for improved semi-supervised performance on sequence tasks.

major comments (1)
  1. The experimental evaluation (as summarized in the abstract) reports consistent outperformance and gains from unlabeled data but provides no ablation that replaces the variational latent generative model with a deterministic auxiliary language-modeling objective sharing the same parameters. Without this control, it is not possible to isolate whether improvements arise from the variational approximation and hierarchical latents or simply from the multitask word-prediction auxiliary task; this directly bears on the central claim that the variational latent-variable setup transfers useful information to the labeler.
minor comments (1)
  1. The abstract and introduction would benefit from explicit statements of the exact baseline architectures and whether any of them already incorporate auxiliary language-modeling losses.
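The control requested in the major comment can be pictured as a single switch on the encoder output. The sketch below is hypothetical (the function name, the `variational` flag, and the standard normal prior are assumptions, and nothing here is taken from the paper's code): with the flag off, both heads read the deterministic mean and no KL penalty applies, which corresponds to a plain auxiliary word-prediction objective; with the flag on, they read a sampled latent and pay the KL cost.

```python
import math
import random


def encode(mu, log_var, variational, rng):
    """Return the vector fed to both heads, plus its KL penalty.

    variational=False is the deterministic control: no sampling, no KL term.
    variational=True samples z by reparameterisation and charges
    KL( N(mu, diag(exp(log_var))) || N(0, I) ), with N(0, I) assumed as the prior.
    """
    if not variational:
        return list(mu), 0.0
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return z, kl
```

Training both variants with identical heads, data, and hyperparameters would isolate the contribution of the latent-variable machinery from that of the auxiliary task alone.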

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the detailed review and the constructive suggestion to strengthen the isolation of the variational components. We address the major comment below.

Point-by-point responses
  1. Referee: The experimental evaluation (as summarized in the abstract) reports consistent outperformance and gains from unlabeled data but provides no ablation that replaces the variational latent generative model with a deterministic auxiliary language-modeling objective sharing the same parameters. Without this control, it is not possible to isolate whether improvements arise from the variational approximation and hierarchical latents or simply from the multitask word-prediction auxiliary task; this directly bears on the central claim that the variational latent-variable setup transfers useful information to the labeler.

    Authors: We agree that the requested ablation would help isolate whether gains arise specifically from the variational approximation and latent structure rather than the multitask auxiliary objective alone. The manuscript does not contain this control. The design of the model family centers on latent variables precisely to create a space into which the labeler can inject discriminative signal, with hierarchical configurations separating label-specific and word-specific factors; a purely deterministic auxiliary would lack this mechanism. Nevertheless, to directly address the concern and provide clearer evidence for the central claim, we will add the suggested ablation (a deterministic auxiliary LM sharing parameters with the labeler) to the revised manuscript.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical model comparisons

Full rationale

The paper defines a new family of multitask variational sequence labelers with explicit generative and discriminative components, explores latent variable configurations, and reports empirical performance on 8 public datasets. No derivation chain, uniqueness theorem, or prediction is claimed; the central results consist of direct experimental comparisons to baselines, with no fitted parameters renamed as predictions and no load-bearing self-citations. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond standard variational modeling assumptions.

pith-pipeline@v0.9.0 · 5628 in / 961 out tokens · 35185 ms · 2026-05-25T18:01:34.871528+00:00 · methodology

